Python Forum
Python requests.get() returns broken source code instead of expected source code?
#1
I made a request to a Wikipedia page. Specifically, I need to scrape the "results matrix" from https://en.wikipedia.org/wiki/2017%E2%80...ga#Results

selectedSeasonPage = requests.get('https://en.wikipedia.org/wiki/2017–18_La_Liga')
Running pprint.pprint(selectedSeasonPage.text) and jumping to the source of the matrix shows that it is incomplete.

Snippet of HTML returned by requests.get():

<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">— </td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>
When the HTML returned by requests.get() is viewed in a browser, it is, as expected, not complete. See this image for reference.

Snippet from view-source and the output needed:

<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">—</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">3–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–2</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">1–2</td>
I am posting sample HTML for reference, since posting the entire output is not possible. I can post more specific parts if required.

My question is: how do I get the entire source of the matrix without losing values?

From what I understand from previous questions, requests fails to return the expected output if part of the page is rendered by JavaScript. But this page seems to be simple HTML and CSS (at least the part that I need). I cannot use Selenium since I need to scrape multiple pages. I would be grateful for a solution using requests or something equivalent.
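One quick way to test the JavaScript hypothesis is to check whether a value that is visible in the browser also appears in the raw response body. The following is only a minimal sketch: the helper name appears_in_static_html and the sample string are made up for illustration, and with requests you would pass response.text as the first argument.

```python
import html


def appears_in_static_html(page_source: str, expected_text: str) -> bool:
    """Return True if expected_text occurs in the raw (un-rendered) HTML."""
    # Decode entities first, so that "3&ndash;1" and "3–1" compare equal.
    return expected_text in html.unescape(page_source)


# Hypothetical fragment standing in for response.text
sample = '<td style="white-space:nowrap;">3&ndash;1</td>'
print(appears_in_static_html(sample, "3–1"))
```

If the value is present in the raw text, JavaScript rendering is probably not the cause, and it is worth checking which URL was actually requested.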

Requests version is 2.19.1. Python version is 3.7.0.

Is anything missing? I am new to this, so any help is appreciated.

Cross posting from:
https://stackoverflow.com/questions/5242...ource-code
#2
I'm not getting the same results as you...

E:\Projects\etc>python
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> import bs4
>>> import pprint
>>> page = requests.get('https://en.wikipedia.org/wiki/2017-18_La_Liga')
>>> soup = bs4.BeautifulSoup(page.text)
>>> pprint.pprint(soup.select("table.wikitable")[5])
<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
<tbody><tr>
<th scope="col">Home \ Away
</th>
<th scope="col" width="28"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">ALA</a>
</th>
<th scope="col" width="28"><a href="/wiki/Athletic_Bilbao" title="Athletic Bilbao">ATH</a>
</th>
<th scope="col" width="28"><a href="/wiki/Atl%C3%A9tico_Madrid" title="Atlético Madrid">ATM</a>
</th>
<th scope="col" width="28"><a href="/wiki/FC_Barcelona" title="FC Barcelona">BAR</a>
</th>
<th scope="col" width="28"><a href="/wiki/RC_Celta_de_Vigo" title="RC Celta de Vigo">CEL</a>
</th>
<th scope="col" width="28"><span class="nowrap"><a href="/wiki/Deportivo_de_La_Coru%C3%B1a" title="Deportivo de La Coruña">DEP</a></span>
</th>
<th scope="col" width="28"><a href="/wiki/SD_Eibar" title="SD Eibar">EIB</a>
</th>
<th scope="col" width="28"><a href="/wiki/RCD_Espanyol" title="RCD Espanyol">ESP</a>
</th>
<th scope="col" width="28"><a href="/wiki/Getafe_CF" title="Getafe CF">GET</a>
</th>
<th scope="col" width="28"><a href="/wiki/Girona_FC" title="Girona FC">GIR</a>
</th>
<th scope="col" width="28"><a href="/wiki/UD_Las_Palmas" title="UD Las Palmas">LPA</a>
</th>
<th scope="col" width="28"><a href="/wiki/CD_Legan%C3%A9s" title="CD Leganés">LEG</a>
</th>
<th scope="col" width="28"><a href="/wiki/Levante_UD" title="Levante UD">LEV</a>
</th>
<th scope="col" width="28"><a href="/wiki/M%C3%A1laga_CF" title="Málaga CF">MGA</a>
</th>
<th scope="col" width="28"><a href="/wiki/Real_Betis" title="Real Betis">BET</a>
</th>
<th scope="col" width="28"><a href="/wiki/Real_Madrid_C.F." title="Real Madrid C.F.">RMA</a>
</th>
<th scope="col" width="28"><a href="/wiki/Real_Sociedad" title="Real Sociedad">RSO</a>
</th>
<th scope="col" width="28"><a href="/wiki/Sevilla_FC" title="Sevilla FC">SEV</a>
</th>
<th scope="col" width="28"><a href="/wiki/Valencia_CF" title="Valencia CF">VAL</a>
</th>
<th scope="col" width="28"><a href="/wiki/Villarreal_CF" title="Villarreal CF">VIL</a>
</th></tr>
<tr>
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a>
</th>
<td style="font-weight: normal;background-color:transparent;">—
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">3–1
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–1
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–2
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">1–2
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0

...etc
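To confirm that the cells of a fetched table are actually populated, the table HTML can be scanned for empty <td> elements. This is a rough sketch using only the standard library; CellCollector and empty_cell_count are hypothetical names, and the sample string is a shortened stand-in for the real table (for example str(soup.select("table.wikitable")[5]) from the session above).

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:   # tolerate a <td> with no enclosing <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Header cells (<th>) are ignored; only <td> text is kept.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def empty_cell_count(table_html: str) -> int:
    """Count <td> cells that contain no text."""
    parser = CellCollector()
    parser.feed(table_html)
    parser.close()
    return sum(1 for row in parser.rows for cell in row if not cell)


# Hypothetical, shortened table fragment
sample = (
    "<table><tr><th>Alavés</th>"
    "<td>3–1</td><td></td><td>0–2</td></tr></table>"
)
print(empty_cell_count(sample))
```

A result of 0 on a season that has finished suggests the matrix came back complete; a large count suggests a different page was fetched than the one intended.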
#3
Thank you for reading and taking the time to reply. After running the code you posted and reproducing the same output, I realized the bug was in another part of my script.
To summarize: I was getting the page links by scraping https://en.wikipedia.org/wiki/Category:La_Liga_seasons and using zip() to build a dictionary of season names and links to their wiki pages. zip() stops at the shorter input, so the longer list was truncated, and the wrong page link ended up associated with a season name key. Hence the wrong output on my side.
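For anyone hitting the same thing, here is a small sketch of how the truncation goes unnoticed and one way to guard against it. The lists and the helper pair_strict are invented for illustration and are not taken from my script. On Python 3.10 and later, zip(names, links, strict=True) raises an error on a length mismatch; on 3.7 an explicit length check does the same job.

```python
def pair_strict(names, links):
    """Pair names with links, refusing to silently drop unmatched items."""
    names, links = list(names), list(links)
    if len(names) != len(links):
        raise ValueError(
            f"length mismatch: {len(names)} names vs {len(links)} links"
        )
    return dict(zip(names, links))


# Hypothetical data: three season names but only two scraped links
seasons = ["2015–16", "2016–17", "2017–18"]
links = ["/wiki/A", "/wiki/B"]

# Plain zip() stops at the shorter list without any warning.
silent = dict(zip(seasons, links))
print(len(silent))  # the third season has been dropped

try:
    pair_strict(seasons, links)
except ValueError as exc:
    print("caught:", exc)
```

Failing loudly at this point would have shown that the two scraped lists did not line up before any season page was requested.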

Thank you so much again. I wasted so much time stuck there. This was a stupid mistake on my side, and I only realized it after I saw your reply above. The problem is solved.
#4
That's the way it works sometimes. We get so pigeonholed looking at something for too long, and the issue ends up being something completely unrelated.