Python requests.get() returns broken source code instead of expected source code?

FatalPythonError · Sep-20-2018, 06:05 PM

I made a request for a Wikipedia page. Specifically, I need to scrape the "results matrix" from https://en.wikipedia.org/wiki/2017%E2%80...ga#Results

selectedSeasonPage = requests.get('https://en.wikipedia.org/wiki/2017–18_La_Liga')

Printing the response with pprint.pprint(selectedSeasonPage.text) and jumping to the source of the matrix shows that it is incomplete.

Snippet of HTML returned by requests.get():

<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">— </td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>

Viewing the HTML returned by requests.get() in a browser confirms that it is incomplete.

Snippet from view-source, which is the output I need:

<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">—</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">3–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–2</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">1–2</td>

I am posting sample HTML for reference since posting the entire output is not possible. I can post more specific parts if required.

My question is: how do I get the entire source of the matrix without losing values?

From what I understand after going through previous questions, requests fails to return the expected output if some part of the page is rendered by JavaScript. But this page seems to be simple HTML and CSS (at least the part that I need). I cannot use Selenium since I need to scrape multiple pages. I would be grateful for a solution using requests or something equivalent.

Requests version is 2.19.1. Python version is 3.7.0.

Am I missing anything? I am new to this, so any help is appreciated.

Cross posting from:
https://stackoverflow.com/questions/5242...ource-code

**nilamo** · Sep-20-2018, 07:19 PM

I'm not getting the same results as you...

E:\Projects\etc>python
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> import bs4
>>> import pprint
>>> page = requests.get('https://en.wikipedia.org/wiki/2017-18_La_Liga')
>>> soup = bs4.BeautifulSoup(page.text)
>>> pprint.pprint(soup.select("table.wikitable")[5])
<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
<tbody><tr>
<th scope="col">Home \ Away
</th>
<th scope="col" width="28"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">ALA</a>
</th>
<th scope="col" width="28"><a href="/wiki/Athletic_Bilbao" title="Athletic Bilbao">ATH</a>
</th>
<th scope="col" width="28"><a href="/wiki/Atl%C3%A9tico_Madrid" title="Atlético Madrid">ATM</a>
</th>
<th scope="col" width="28"><a href="/wiki/FC_Barcelona" title="FC Barcelona">BAR</a>
</th>
<th scope="col" width="28"><a href="/wiki/RC_Celta_de_Vigo" title="RC Celta de Vigo">CEL</a>
</th>
<th scope="col" width="28"><span class="nowrap"><a href="/wiki/Deportivo_de_La_Coru%C3%B1a" title="Deportivo de La Coruña">DEP</a></span>
</th>
<th scope="col" width="28"><a href="/wiki/SD_Eibar" title="SD Eibar">EIB</a>
</th>
<th scope="col" width="28"><a href="/wiki/RCD_Espanyol" title="RCD Espanyol">ESP</a>
</th>
<th scope="col" width="28"><a href="/wiki/Getafe_CF" title="Getafe CF">GET</a>
</th>
<th scope="col" width="28"><a href="/wiki/Girona_FC" title="Girona FC">GIR</a>
</th>
<th scope="col" width="28"><a href="/wiki/UD_Las_Palmas" title="UD Las Palmas">LPA</a>
</th>
<th scope="col" width="28"><a href="/wiki/CD_Legan%C3%A9s" title="CD Leganés">LEG</a>
</th>
<th scope="col" width="28"><a href="/wiki/Levante_UD" title="Levante UD">LEV</a>
</th>
<th scope="col" width="28"><a href="/wiki/M%C3%A1laga_CF" title="Málaga CF">MGA</a>
</th>
<th scope="col" width="28"><a href="/wiki/Real_Betis" title="Real Betis">BET</a>
</th>
<th scope="col" width="28"><a href="/wiki/Real_Madrid_C.F." title="Real Madrid C.F.">RMA</a>
</th>
<th scope="col" width="28"><a href="/wiki/Real_Sociedad" title="Real Sociedad">RSO</a>
</th>
<th scope="col" width="28"><a href="/wiki/Sevilla_FC" title="Sevilla FC">SEV</a>
</th>
<th scope="col" width="28"><a href="/wiki/Valencia_CF" title="Valencia CF">VAL</a>
</th>
<th scope="col" width="28"><a href="/wiki/Villarreal_CF" title="Villarreal CF">VIL</a>
</th></tr>
<tr>
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a>
</th>
<td style="font-weight: normal;background-color:transparent;">—
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">3–1
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–1
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–2
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">1–2
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0

...etc
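
To compare two fetched pages without reading the markup by eye, one option is to count how many cells of the matrix come back empty. The following is only a minimal sketch: it uses the standard library's html.parser instead of bs4 so that it runs without extra installs, and the names CellCollector, count_empty_cells and SAMPLE are hypothetical, not part of the session above. The sample markup is a shortened version of the incomplete row posted in the question.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the <td> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a <td> is kept; <th> labels are ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def count_empty_cells(html):
    """Return the number of <td> cells that contain no text."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return sum(cell == "" for row in parser.rows for cell in row)


# Hypothetical sample: a shortened copy of the incomplete row from the question.
SAMPLE = """
<table class="wikitable plainrowheaders">
<tr><th scope="row">Alavés</th>
<td>—</td><td></td><td></td><td>2–1</td></tr>
</table>
"""

print(count_empty_cells(SAMPLE))  # 2
```

A page whose matrix has many empty cells is probably not the page you meant to fetch (for example, a season that had not been fully played), which points at the URL rather than at requests.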

FatalPythonError · (This post was last modified: Sep-21-2018, 11:16 AM by FatalPythonError.)

Thank you for reading and taking the time to reply. After successfully running the code you posted and reproducing the same output, I realized the bug was in another part of my script.
To summarize, I was getting the link to each page by scraping https://en.wikipedia.org/wiki/Category:La_Liga_seasons and using zip() to create a dictionary of season names and links to their wiki pages. zip() truncated the longer of the two lists, which resulted in the wrong page link being associated with a season name key. Hence the wrong output on my side.

Thank you so much again. I wasted a lot of time stuck there. This was a stupid mistake on my side, and I only realized it after I saw your reply above. The problem is solved.
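
For anyone hitting the same thing, here is a minimal sketch of how zip() silently drops items when its inputs differ in length, plus a length check that fails loudly instead. The lists and the helper name strict_pairs are made up for illustration and are not from my actual script. On Python 3.10 and later, zip(names, links, strict=True) raises ValueError by itself; on 3.7 an explicit check is needed.

```python
# Hypothetical data: three season names but only two scraped links.
season_names = ["2015–16 La Liga", "2016–17 La Liga", "2017–18 La Liga"]
season_links = ["/wiki/2015–16_La_Liga", "/wiki/2016–17_La_Liga"]

# zip() stops at the shorter input and raises nothing,
# so the last season simply disappears from the dictionary.
silent = dict(zip(season_names, season_links))
print(len(silent))                    # 2
print("2017–18 La Liga" in silent)    # False


def strict_pairs(names, links):
    """Build the name -> link dict, refusing inputs of unequal length."""
    if len(names) != len(links):
        raise ValueError(
            "got %d names but %d links" % (len(names), len(links))
        )
    return dict(zip(names, links))


try:
    strict_pairs(season_names, season_links)
except ValueError as exc:
    print("mismatch detected:", exc)
```

Checking the lengths only catches a missing or extra item. If the two lists are scraped separately, an item missing in the middle of one list can still shift every later pair, so taking the name and the link from the same element is safer when the page structure allows it.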

**nilamo** · Sep-21-2018, 02:46 PM

That's the way it works sometimes. We get so pigeonholed looking at something for too long, and the issue ends up being something completely unrelated.
