Python Forum

Full Version: Python requests.get() returns broken source code instead of expected source code?
I made a request for the Wikipedia page below. Specifically, I need to scrape the "results matrix" from https://en.wikipedia.org/wiki/2017%E2%80...ga#Results

selectedSeasonPage = requests.get('https://en.wikipedia.org/wiki/2017–18_La_Liga')
Running pprint.pprint(selectedSeasonPage.text) and jumping to the source code of the matrix shows that it is incomplete.

Snippet of HTML returned by requests.get():

<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">— </td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>
Viewing the HTML returned by requests.get() in a browser shows, as expected, that it is incomplete. You can check this image for reference.

Snippet from view-source, which is the output I need:

<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">—</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">3–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–2</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">1–2</td>
I am posting sample HTML for reference, since posting the entire output is not possible. I can post more specific parts if required.

My question is: how do I get the entire source of the matrix without losing values?

From what I understand after going through previous questions, requests fails to return the expected output if some part of the page is rendered by JavaScript. But this page seems to be simple HTML and CSS (at least the part that I need). I cannot use Selenium since I need to scrape multiple pages. I would be grateful for a solution using requests or something equivalent.

Requests version is 2.19.1. Python version is 3.7.0.

Is anything missing? I am new to this, so any help is appreciated.

Cross posting from:
https://stackoverflow.com/questions/5242...ource-code
I'm not getting the same results as you...

E:\Projects\etc>python
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> import bs4
>>> import pprint
>>> page = requests.get('https://en.wikipedia.org/wiki/2017-18_La_Liga')
>>> soup = bs4.BeautifulSoup(page.text)
>>> pprint.pprint(soup.select("table.wikitable")[5])
<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
<tbody><tr>
<th scope="col">Home \ Away
</th>
<th scope="col" width="28"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">ALA</a>
</th>
<th scope="col" width="28"><a href="/wiki/Athletic_Bilbao" title="Athletic Bilbao">ATH</a>
</th>
<th scope="col" width="28"><a href="/wiki/Atl%C3%A9tico_Madrid" title="Atlético Madrid">ATM</a>
</th>
<th scope="col" width="28"><a href="/wiki/FC_Barcelona" title="FC Barcelona">BAR</a>
</th>
<th scope="col" width="28"><a href="/wiki/RC_Celta_de_Vigo" title="RC Celta de Vigo">CEL</a>
</th>
<th scope="col" width="28"><span class="nowrap"><a href="/wiki/Deportivo_de_La_Coru%C3%B1a" title="Deportivo de La Coruña">DEP</a></span>
</th>
<th scope="col" width="28"><a href="/wiki/SD_Eibar" title="SD Eibar">EIB</a>
</th>
<th scope="col" width="28"><a href="/wiki/RCD_Espanyol" title="RCD Espanyol">ESP</a>
</th>
<th scope="col" width="28"><a href="/wiki/Getafe_CF" title="Getafe CF">GET</a>
</th>
<th scope="col" width="28"><a href="/wiki/Girona_FC" title="Girona FC">GIR</a>
</th>
<th scope="col" width="28"><a href="/wiki/UD_Las_Palmas" title="UD Las Palmas">LPA</a>
</th>
<th scope="col" width="28"><a href="/wiki/CD_Legan%C3%A9s" title="CD Leganés">LEG</a>
</th>
<th scope="col" width="28"><a href="/wiki/Levante_UD" title="Levante UD">LEV</a>
</th>
<th scope="col" width="28"><a href="/wiki/M%C3%A1laga_CF" title="Málaga CF">MGA</a>
</th>
<th scope="col" width="28"><a href="/wiki/Real_Betis" title="Real Betis">BET</a>
</th>
<th scope="col" width="28"><a href="/wiki/Real_Madrid_C.F." title="Real Madrid C.F.">RMA</a>
</th>
<th scope="col" width="28"><a href="/wiki/Real_Sociedad" title="Real Sociedad">RSO</a>
</th>
<th scope="col" width="28"><a href="/wiki/Sevilla_FC" title="Sevilla FC">SEV</a>
</th>
<th scope="col" width="28"><a href="/wiki/Valencia_CF" title="Valencia CF">VAL</a>
</th>
<th scope="col" width="28"><a href="/wiki/Villarreal_CF" title="Villarreal CF">VIL</a>
</th></tr>
<tr>
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a>
</th>
<td style="font-weight: normal;background-color:transparent;">—
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">3–1
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–1
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–2
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">1–2
</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0

...etc
Thank you for reading and taking the time to reply. After successfully executing the code you posted and reproducing the same output, I realized the bug was in another part of my script.
To summarize, I was getting the link to each page by scraping https://en.wikipedia.org/wiki/Category:La_Liga_seasons and using zip() to create a dictionary of season names and links to their wiki pages. Because the two lists had different lengths, zip() silently truncated the longer one, which resulted in the wrong page link being associated with a season name key. Hence the wrong output on my side.
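
For anyone who hits the same thing, here is a minimal sketch of the zip() behaviour described above. The season names, the links, and the pair_strict helper are all hypothetical and not taken from the original script; the sketch only shows that zip() stops at the shorter input without raising, and that a length check before building the dictionary is one way to catch the mismatch early.

```python
# Hypothetical data: the name list has one more entry than the link list,
# so the names and links are misaligned from the first element on.
season_names = ["2015-16 La Liga", "2016-17 La Liga", "2017-18 La Liga"]
season_links = ["/wiki/2016-17_La_Liga", "/wiki/2017-18_La_Liga"]

# zip() stops at the shorter input and raises nothing, so the last name
# is dropped and every remaining name gets the wrong link.
silently_paired = dict(zip(season_names, season_links))


def pair_strict(names, links):
    """Build a name -> link dict, refusing inputs of different lengths.

    Hypothetical helper for illustration only.
    """
    names, links = list(names), list(links)
    if len(names) != len(links):
        raise ValueError(
            f"length mismatch: {len(names)} names vs {len(links)} links"
        )
    return dict(zip(names, links))


if __name__ == "__main__":
    print(silently_paired)
    try:
        pair_strict(season_names, season_links)
    except ValueError as exc:
        print("caught:", exc)
```

On Python 3.10 and later, zip(names, links, strict=True) raises ValueError on a length mismatch by itself; the explicit check above is for older versions such as the 3.7.0 used in this thread.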

Thank you so much again. I wasted so much time stuck there. This was a stupid mistake on my side, and I only realized it after I saw your reply above. The problem is solved.
That's the way it works sometimes. We get so pigeonholed looking at something for too long, and the issue ends up being something completely unrelated.