Sep-20-2018, 06:05 PM
I made a request to the above Wikipedia page. Specifically, I need to scrape the "results matrix" from https://en.wikipedia.org/wiki/2017%E2%80...ga#Results
import pprint
import requests
selectedSeasonPage = requests.get('https://en.wikipedia.org/wiki/2017–18_La_Liga')
Doing
pprint.pprint(selectedSeasonPage.text)
and jumping to the source code of the matrix, it can be seen that it's incomplete.
Snippet of HTML returned by requests.get():
<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">— </td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>
The HTML returned by requests.get(), viewed through a browser, is incomplete as expected. Can check this image for reference.
Snippet from view-source and the output needed:
<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">—</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">3–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–2</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">1–2</td>
Posting a sample HTML for reference since posting entire output is not possible. Can post more specific parts if required.
My question is: how do I get the entire source of the matrix without losing values?
From what I understand from going through previous questions, requests fails to return the expected output if some part of the page is rendered by JavaScript. But this page seems to be simple HTML and CSS (at least the part that is required). I cannot use Selenium since I need to scrape multiple pages, so I would be grateful for a solution using requests or something equivalent.
Requests version is 2.19.1. Python version is 3.7.0.
Is anything missing? I am new to this stuff, so any help is appreciated.
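To make the difference between the two snippets above measurable, here is a minimal sketch that uses only the standard library's html.parser. The names CellCollector, cell_texts and empty_cell_count are made up for illustration, and the two sample strings are shortened versions of the snippets above with the style attributes dropped. It does not explain why the cells come back empty; it only counts them.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, keeping empty cells as ''."""

    def __init__(self):
        super().__init__()
        self.cells = []       # text of each completed <td>
        self._in_td = False   # True while inside a <td> element
        self._buf = []        # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._buf = []

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self.cells.append("".join(self._buf).strip())
            self._in_td = False


def cell_texts(html):
    """Return the stripped text of each <td> found in an HTML string."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


def empty_cell_count(html):
    """Count <td> cells that contain no text at all."""
    return sum(1 for text in cell_texts(html) if text == "")


# Shortened stand-ins for the two snippets (assumption: styles removed).
from_requests = "<td>— </td><td></td><td></td><td>2–1</td>"
from_view_source = "<td>—</td><td>3–1</td><td>0–1</td><td>2–1</td>"

print(cell_texts(from_requests))
print(empty_cell_count(from_requests), empty_cell_count(from_view_source))
```

The same helper could be fed selectedSeasonPage.text to count empty cells across the whole page, though cells that are legitimately empty on the page would be counted as well.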
Cross posting from:
https://stackoverflow.com/questions/5242...ource-code