How to read what's written in THIS specific page ?

How to read what's written in THIS specific page ? - pfpietro - Sep-13-2018

Hello everyone.
I've been struggling with this problem for a while. The only solution I've found is manually copying and pasting the contents of the following page into Python:

http://greyhoundbet.racingpost.com/#card/race_id=1638926&r_date=2018-09-13&tab=form

The information in the page is pretty simple.
But as you can see from the source code, the page is built with JavaScript or CSS or something, so I wasn't able to read it with:

from urllib.request import urlopen

link = "https://blablalbla"  # placeholder address
f = urlopen(link)            # open the URL
myfile = f.read()            # raw response body, as bytes
print(myfile)

I get the error:
File "C:\Program Files (x86)\Python37-32\lib\urllib\request.py", line 1319, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 11004] getaddrinfo failed>

So, to be clear:

What would you experienced programmers recommend for reading the numbers on that page into a string that I can process later?

Thank you very much!

RE: How to read what's written in THIS specific page ? - stranac - Sep-13-2018

In general, there are 2 common approaches to scraping a website that uses JavaScript:

1. Opening the page using an actual web browser (e.g. using Selenium)
2. Figuring out what the page is doing, and emulating that in your code

Both approaches are usable for your website.

The former requires less work, as you just load the website in a browser and deal with the resulting HTML.
The latter requires some digging, but it usually results in more efficient code, and it doesn't require you to run a full browser.
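
As a rough sketch of the browser-based approach, something like the following could work. It assumes Selenium 4 and a local Chrome install; fetch_rendered_html is a hypothetical helper, and the ready_selector argument is a CSS selector you would have to pick yourself by inspecting the rendered page, since none is given in this thread.

```python
def fetch_rendered_html(url, ready_selector, timeout=10):
    """Load url in headless Chrome and return the HTML after scripts have run.

    Hypothetical helper: ready_selector is a CSS selector for an element that
    only exists once the page has finished inserting its data.
    """
    # Imported here so the function can be defined without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one matching element exists (or time out).
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, ready_selector)
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned HTML can then be handed to whatever parser you prefer. The explicit wait matters because the page source right after loading may not yet contain the script-generated content.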

For this particular page, I used my browser's dev tools to find an XHR request that loads the data.
Knowing where the data comes from makes getting the information as simple as making a single request (using requests):

>>> import requests
>>> r = requests.get(
...     'http://greyhoundbet.racingpost.com/card/blocks.sd?race_id=1638926&r_date=2018-09-13&blocks=form',
...     headers={
...         'User-Agent': 'Mozilla/5.0',
...     }
... )
>>> data = r.json()
>>> [dog['dogName'] for dog in data['form']['dogs']]
['Lobors Ferrett', 'Cairns Cilla', 'Power Diva', 'Artic Image', 'Millbank Gem', 'Market Centre']
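
Part of why the original urlopen attempt could never see the data: everything after the # in the page URL is a fragment, which the browser keeps to itself and never sends to the server. A minimal sketch of turning such a page URL into the data URL used above might look like this. The helper names are hypothetical, and treating the tab value as the blocks value is only an assumption based on this one example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the request shown above; reusing it for other races
# is an assumption, not something confirmed by the site.
BLOCKS_URL = "http://greyhoundbet.racingpost.com/card/blocks.sd"


def blocks_url_from_page_url(page_url):
    """Build the data URL from a '#card/...' page URL (hypothetical helper)."""
    fragment = urlsplit(page_url).fragment   # text after '#'
    _, _, query = fragment.partition("/")    # drop the leading 'card/'
    params = {key: values[0] for key, values in parse_qs(query).items()}
    return BLOCKS_URL + "?" + urlencode({
        "race_id": params["race_id"],
        "r_date": params["r_date"],
        "blocks": params["tab"],  # assumption: tab name doubles as block name
    })


def dog_names(data):
    """Pull dog names out of a decoded response shaped like the one above."""
    return [dog["dogName"] for dog in data["form"]["dogs"]]


page_url = "http://greyhoundbet.racingpost.com/#card/race_id=1638926&r_date=2018-09-13&tab=form"
print(blocks_url_from_page_url(page_url))
```

For the page in question this prints the same URL that was requested above, so the result of r.json() can be passed straight to dog_names.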