How to read what's written in THIS specific page?
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development, Thread: /thread-12805.html
How to read what's written in THIS specific page? - pfpietro - Sep-13-2018

Hello everyone. I've been struggling with this problem for a while. The only solution I've found was manually copying and pasting the contents of the following page into Python: http://greyhoundbet.racingpost.com/#card/race_id=1638926&r_date=2018-09-13&tab=form

The information on the page is pretty simple. But as you can see from the source code, the page is built with JavaScript or CSS or something, so I wasn't able to read it with:

    from urllib.request import urlopen

    link = "https://blablalbla"
    f = urlopen(link)
    myfile = f.read()
    print(myfile)

I get the error:

    File "C:\Program Files (x86)\Python37-32\lib\urllib\request.py", line 1319, in do_open
        raise URLError(err)
    urllib.error.URLError: <urlopen error [Errno 11004] getaddrinfo failed>

So, to be clear: what would you experienced programmers recommend for reading the numbers on that page into a string that I could handle later? Thank you very much!

RE: How to read what's written in THIS specific page? - stranac - Sep-13-2018

In general, there are 2 common approaches to scraping a website that uses JavaScript:

1. Load the page in a real browser, driven from your script, and work with the HTML it produces.
2. Find out where the page's data comes from and request that data directly.
The former requires less work, as you just load the website in a browser and deal with the resulting HTML. The latter requires some digging, but it usually results in more efficient code, and it doesn't require you to run a full browser.

For this particular page, I used my browser's dev tools to find an XHR request that loads the data. Knowing where the data comes from makes getting the information as simple as making a single request (using requests):

    >>> import requests
    >>> r = requests.get(
    ...     'http://greyhoundbet.racingpost.com/card/blocks.sd?race_id=1638926&r_date=2018-09-13&blocks=form',
    ...     headers={
    ...         'User-Agent': 'Mozilla/5.0',
    ...     }
    ... )
    >>> data = r.json()
    >>> [dog['dogName'] for dog in data['form']['dogs']]
    ['Lobors Ferrett', 'Cairns Cilla', 'Power Diva', 'Artic Image', 'Millbank Gem', 'Market Centre']
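
If you would rather not install requests, the same single request can probably be made with urllib from the standard library, which is what the original question was already trying to use. The following is only a rough sketch: `build_url`, `extract_dog_names` and `fetch_dog_names` are hypothetical helper names, the `blocks="form"` default is simply copied from the one URL shown above, and it assumes the endpoint is still live and still returns JSON with the same `form` / `dogs` / `dogName` layout.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint copied from the XHR request found with the browser's dev tools.
BASE_URL = "http://greyhoundbet.racingpost.com/card/blocks.sd"


def build_url(race_id, r_date, blocks="form"):
    """Build the data URL for one race (hypothetical helper)."""
    query = urlencode({"race_id": race_id, "r_date": r_date, "blocks": blocks})
    return f"{BASE_URL}?{query}"


def extract_dog_names(data):
    """Return the dog names from the decoded JSON.

    Assumes the layout seen in the thread: data['form']['dogs'][i]['dogName'].
    """
    return [dog["dogName"] for dog in data["form"]["dogs"]]


def fetch_dog_names(race_id, r_date, timeout=10):
    """Download the race data and return the dog names.

    Sends a browser-like User-Agent header, as the requests example does.
    Assumes the server still answers this URL with JSON.
    """
    request = Request(
        build_url(race_id, r_date),
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urlopen(request, timeout=timeout) as response:
        data = json.loads(response.read().decode("utf-8"))
    return extract_dog_names(data)
```

Calling `fetch_dog_names(1638926, "2018-09-13")` should then give the same kind of list as the requests version, provided the site has not changed. Keeping `extract_dog_names` separate from the download means you can try it on a saved copy of the JSON without hitting the site again.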