Python Forum
Simple newbie Q - Printable Version




Simple newbie Q - Darkish - Mar-24-2018

I'm new to Python, so I hope this is really simple.

import bs4 as bs
import urllib.request

source = urllib.request.urlopen("https://www.premierleague.com/fixtures").read()
soup = bs.BeautifulSoup(source, 'lxml')

print(soup.get_text())

I thought this would return all the text on the page. However, it doesn't; it only returns the text in the header and footer of the page. That's quite different from what I got when following the video, which used a different page. What should I be doing differently to get the text in the main part of the page?

Many thanks


RE: Simple newbie Q - Larz60+ - Mar-24-2018

Here's a super simple example using requests (use it instead of urllib).

To get the requests package:
pip install requests

import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.premierleague.com/fixtures')
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'lxml')
    print(soup)
else:
    print('Problem downloading, status code: {}'.format(response.status_code))



RE: Simple newbie Q - wavic - Mar-24-2018

The page content is probably generated by JavaScript, so urllib can't do anything here: it only downloads the initial HTML and never runs the scripts. Test the script with a static webpage, this one for example: https://fishshell.com/docs/current/tutorial.html
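
To see why only the header and footer text comes back from a script-driven page, here is a minimal sketch using only the standard library's html.parser. The sample HTML, the loadFixtures name, and the helper names are made up for illustration; this is not the real markup of the Premier League site, just the general shape of a page whose main container is filled in later by a script.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the text a non-JavaScript client can see in the raw HTML."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample: header and footer are present in the HTML, but the
# main container is empty until a script fills it in inside a browser.
SAMPLE_HTML = """
<html><body>
<header>Premier League</header>
<main id="fixtures"></main>
<script>loadFixtures("#fixtures");</script>
<footer>Contact</footer>
</body></html>
"""

print(visible_text(SAMPLE_HTML))  # prints: Premier League Contact
```

If the text you want is missing from the downloaded HTML like this, no parser will find it; the scripts have to be run first.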


RE: Simple newbie Q - Darkish - Mar-24-2018

Thanks. The code didn't do the trick, so I will explore the link provided.


RE: Simple newbie Q - DeaD_EyE - Mar-25-2018

I tried it myself. It's difficult because the page is rendered by JavaScript, and the CSS selector I tried did not return any results.

There is a lib that uses BeautifulSoup but has a better API: https://github.com/kennethreitz/requests-html
It has the method html.render(), which should start a hidden Chromium instance in the background to run the JavaScript.
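
A rough sketch of how that could look with requests-html, assuming the package is installed (pip install requests-html). The function name fetch_rendered_text and the timeout value are my own choices for illustration, and I have not confirmed that this returns the fixtures from that particular page.

```python
def fetch_rendered_text(url, timeout=20):
    """Return the text of a page after its JavaScript has been run."""
    # Imported inside the function so that defining it does not require
    # requests-html to be installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        response.raise_for_status()
        # render() executes the page's scripts in a headless Chromium.
        # The first call downloads Chromium, so it can take a while.
        response.html.render(timeout=timeout)
        return response.html.text
    finally:
        session.close()
```

Calling print(fetch_rendered_text("https://www.premierleague.com/fixtures")) needs network access. If the content loads slowly, the wait and sleep arguments of render() may need to be raised.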