Python Forum
Simple newbie Q
#1
I'm new to Python, so I hope this is really simple.

import bs4 as bs
import urllib.request

source = urllib.request.urlopen("https://www.premierleague.com/fixtures").read()
soup = bs.BeautifulSoup(source, 'lxml')

print(soup.get_text())

I thought this would return all the text on the page. However, it only returns the text in the header and footer. That is very different from what I saw when following the video, which used a different page. What should I be doing differently to get the text in the main part of the page?

Many thanks
#2
Here's a super simple example using requests (use it instead of urllib.request).

To get the requests package:
pip install requests

import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.premierleague.com/fixtures')
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'lxml')
    print(soup)
else:
    print('Problem downloading status code: {}'.format(response.status_code))
#3
The page content is probably generated by JavaScript, so urllib only receives the initial HTML and can't see anything the scripts add later. Test the script with a static web page, this one for example: https://fishshell.com/docs/current/tutorial.html
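
To see the effect without touching the network, here is a minimal sketch. The HTML string, the fixture text, and the helper name visible_text are all made up for illustration, and it uses Python's built-in 'html.parser' so lxml isn't required. It mimics a page whose main section is empty until a script fills it in:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the fixtures <div> is empty in the HTML the server sends
# and is only filled in by JavaScript once a browser runs the script.
STATIC_HTML = """
<html><body>
<header>Premier League</header>
<div id="fixtures"></div>
<script>
document.getElementById("fixtures").textContent = "Arsenal v Chelsea";
</script>
<footer>Contact us</footer>
</body></html>
"""


def visible_text(html):
    """Return the text BeautifulSoup can see, ignoring script and style tags."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style elements so their source code doesn't count as text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(" ", strip=True)


text = visible_text(STATIC_HTML)
print(text)
```

Only the header and footer text comes out, because BeautifulSoup parses the HTML as delivered and never executes the script that would add the fixture.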
#4
Thanks. The code didn't do the trick, so I will explore the link provided.
#5
I tried it myself. It's difficult because the page is rendered by JavaScript, and the CSS selector I tried did not return any results.

There is a library that uses BeautifulSoup but has a nicer API: https://github.com/kennethreitz/requests-html
It has the method html.render(), which should start a hidden Chromium instance in the background to run the page's JavaScript.
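
A rough sketch of how that could look, assuming requests-html is installed (pip install requests-html) and that a Chromium download on the first render() call is acceptable. The function name rendered_text and the 20-second timeout are my own choices, and I haven't checked whether this particular site allows or survives this kind of scraping:

```python
def rendered_text(url, timeout=20):
    """Fetch url, let headless Chromium run its JavaScript, return the text."""
    # Third-party package; imported here so the function can be defined
    # even on a machine where requests-html isn't installed yet.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        # render() replaces response.html with the JavaScript-rendered version.
        response.html.render(timeout=timeout)
        return response.html.text
    finally:
        # Closes the HTTP session and the background browser, if one started.
        session.close()
```

Calling rendered_text("https://www.premierleague.com/fixtures") would then replace the urlopen(...).read() step from the first post. If the fixtures load slowly, render() also accepts sleep and wait arguments to give the scripts more time.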