Python Forum
How can I get the Middle English and Modern English from this page?
I want to get the text of this page because it contains both Middle English and Modern English, and my Middle English is not so good!

Using the code below, I do end up with the text, but most of the lines are empty. The output has 6766 lines!

That is, all the content of the p HTML tags is missing!

Also, when I check the webpage source code, the p HTML tags are all empty!

Very grateful for any tips!

import requests
from bs4 import BeautifulSoup

url = 'https://chaucer.fas.harvard.edu/pages/knights-tale-0'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
# string=True matches every text node (text=True is the older, deprecated spelling)
text = soup.find_all(string=True)

# Skip text whose parent is one of these non-content elements
blacklist = ['[document]', 'noscript', 'header', 'html', 'meta', 'head', 'input', 'script', 'style']

output = ''
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)
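
Since most of the printed lines are empty, here is a minimal sketch of one way to tidy the result before reading it. It assumes `output` is the string built by the loop above; `drop_blank_lines` is a hypothetical helper name and the sample string is made up for illustration. It only removes whitespace-only lines, so it will not bring back content that is missing from the p tags in the downloaded HTML.

```python
def drop_blank_lines(text):
    """Return text with whitespace-only lines removed and the others stripped."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:  # skip lines that are empty or contain only whitespace
            kept.append(stripped)
    return "\n".join(kept)


# Made-up sample standing in for the scraped `output` string.
sample = "first line\n\n   \n\n  second line  \n\n"
cleaned = drop_blank_lines(sample)
print(cleaned)
print(len(cleaned.splitlines()), "non-empty lines")
```

Calling `drop_blank_lines(output)` on the real result should show how much actual text was downloaded once the empty lines are gone.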

Messages In This Thread
How can I get the Middle English and Modern English from this page? - by Pedroski55 - Feb-02-2022, 04:45 AM
