How can I get the Middle English and Modern English from this page?

Pedroski55 · Feb-02-2022, 04:45 AM

I want to get the text of this page because it contains Middle English and Modern English and my Middle English is not so good!

The code below does give me the text, but most of the lines are empty, and the output has 6766 lines!

That is, all the content of the <p> tags is missing!

Also, when I check the webpage source code, the <p> tags are all empty there too!

Very grateful for any tips!

import requests
from bs4 import BeautifulSoup

url = 'https://chaucer.fas.harvard.edu/pages/knights-tale-0'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
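# Collect every text node in the document
# (string=True is the current name for the older text=True argument)
text = soup.find_all(string=True)

output = ''
# Skip text nodes whose parent tag is not visible page content
blacklist = ['[document]', 'noscript', 'header', 'html', 'meta', 'head', 'input', 'script', 'style']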

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)
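
One way to confirm the empty <p> tags is to count them directly instead of reading through thousands of printed lines. The sketch below is only a diagnostic under stated assumptions: ParagraphCounter, count_paragraphs and the sample string are hypothetical names and made-up data, not anything from the page, and it uses only the standard library parser so it runs without extra installs.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Count <p> elements and how many of them contain visible text."""

    def __init__(self):
        super().__init__()
        self.in_p = False       # True while a <p> is open
        self.has_text = False   # True once the open <p> has non-whitespace text
        self.total = 0
        self.non_empty = 0

    def _close_p(self):
        # Record the currently open <p>, if there is one
        if self.in_p:
            self.total += 1
            if self.has_text:
                self.non_empty += 1
        self.in_p = False
        self.has_text = False

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._close_p()     # an unclosed <p> ends where the next one begins
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == 'p':
            self._close_p()

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.has_text = True

    def close(self):
        super().close()
        self._close_p()         # count a <p> still open at end of input


def count_paragraphs(html):
    """Return (total <p> tags, <p> tags with text) for an HTML string."""
    parser = ParagraphCounter()
    parser.feed(html)
    parser.close()
    return parser.total, parser.non_empty


# Made-up sample standing in for the downloaded page
sample = '<div><p></p><p>some text</p><p> </p></div>'
total, non_empty = count_paragraphs(sample)
print(total, non_empty)  # prints: 3 1
```

To run it against the real page, count_paragraphs(requests.get(url).text) should work in place of the sample. If nearly every <p> comes back empty, the text is probably not in the HTML that requests receives at all, so no amount of filtering in BeautifulSoup would recover it; one possible explanation (an assumption, not verified here) is that the page fills those tags in afterwards with JavaScript.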
