Feb-02-2022, 04:45 AM
I want to get the text of this page because it contains Middle English and Modern English and my Middle English is not so good!
Using the code below, I do get text back, but most of the lines are empty. The output has 6766 lines!
That is, all the content of the <p> HTML tags is missing!
Also, when I check the webpage source code, the <p> tags are all empty there too!
Very grateful for any tips!
import requests
from bs4 import BeautifulSoup

url = 'https://chaucer.fas.harvard.edu/pages/knights-tale-0'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)

output = ''
blacklist = [
    '[document]', 'noscript', 'header', 'html', 'meta',
    'head', 'input', 'script', 'style']
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
print(output)
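One way to put a number on the "empty p tags" observation is to count how many <p> elements in the downloaded HTML actually contain text. The following is only a minimal sketch using the standard library's html.parser; the names ParagraphCounter and count_paragraphs and the sample string are made up for illustration and are not from the page in question. In practice the argument would be the decoded response body (for example res.text from the code above).

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Hypothetical helper: counts <p> elements with and without text."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.buffer = []
        self.empty = 0
        self.non_empty = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            # A new <p> implicitly closes a previous unclosed one.
            if self.in_p:
                self._finish_paragraph()
            self.in_p = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self._finish_paragraph()

    def handle_data(self, data):
        # Collect text only while inside a <p> element.
        if self.in_p:
            self.buffer.append(data)

    def _finish_paragraph(self):
        # Whitespace-only paragraphs count as empty.
        if ''.join(self.buffer).strip():
            self.non_empty += 1
        else:
            self.empty += 1
        self.in_p = False


def count_paragraphs(html):
    """Return (empty, non_empty) counts of <p> elements in an HTML string."""
    parser = ParagraphCounter()
    parser.feed(html)
    parser.close()
    return parser.empty, parser.non_empty


# Made-up sample input, not taken from the real page.
sample = '<div><p></p><p>some text</p><p> </p></div>'
print(count_paragraphs(sample))  # prints (2, 1)
```

If nearly every paragraph comes back empty, the text is not in the HTML that requests received, so no amount of filtering in the loop above would recover it.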