Python Forum

Full Version: How can I get the Middle English and Modern English from this page?
I want to get the text of this page because it contains Middle English and Modern English and my Middle English is not so good!

Using the code below, I end up with the text, but most lines are empty. It has 6766 lines!

That is, all the content of p html tags is missing!

Also, when I check the webpage source code, the p html tags are all empty!

Very grateful for any tips!

import requests
from bs4 import BeautifulSoup

url = 'https://chaucer.fas.harvard.edu/pages/knights-tale-0'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
# 'string' is the current name for the older, deprecated 'text' argument
text = soup.find_all(string=True)

output = ''
blacklist = [ '[document]', 'noscript',  'header', 'html', 'meta', 'head', 'input', 'script', 'style']

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

print(output)
(Feb-02-2022, 04:45 AM) Pedroski55 wrote: That is, all the content of p html tags is missing!
No, look carefully, not all of them are empty. I don't know what the empty P tags are for, but this one contains data you need. I am showing a fragment.
Output:
<p><span style="font-family:'book antiqua', palatino">859 <strong>Whilom, as olde stories tellen us,</strong></span><br /><span style="font-family:'book antiqua', palatino"> Once, as old histories tell us,</span><br /><span style="font-family:'book antiqua', palatino"> 860 <strong>Ther was a duc that highte Theseus;</strong></span><br /><span style="font-family:'book antiqua', palatino"> There was a duke who was called Theseus;</span><br /> ... <span style="font-family:'book antiqua', palatino"> 1000 <strong>But shortly for to telle is myn entente.</strong></span><br /><span style="font-family:'book antiqua', palatino"> But briefly to tell is my intent.</span></p>
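You need to look inside the P tags for SPAN tags with the attribute style="font-family:'book antiqua', palatino". Within those spans, the Middle English lines (the "olde storie") are wrapped in STRONG tags, while the Modern English lines are plain span text. I believe you skipped over this.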
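A minimal sketch of that idea, run against a fragment copied from the output above rather than the live page. The names SAMPLE, STYLE, middle and modern are only illustrative, and the sketch assumes every Middle English span contains a STRONG tag while every Modern English span does not.

```python
from bs4 import BeautifulSoup

# Fragment copied from the page source shown above (first two verse lines only)
SAMPLE = (
    "<p><span style=\"font-family:'book antiqua', palatino\">859 "
    "<strong>Whilom, as olde stories tellen us,</strong></span><br />"
    "<span style=\"font-family:'book antiqua', palatino\"> "
    "Once, as old histories tell us,</span><br />"
    "<span style=\"font-family:'book antiqua', palatino\"> 860 "
    "<strong>Ther was a duc that highte Theseus;</strong></span><br />"
    "<span style=\"font-family:'book antiqua', palatino\"> "
    "There was a duke who was called Theseus;</span></p>"
)
STYLE = "font-family:'book antiqua', palatino"

soup = BeautifulSoup(SAMPLE, "html.parser")
middle = []  # Middle English lines (text inside STRONG)
modern = []  # Modern English lines (plain span text)

for span in soup.find_all("span", attrs={"style": STYLE}):
    strong = span.find("strong")
    if strong is not None:
        # Assumption: a STRONG tag marks the Middle English line
        middle.append(strong.get_text(strip=True))
    else:
        modern.append(span.get_text(strip=True))

for m_line, e_line in zip(middle, modern):
    print(m_line, "|", e_line)
```

If the real page has spans that break this pattern (for example the Latin epigraph at the top), they would land in the modern list and need to be filtered out separately.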
You are right about the part at the top and the bottom. I get that text.

In between, if you look at the webpage source code, there are endless lines of html p tag pairs.

That's where the Middle English and Modern English goes. But it must be inserted by some kind of PHP, JavaScript or Ajax trick, because the text is not present in the webpage source code.

The webpage displays the Middle English and Modern English correctly, but it is protected somehow. It can't be a copyright issue; Chaucer wrote it more than six hundred years ago.

Maybe the Modern English translation is protected?

Wouldn't normally be a problem, I have the Middle English text, and a Modern English translation. The problem is, as far as I know there are about 6 existing original manuscripts, which are not exactly the same.

Last night, I was checking what I have against each other. Around line 700, things start to go wrong.

That's why this webpage Middle English / Modern English would be good.
(Feb-03-2022, 09:21 AM) Pedroski55 wrote: In between, if you look at the webpage source code, there are endless lines of html p tag pairs.
Indeed, I don't know what it is for, but it might be to fill up the left part of the page, under the menu, to have just as many lines as the right part, with the text.
(Feb-03-2022, 09:21 AM) Pedroski55 wrote: but it is protected somehow
No, there is no protection. I can just copy and paste the text.
I tried this:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://chaucer.fas.harvard.edu/pages/knights-tale-0")
soup = BeautifulSoup(page.content, 'html.parser')
for s_line in soup.find_all("span", attrs={"style": "font-family:'book antiqua', palatino"}, limit=20):
    print(s_line.text)
Output:
Iamque domos patrias, Sithice post aspera gentis prelia,
laurigero, etc.
[And now (Theseus drawing nigh his) native land in laurelled car after battling with the Scithian folk, etc.]
859        Whilom, as olde stories tellen us,
                Once, as old histories tell us,
860        Ther was a duc that highte Theseus;
                There was a duke who was called Theseus;
861        Of Atthenes he was lord and governour,
                He was lord and governor of Athens,
862        And in his tyme swich a conquerour
                And in his time such a conqueror
863        That gretter was ther noon under the sonne.
                That there was no one greater under the sun.
864        Ful many a riche contree hadde he wonne;
                Very many a powerful country had he won;
865        What with his wysdom and his chivalrie,
                What with his wisdom and his chivalry,
866        He conquered al the regne of Femenye,
                He conquered all the land of the Amazons,
867        That whilom was ycleped Scithia,
Is this a good start for you to continue?
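One possible way to continue from output like this is to pair each numbered Middle English line with the translation line that follows it. The sketch below uses only the standard library. The sample lines, the exact padding characters, and the helper name pair_lines are assumptions for illustration, and it assumes no translation line starts with a digit.

```python
import re

# Sample lines modelled on the printed output; the exact padding is an assumption
raw_lines = [
    "859\xa0\xa0\xa0\xa0Whilom, as olde stories tellen us,",
    "\xa0\xa0\xa0\xa0Once, as old histories tell us,",
    "860\xa0\xa0\xa0\xa0Ther was a duc that highte Theseus;",
    "\xa0\xa0\xa0\xa0There was a duke who was called Theseus;",
]


def pair_lines(lines):
    """Group lines into dicts holding line number, Middle English and Modern English."""
    pairs = []
    for line in lines:
        # str.split() with no argument also splits on non-breaking spaces
        text = " ".join(line.split())
        if not text:
            continue
        match = re.match(r"(\d+) (.+)", text)
        if match:
            # A leading number starts a new Middle English line
            pairs.append({"number": int(match.group(1)),
                          "middle": match.group(2),
                          "modern": ""})
        elif pairs and not pairs[-1]["modern"]:
            # The first unnumbered line after it is taken as the translation
            pairs[-1]["modern"] = text
    return pairs


pairs = pair_lines(raw_lines)
for p in pairs:
    print(p["number"], "|", p["middle"], "|", p["modern"])
```

Unnumbered lines that appear before the first numbered line, such as the Latin epigraph, are simply skipped by this approach.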
Thank you very much! This is a wonderful start for me to continue!

I didn't scroll to the right of the webpage source code, or I might have seen the text there!

A lot of pesky \xc2\xa0 and \xa0 characters, which I do not know how to get rid of, but the text is there!

Thanks a lot!!
Thanks again for your help.

Our boss decided, as we don't have too much to do, we should give a few lectures each term. So I thought, Chaucer would be good.

Had to deal with the Latin encoding: "00a0 is actually non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space."
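A small sketch of why both spellings show up: \xa0 is the character itself (U+00A0) in a Python string, while \xc2\xa0 is the same character seen as UTF-8 bytes. The helper name clean and the sample line are hypothetical.

```python
# U+00A0 (non-breaking space) written three equivalent ways
nbsp = "\u00a0"
assert nbsp == "\xa0" == chr(160)

# Its byte form depends on the encoding, which explains the two spellings
print(nbsp.encode("utf-8"))    # b'\xc2\xa0'
print(nbsp.encode("latin-1"))  # b'\xa0'


def clean(line: str) -> str:
    """Replace non-breaking spaces with ordinary spaces and trim both ends."""
    return line.replace("\xa0", " ").strip()


# Hypothetical sample; the real page may use a different amount of padding
sample = "859\xa0\xa0\xa0Whilom, as olde stories tellen us,"
cleaned = clean(sample)
print(cleaned)
```

Replacing with a space rather than an empty string keeps the line number separated from the verse text.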

This got me the text I wanted, all 4503 lines:

import requests
from bs4 import BeautifulSoup
 
url = 'https://chaucer.fas.harvard.edu/pages/knights-tale-0'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# collect the text of every verse span
lines = []
for s_line in soup.find_all("span", attrs={"style": "font-family:'book antiqua', palatino"}):
    lines.append(s_line.text)

# this does the job!
# '\xa0' is a non-breaking space (U+00A0), not an encoding problem;
# replace it with an ordinary space, as the advice quoted above suggests
newlines = []
for line in lines:
    newlines.append(line.replace('\xa0', ' ') + '\n')

textstring = ''.join(newlines)
path2text = '/home/pedro/summer2022/lectures/'
name = 'knights_tale.txt'
# write as UTF-8 explicitly so the result does not depend on the system default
with open(path2text + name, 'w', encoding='utf-8') as kt:
    kt.write(textstring)

print('All done!')
I am very grateful!