Posts: 212
Threads: 94
Joined: Aug 2018
Sep-02-2024, 05:38 PM
(This post was last modified: Sep-04-2024, 09:35 AM by Winfried.)
Hello,
I need BS to work on a book formatted as XHTML.
Each page is a <div>.
Within each page, I need to grab the footnotes, which can contain either just plain text or one or more <i> sub-elements.
The following code does grab the plain footnotes, but ignores those that contain italics. Why is that?
Thank you.
import re
from bs4 import BeautifulSoup as BS

with open("input.xhtml", mode='rb') as file:
    fileContent = file.read()
soup = BS(fileContent, features="xml")

"""
<div id="page14"><p>18 <i>Some chapter</i></p>
text body
<p>1. footnote</p>
<p>2. footnote <i>blah</i>, blah</p>
</div>
"""

# TODO: extract the page number only
divs = soup.find_all('div', id=re.compile(r"^page\d+$"))
for div in divs:
    # Why is a footnote ignored if it contains sub-elements,
    # e.g. "<p>4. Some note <i>some sub-element</i> Blah, 2003.</p>"?
    ps = div.find_all("p", string=re.compile(r"^\d+\. "))
    for p in ps:
        print(p.string)
Posts: 7,320
Threads: 123
Joined: Sep 2016
Sep-02-2024, 07:10 PM
(This post was last modified: Sep-02-2024, 07:10 PM by snippsat.)
See if this helps.
Look at CSS selectors, which is what I use here;
they're powerful, and many forget that BS supports them.
from bs4 import BeautifulSoup

html = """\
<div id="page14"><p>18 <i>Some chapter</i></p>
text body
<p>1. footnote</p>
<p>2. footnote <i>blah</i>, blah</p>
</div>"""

soup = BeautifulSoup(html, 'lxml')
p_tag = soup.select('div > p')
for tag in p_tag:
    print(tag.text)
Output:
18 Some chapter
1. footnote
2. footnote blah, blah
An example selecting only the <i> tags inside <p> tags.
from bs4 import BeautifulSoup

html = """\
<div id="page14"><p>18 <i>Some chapter</i></p>
text body
<p>1. footnote</p>
<p>2. footnote <i>blah</i>, blah</p>
</div>"""

soup = BeautifulSoup(html, 'lxml')
p_tag = soup.select('p i')
for tag in p_tag:
    print(tag.text)
Output:
Some chapter
blah
One specific tag.
>>> soup.select_one('p:nth-child(3)')
<p>2. footnote <i>blah</i>, blah</p>
>>> soup.select_one('p:nth-child(3)').text
'2. footnote blah, blah'
Posts: 212
Threads: 94
Joined: Aug 2018
Sep-03-2024, 11:29 AM
(This post was last modified: Sep-03-2024, 11:29 AM by Winfried.)
Thanks for the tip.
How can I grab both the footnotes that contain sub-elements (italics) and those that don't? I guess there are no other possible sub-elements within.
Doesn't the following only grab the text within the italics, ignoring 1) the parent's text and 2) footnotes with no italic sub-elements?
p_tag = div.select('p i')
for tag in p_tag:
    print(tag.text)
Posts: 212
Threads: 94
Joined: Aug 2018
This code:
ps = div.find_all("p", string=re.compile(r"^\d+\. "))
for p in ps:
    print(p.string)
misses footnotes such as these:
<p>1. <i>Blah</i>, t. 2. Blah p. 219.</p>
<p>2. Blah. <i>Blah,</i> Blah.
Blah, p. 106.</p>
Posts: 7,320
Threads: 123
Joined: Sep 2016
Sep-03-2024, 02:15 PM
(This post was last modified: Sep-03-2024, 02:15 PM by snippsat.)
(Sep-03-2024, 11:29 AM)Winfried Wrote: How can I grab both the footnotes that contain sub-elements (italics) and those that don't? I guess there are no other possible sub-elements within.
That should be what my first example already shows and does.
If I add a few more lines to the sample markup, it grabs all the text inside the <p> tags, including the text inside the <i> tags.
from bs4 import BeautifulSoup

html = """\
<div id="page14"><p>18 <i>Some chapter</i></p>
text body
<p>1. footnote</p>
<p>2. footnote <i>blah</i>, blah</p>
<p>3. Bus <i>green</i> route 5555</p>
<p>4. Some note <i>some sub-element</i> Blah, 2003.</p>
<p>5. Car <i>some sub-element</i> Blah, <i>999</i>.</p>
</div>"""

soup = BeautifulSoup(html, 'lxml')
p_tag = soup.select('div > p')
for tag in p_tag:
    print(tag.text)
Output:
18 Some chapter
1. footnote
2. footnote blah, blah
3. Bus green route 5555
4. Some note some sub-element Blah, 2003.
5. Car some sub-element Blah, 999.
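If only the numbered footnotes are wanted (and not the chapter heading), one possible variation is a sketch like this, reusing the soup object from the example above and testing the regex against each paragraph's full text:
import re

for tag in soup.select('div > p'):
    # get_text() joins the text of the tag and all its children,
    # so footnotes that contain <i> sub-elements are matched too
    if re.match(r"^\d+\. ", tag.get_text()):
        print(tag.get_text())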
Posts: 212
Threads: 94
Joined: Aug 2018
Sep-04-2024, 09:34 AM
(This post was last modified: Sep-04-2024, 09:35 AM by Winfried.)
Got it: I was using .string while you used .text. I thought they did the same thing and that one was simply deprecated, but they work differently: .string returns None as soon as a tag has more than one child (which is also why my find_all(string=...) filter skipped the footnotes containing <i>), while .text joins the text of all the descendants.
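A quick sketch of the difference, using one of the sample footnotes:
from bs4 import BeautifulSoup

p = BeautifulSoup("<p>2. footnote <i>blah</i>, blah</p>", 'lxml').p
print(p.string)  # None, because the <p> has several children
print(p.text)    # '2. footnote blah, blah'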
Anyhow, since I do need to keep italics in the output somehow, I'll just turn them into Markdown before parsing, and turn them back into HTML when writing the file back to disk.
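Roughly something like this, assuming the footnotes only ever contain <i> tags (the helper names are just placeholders):
import re

def italics_to_md(markup):
    # Turn <i>...</i> into Markdown asterisks before parsing,
    # so the italics survive as plain text in the extracted strings
    return re.sub(r"<i>(.*?)</i>", r"*\1*", markup, flags=re.DOTALL)

def md_to_italics(text):
    # Turn the *...* markers back into <i> tags when writing the file back to disk
    return re.sub(r"\*(.*?)\*", r"<i>\1</i>", text, flags=re.DOTALL)

print(italics_to_md("<p>2. footnote <i>blah</i>, blah</p>"))
# <p>2. footnote *blah*, blah</p>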
Thanks very much.