Hello,
I need BS (BeautifulSoup) to work on a book formatted as XHTML.
Each page is a <div>.
Within each page, I need to grab the footnotes, which can contain either just plain text or one or more <i> sub-elements.
The following code does grab the plain footnotes, but ignores those that contain italics. Why is that?
Thank you.
import re
from bs4 import BeautifulSoup as BS

with open("input.xhtml", mode='rb') as file:
    fileContent = file.read()
soup = BS(fileContent, features="xml")

"""
<div id="page14"><p>18 <i>Some chapter</i></p>
text body
<p>1. footnote</p>
<p>2. footnote <i>blah</i>, blah</p>
</div>
"""

# TODO: extract page number only
divs = soup.find_all('div', id=re.compile(r"^page\d+$"))
for div in divs:
    # Why is a <p> ignored if it contains sub-elements,
    # e.g. "<p>4. Some note <i>some sub-element</i> Blah, 2003.</p>"?
    ps = div.find_all("p", string=re.compile(r"^\d+\. "))
    for p in ps:
        print(p.string)
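For reference, a likely explanation: when `string=` is combined with a tag name, Beautiful Soup tests the pattern against each tag's `.string`, and `.string` is `None` for a tag that has more than one child (for example, text followed by an `<i>`). Below is a minimal sketch of one possible workaround, assuming it is acceptable to match on the full text returned by `get_text()`. The `footnotes` helper, the `SAMPLE` string, and the use of `html.parser` (chosen so the sketch does not need lxml) are illustrative assumptions and not part of the code above.

```python
import re
from bs4 import BeautifulSoup

# Sample page modelled on the snippet in the question (illustrative only).
SAMPLE = """
<div id="page14"><p>18 <i>Some chapter</i></p>
text body
<p>1. footnote</p>
<p>2. footnote <i>blah</i>, blah</p>
</div>
"""

FOOTNOTE = re.compile(r"^\d+\. ")


def footnotes(markup):
    """Return the text of every <p> whose full text starts like '1. '."""
    # Assumption: html.parser instead of features="xml", to avoid needing lxml.
    soup = BeautifulSoup(markup, "html.parser")
    notes = []
    for div in soup.find_all("div", id=re.compile(r"^page\d+$")):
        for p in div.find_all("p"):
            # get_text() joins the text of all descendants, so <i> children
            # no longer make the paragraph invisible to the pattern.
            text = p.get_text()
            if FOOTNOTE.match(text):
                notes.append(text)
    return notes


if __name__ == "__main__":
    for note in footnotes(SAMPLE):
        print(note)
```

With this approach the page header `18 Some chapter` is still skipped, because it has no dot after the number, while both footnotes are kept.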