Posts: 212
Threads: 94
Joined: Aug 2018
Sep-02-2024, 05:38 PM
(This post was last modified: Sep-04-2024, 09:35 AM by Winfried.)
Hello,
I need BS to work on a book formatted as XHTML.
Each page is a <div>.
Within each page, I need to grab the footnotes, which can contain either just plain text or one or more <i> sub-elements.
The following code does grab the plain footnotes, but ignores those that contain italics. Why is that?
Thank you.
import re
from bs4 import BeautifulSoup as BS

with open("input.xhtml", mode='rb') as file:
    fileContent = file.read()
soup = BS(fileContent, features="xml")

"""
<div id="page14"><p>18 <i>Some chapter</i></p>
text body
<p>1. footnote</p>
<p>2. footnote <i>blah</i>, blah</p>
</div>
"""

# TODO: extract the page number only
divs = soup.find_all('div', id=re.compile(r"^page\d+$"))
for div in divs:
    # Why is a footnote ignored if it contains sub-elements,
    # e.g. "<p>4. Some note <i>some sub-element</i> Blah, 2003.</p>"?
    ps = div.find_all("p", string=re.compile(r"^\d+\. "))
    for p in ps:
        print(p.string)
Posts: 7,320
Threads: 123
Joined: Sep 2016
Sep-02-2024, 07:10 PM
(This post was last modified: Sep-02-2024, 07:10 PM by snippsat.)
See if this helps.
Look at CSS selectors, which is what I use here;
they're powerful, and many forget that BS supports them.
from bs4 import BeautifulSoup

html = """\
<div id="page14"><p>18 <i>Some chapter</i></p>
text body
<p>1. footnote</p>
<p>2. footnote <i>blah</i>, blah</p>
</div>"""

soup = BeautifulSoup(html, 'lxml')
p_tag = soup.select('div > p')
for tag in p_tag:
    print(tag.text)
Output:
18 Some chapter
1. footnote
2. footnote blah, blah
An example selecting only the <i> tags inside <p> tags.
from bs4 import BeautifulSoup

html = """\
<div id="page14"><p>18 <i>Some chapter</i></p>
text body
<p>1. footnote</p>
<p>2. footnote <i>blah</i>, blah</p>
</div>"""

soup = BeautifulSoup(html, 'lxml')
p_tag = soup.select('p i')
for tag in p_tag:
    print(tag.text)
Output:
Some chapter
blah
One specific tag.
>>> soup.select_one('p:nth-child(3)')
<p>2. footnote <i>blah</i>, blah</p>
>>> soup.select_one('p:nth-child(3)').text
'2. footnote blah, blah'
Posts: 212
Threads: 94
Joined: Aug 2018
Sep-03-2024, 11:29 AM
(This post was last modified: Sep-03-2024, 11:29 AM by Winfried.)
Thanks for the tip.
How can I grab both the footnotes that contain sub-elements (italics) and those that don't? I guess there are no other possible sub-elements within.
Doesn't the following only grab the text within the italics, ignoring 1) the parent's text and 2) footnotes with no italic sub-elements?
p_tag = div.select('p i')
for tag in p_tag:
    print(tag.text)
Posts: 212
Threads: 94
Joined: Aug 2018
This code:
ps = div.find_all("p", string=re.compile(r"^\d+\. "))
for p in ps:
    print(p.string)
misses footnotes such as these:
<p>1. <i>Blah</i>, t. 2. Blah p. 219.</p>
<p>2. Blah. <i>Blah,</i> Blah.
Blah, p. 106.</p>
Posts: 7,320
Threads: 123
Joined: Sep 2016
Sep-03-2024, 02:15 PM
(This post was last modified: Sep-03-2024, 02:15 PM by snippsat.)
(Sep-03-2024, 11:29 AM)Winfried Wrote: How can I grab both the footnotes that contain sub-elements (italics) and those that don't? I guess there are no other possible sub-elements within.
That should be what my first example already shows and does.
If I add a few more lines to the sample markup, it grabs all the text inside the <p> tags, including the text inside the <i> tags.
from bs4 import BeautifulSoup

html = """\
<div id="page14"><p>18 <i>Some chapter</i></p>
text body
<p>1. footnote</p>
<p>2. footnote <i>blah</i>, blah</p>
<p>3. Bus <i>green</i> route 5555</p>
<p>4. Some note <i>some sub-element</i> Blah, 2003.</p>
<p>5. Car <i>some sub-element</i> Blah, <i>999</i>.</p>
</div>"""

soup = BeautifulSoup(html, 'lxml')
p_tag = soup.select('div > p')
for tag in p_tag:
    print(tag.text)
Output:
18 Some chapter
1. footnote
2. footnote blah, blah
3. Bus green route 5555
4. Some note some sub-element Blah, 2003.
5. Car some sub-element Blah, 999.
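If only the numbered footnotes are wanted (and not the chapter heading), one possible variation is a sketch like this, reusing the soup object from the example above and testing the regex against each paragraph's full text:
import re

for tag in soup.select('div > p'):
    # get_text() joins the text of the tag and all its children,
    # so footnotes that contain <i> sub-elements are matched too
    if re.match(r"^\d+\. ", tag.get_text()):
        print(tag.get_text())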
Posts: 212
Threads: 94
Joined: Aug 2018
Sep-04-2024, 09:34 AM
(This post was last modified: Sep-04-2024, 09:35 AM by Winfried.)
Got it: I was using .string while you used .text. I thought they did the same thing and that one was simply deprecated, but they work differently: .string returns None as soon as a tag has more than one child (which is also why my find_all(string=...) filter skipped the footnotes containing <i>), while .text joins the text of all the descendants.
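A quick sketch of the difference, using one of the sample footnotes:
from bs4 import BeautifulSoup

p = BeautifulSoup("<p>2. footnote <i>blah</i>, blah</p>", 'lxml').p
print(p.string)  # None, because the <p> has several children
print(p.text)    # '2. footnote blah, blah'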
Anyhow, since I do need to keep italics in the output somehow, I'll just turn them into Markdown before parsing, and turn them back into HTML when writing the file back to disk.
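Roughly something like this, assuming the footnotes only ever contain <i> tags (the helper names are just placeholders):
import re

def italics_to_md(markup):
    # Turn <i>...</i> into Markdown asterisks before parsing,
    # so the italics survive as plain text in the extracted strings
    return re.sub(r"<i>(.*?)</i>", r"*\1*", markup, flags=re.DOTALL)

def md_to_italics(text):
    # Turn the *...* markers back into <i> tags when writing the file back to disk
    return re.sub(r"\*(.*?)\*", r"<i>\1</i>", text, flags=re.DOTALL)

print(italics_to_md("<p>2. footnote <i>blah</i>, blah</p>"))
# <p>2. footnote *blah*, blah</p>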
Thanks very much.