Python Forum
How to get the first child of a BeautifulSoup document?
#1
Hi

I'm a little disappointed with how BeautifulSoup handles a tree.

>>> from bs4 import BeautifulSoup as btfs
>>> soup=btfs('', 'html5lib')
>>> soup
<html><head></head><body></body></html>
>>> for elem in soup.children:
...     print(elem)
... 
<html><head></head><body></body></html>
>>> 
In the preceding example, I would expect soup to have <html></html> as its single child rather than <html><head></head><body></body></html>.

Apparently, using a parser other than html5lib makes no difference.

So, how can I get <html></html> as a child of soup?

Arbiel
using Ubuntu 18.04.4 LTS, Python 3.8
having replaced google with «https://www.lilo.org/fr/», any other unsafe mail service with «https://protonmail.com/», and azerty with bépo (French keyboard layouts)
#2
There are several ways to do this; read the docs.
You have element.select(), which takes a CSS selector.
Tags can also be referred to by dotted access, e.g. div.a.get('href'),
and there are other ways too.
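To make the two approaches above concrete, here is a minimal sketch. The markup and URLs are invented for the illustration, and 'html.parser' is assumed only because it needs no extra install; neither comes from the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example.
html = (
    '<div>'
    '<a href="https://example.com/one">one</a>'
    '<a href="https://example.com/two">two</a>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list of matching tags.
links = soup.select('div > a')
hrefs = [a.get('href') for a in links]

# Dotted access walks down to the first matching descendant tag.
first_href = soup.div.a.get('href')

print(hrefs)
print(first_href)
```

Dotted access only ever reaches the first tag with that name, so select() or find_all() is the better fit when several tags may match.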
#3
The docs cover this under .contents and .children.

Here is a working test setup.
from bs4 import BeautifulSoup

html = '''\
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>The HTML5</title>
</head>
<body>
  <tag1>
    <tag2 name="tag2">
      <tag3 name="Korea">Closed</tag3><br>
      <tag3 name="China">A Big contry</tag3><br>
      <tag3 name="Japan">Nippon</tag3><br>
    </tag2>
  </tag1>
</body>
</html>'''

soup = BeautifulSoup(html, 'lxml')
Usage test:
>>> for elem in soup.children:
...     print(elem)
...     
html
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>The HTML5</title>
</head>
<body>
<tag1>
<tag2 name="tag2">
<tag3 name="Korea">Closed</tag3><br/>
<tag3 name="China">A Big contry</tag3><br/>
<tag3 name="Japan">Nippon</tag3><br/>
</tag2>
</tag1>
</body>
</html>
So it works as expected when you use .children on the soup object.

Find a tag, then use .children on it.
>>> tag_1 = soup.find('tag1')
>>> for elem in tag_1.children:
...     print(elem)
... 

<tag2 name="tag2">
<tag3 name="Korea">Closed</tag3><br/>
<tag3 name="China">A Big contry</tag3><br/>
<tag3 name="Japan">Nippon</tag3><br/>
</tag2>
I can't remember when I last used .children, so it's not something I use much.
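The blank line at the top of the tag_1 output above is itself a child: whitespace between tags is kept as a text node. A small sketch of how .contents, .children and a non-recursive find() differ, assuming 'html.parser' and a cut-down version of the test markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Cut-down markup for illustration; the newlines are deliberate.
html = '<tag1>\n<tag2 name="tag2">x</tag2>\n</tag1>'
soup = BeautifulSoup(html, 'html.parser')
tag_1 = soup.find('tag1')

# .contents is a list of direct children, text nodes included.
all_children = tag_1.contents

# .children iterates over the same nodes; keep only real tags here.
tag_children = [c for c in tag_1.children if isinstance(c, Tag)]

# find(True, recursive=False) returns the first direct child that is a tag.
first_child_tag = tag_1.find(True, recursive=False)

print(len(all_children), len(tag_children), first_child_tag.name)
```

Filtering on Tag, or using find() with recursive=False, is one way to get "the first child" without tripping over whitespace strings.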

One more, using a CSS selector, which is powerful and often forgotten.
>>> find_china = soup.select_one('tag3:nth-child(3)')
>>> find_china
<tag3 name="China">A Big contry</tag3>
>>> find_china.text
'A Big contry'
#4
Hi

Thanks to both of you, Larz60+ and snippsat.

I think I'm beginning to understand a little better. I got confused by what I remembered from reading the W3Schools pages on HTML, XML and so on. The concept of a DOM node is, to my understanding, a bit different from the concept of a BeautifulSoup tag.

In terms of DOM nodes, in the markup "<html><head/><body/></html>", the html node does not contain "<head/><body/>"; rather, both of them are its children. At least, this is how I understood the W3Schools pages.
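A minimal sketch of that distinction as it looks in BeautifulSoup, assuming 'html.parser' (chosen only to avoid extra dependencies): head and body are children of the html tag, and printing a tag simply serialises the tag together with everything below it. Copying the tag and emptying the copy is one possible way to see the html tag on its own.

```python
import copy
from bs4 import BeautifulSoup

# Assumption: 'html.parser' stands in for the parsers used in the thread.
soup = BeautifulSoup('<html><head></head><body></body></html>', 'html.parser')

# The soup object has a single child, the html tag.
top_level = [child.name for child in soup.children]

# head and body are children of html, not of soup.
html_children = [child.name for child in soup.html.children]

# Copy the tag and empty the copy, leaving the original tree untouched.
bare_html = copy.copy(soup.html)
bare_html.clear()

print(top_level)
print(html_children)
print(bare_html)
```

So the tree structure matches the DOM view; only the printed form of a tag differs, because it includes the tag's descendants.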

The BeautifulSoup documentation is more of a tutorial, with a lot of examples. It contains a great deal of information, but some of it is scattered throughout the document. For example, setting attributes as if they were dictionary entries is mentioned near the beginning. By the time one wants to add an attribute to a tag, one may well have forgotten this possibility and go looking further down the document for how to do it.
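For reference, a short sketch of the dictionary-style attribute access mentioned above. The markup and attribute names are invented for the example, and 'html.parser' is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document for illustration.
soup = BeautifulSoup('<p id="intro">Hello</p>', 'html.parser')
p = soup.p

# Read an attribute like a dictionary entry.
original_id = p['id']

# Add a new attribute by assignment.
p['lang'] = 'en'

# get() returns None instead of raising KeyError for a missing attribute.
missing = p.get('title')

print(original_id, missing)
print(p)
```

All attributes are also available together as a dictionary through p.attrs.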

To make things clearer still, I would like to have access to some sort of BeautifulSoup reference manual: a document that defines the objects, and lists and explains the methods and the arguments used to call them. I did not find such a document in the BeautifulSoup pages.

Do you have any idea where I could find one?

Arbiel