Python Forum

Hi

I'm a little disappointed with how BeautifulSoup handles a tree.

>>> from bs4 import BeautifulSoup as btfs
>>> soup=btfs('', 'html5lib')
>>> soup
<html><head></head><body></body></html>
>>> for elem in soup.children:
...     print(elem)
... 
<html><head></head><body></body></html>
>>>

In the preceding example, I would expect soup to have a single child, <html></html>, rather than <html><head></head><body></body></html>.

Apparently, using a parser other than html5lib does not make any difference.

So, how can I get <html></html> as a child of soup?

Arbiel

There are several ways to do this. Read the docs.
You have element.select, which takes a CSS selector.
Tags can be referred to with dotted access, e.g. div.a.get('href'),
and others.
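
To make the two approaches above concrete, here is a minimal sketch. The markup is a made-up sample (not from this thread), and 'html.parser' is chosen only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<div><a href="https://example.com/page">link</a><p class="note">text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Dotted access walks to the first matching descendant tag;
# .get() reads an attribute and returns None if it is missing.
href = soup.div.a.get('href')

# select() takes a CSS selector and returns a list of matching tags.
notes = soup.select('div > p.note')
note_texts = [p.get_text() for p in notes]

print(href)
print(note_texts)
```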

The docs cover this under .contents and .children.

Here is a working test setup.

from bs4 import BeautifulSoup

html = '''\
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>The HTML5</title>
</head>
<body>
  <tag1>
    <tag2 name="tag2">
      <tag3 name="Korea">Closed</tag3><br>
      <tag3 name="China">A Big contry</tag3><br>
      <tag3 name="Japan">Nippon</tag3><br>
    </tag2>
  </tag1>
</body>
</html>'''

soup = BeautifulSoup(html, 'lxml')

Usage test:

>>> for elem in soup.children:
...     print(elem)
...     
html
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>The HTML5</title>
</head>
<body>
<tag1>
<tag2 name="tag2">
<tag3 name="Korea">Closed</tag3><br/>
<tag3 name="China">A Big contry</tag3><br/>
<tag3 name="Japan">Nippon</tag3><br/>
</tag2>
</tag1>
</body>
</html>

So it works as expected when you use children on the soup object.

Find a tag, then use children on it.
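
Regarding the original question, a rough sketch of what is going on: each child is a Tag object, and printing it renders the tag together with everything inside it. If only the empty <html></html> is wanted, one option (my suggestion, not something from the docs for this exact case) is to build a new empty tag with the same name and attributes. 'html.parser' is used here just to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><head></head><body></body></html>', 'html.parser')

# soup has a single child: the <html> Tag object.
names = [elem.name for elem in soup.children]

html_tag = soup.html
# str() (and print) serialize the tag including all of its descendants.
full = str(html_tag)

# A fresh tag with the same name and attributes has no children,
# so it renders as an empty element. It is not attached to the tree.
shell = soup.new_tag(html_tag.name, attrs=dict(html_tag.attrs))

print(names)
print(full)
print(shell)
```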

>>> tag_1 = soup.find('tag1')
>>> for elem in tag_1.children:
...     print(elem)
... 

<tag2 name="tag2">
<tag3 name="Korea">Closed</tag3><br/>
<tag3 name="China">A Big contry</tag3><br/>
<tag3 name="Japan">Nippon</tag3><br/>
</tag2>

I don't remember when I last used children, so it's not something I use much.

One more, using a CSS selector, which is powerful and often forgotten.

>>> find_china = soup.select_one('tag3:nth-child(3)')
>>> find_china
<tag3 name="China">A Big contry</tag3>
>>> find_china.text
'A Big contry'
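
The nth-child selector depends on the position of the tag among its siblings. As a sketch of an alternative that does not depend on position, an attribute selector can match on the name attribute directly. The markup below is a shortened copy of the test setup, and 'html.parser' is an assumption made to avoid needing lxml.

```python
from bs4 import BeautifulSoup

html = ('<tag2 name="tag2">'
        '<tag3 name="Korea">Closed</tag3><br>'
        '<tag3 name="China">A Big contry</tag3><br>'
        '<tag3 name="Japan">Nippon</tag3><br>'
        '</tag2>')
soup = BeautifulSoup(html, 'html.parser')

# CSS attribute selector: match <tag3> whose name attribute equals "China".
china = soup.select_one('tag3[name="China"]')

# Same lookup with find(). The attrs dict is needed because the keyword
# "name" is already taken by find() for the tag name.
same = soup.find('tag3', attrs={'name': 'China'})

print(china.text)
print(china is same)
```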

Hi

Thanks to both of you, Larz60+ and snippsat.

I think I am beginning to understand a little better. I got confused by what I remembered from reading the W3Schools pages on HTML, XML and so on. The concept of a DOM node is, to my understanding, a bit different from the concept of a BeautifulSoup tag.

In terms of DOM nodes, in the markup "<html><head/><body/></html>", the html node does not contain the text "<head/><body/>"; rather, head and body are its children. At least, this is how I understood the W3Schools pages.
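
As far as I can tell, a BeautifulSoup tag behaves the same way in this respect: head and body are child objects of html, and the full markup only appears when the tag is printed. A small sketch to check this, assuming 'html.parser' as the parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><head></head><body></body></html>', 'html.parser')
html_tag = soup.html

# The children of <html> are Tag objects, not a piece of text.
child_names = [child.name for child in html_tag.children]

# Each child points back to <html> as its parent,
# and <html> itself has the soup object as its parent.
head_parent_is_html = soup.head.parent is html_tag
body_parent_is_html = soup.body.parent is html_tag
html_parent_is_soup = html_tag.parent is soup

print(child_names)
print(head_parent_is_html, body_parent_is_html, html_parent_is_soup)
```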

The BeautifulSoup documentation is more of a tutorial, with a lot of examples. It contains much information, but some of it is scattered throughout the document. For example, setting attributes as if they were entries of a dictionary is mentioned close to the beginning of the document. When one wants to add an attribute to a tag, one may well have forgotten this possibility and look further down in the document to find out how to do it.

To make things still clearer, I would appreciate having access to a sort of BeautifulSoup reference manual: a document which defines the objects, and lists and explains the methods and the arguments used to call them. I did not find such a document on the BeautifulSoup pages.

Do you have any idea where I could find it?
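
For the record, this is the dictionary-style attribute handling I mean, as a minimal sketch. The <p> tag and the attribute values are invented for the example, and 'html.parser' is assumed as the parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello</p>', 'html.parser')
p = soup.p

# Attributes are set like dictionary entries.
p['id'] = 'greeting'
# A multi-valued attribute such as class can be given as a list.
p['class'] = ['intro', 'lead']

# Reading works the same way; .attrs exposes the underlying dict.
current_id = p['id']
attrs_snapshot = dict(p.attrs)

# del removes an attribute again.
del p['class']

print(current_id)
print(attrs_snapshot)
print(p.attrs)
```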

Arbiel
