Python Forum

Parsing XML with lxml
Hello,
I need to parse an XML file.
For example, my XML file looks like this:
<root>
 <child1 attr1="value1" attr2="value2"/>
 <child2>
   <child3>
    some text
   </child3>
 </child2>
</root>
And when I use this code:

from sys import stdout

walkAll = root.getiterator()
for elt in walkAll:
    if elt.attrib:
        # Opening tag followed by its attributes (never closed with '>').
        stdout.write('<%s ' % elt.tag)
        for name, value in elt.attrib.items():
            attributes = ' {0:s} = "{1:s}"'.format(name, value)
            stdout.write(attributes)
    else:
        print("<%s>" % elt.tag)

    # Elements without text get no closing tag at all.
    if elt.text is None:
        continue

    print("</%s>" % elt.tag)
My output looks like:
Output:
<root> </root> <child1 attr1= "value1" attr2 = "value2"<child2> </child2> <child3> some text </child3>
And I want it to look like this, without converting it to a string.
Output:
<root> <child1 attr1="value1" attr2="value2"/> <child2> <child3> some text </child3> </child2> </root>
I don't know what library you're using, but I don't think getiterator() is what you want here. It looks like it gives you every element in the document as one flat sequence, so the nesting is lost and each closing tag is printed right after its opening tag. To format it the way you want, you really only want to handle one node at a time and recursively process its children.
Can you suggest what to use instead?
I'm using lxml's etree at the moment.
Here's an example of a recursive function, which parses each node's children. Handling attributes and formatting is something I'll leave up to you :)

>>> doc = '''<root>
...  <child1 attr1="value1" attr2="value2"/>
...  <child2>
...    <child3>
...     some text
...    </child3>
...  </child2>
... </root>'''
>>> def parse(node):
...   print(f"<{node.tag}>")
...   for child in node:
...     parse(child)
...   print(f"</{node.tag}>")
...
>>> from lxml import etree
>>> root = etree.XML(doc)
>>> parse(root)
<root>
<child1>
</child1>
<child2>
<child3>
</child3>
</child2>
</root>
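For anyone who wants the attributes and text filled in, here is one possible way to extend the recursive function above. It is only a sketch under a few assumptions: `render` and `DOC` are made-up names, the standard library's `xml.etree.ElementTree` stands in for lxml so the snippet runs without extra installs (lxml elements expose the same `tag`, `attrib` and `text`), tail text after a closing tag is ignored, and attribute values are not escaped.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

DOC = """<root>
 <child1 attr1="value1" attr2="value2"/>
 <child2>
   <child3>
    some text
   </child3>
 </child2>
</root>"""


def render(node, depth=0, out=None):
    """Collect indented lines for node and its descendants (hypothetical helper)."""
    if out is None:
        out = []
    pad = "  " * depth
    # Build ' name="value"' pairs; empty string when there are no attributes.
    attrs = "".join(f' {name}="{value}"' for name, value in node.attrib.items())
    text = (node.text or "").strip()
    if len(node) == 0 and not text:
        # No children and no text: emit a self-closing tag.
        out.append(f"{pad}<{node.tag}{attrs}/>")
        return out
    out.append(f"{pad}<{node.tag}{attrs}>")
    if text:
        out.append(f"{pad}  {text}")
    # Recurse into the children before writing the closing tag.
    for child in node:
        render(child, depth + 1, out)
    out.append(f"{pad}</{node.tag}>")
    return out


root = ET.fromstring(DOC)
lines = render(root)
print("\n".join(lines))
```

The closing tag is written only after all children have been handled, which is what keeps child2 and child3 nested inside root instead of being printed one after another.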
Thank you. :)
It was really helpful.