Python Forum
How to display XML tree structure with Python? - Printable Version


How to display XML tree structure with Python? - sonicblind - Nov-10-2017

Hi,

I have a large multi-level XML document of a complicated structure, without any namespace definition.
I would like to generate a simplified tree view of its structure, so that every possible element from the XML is shown and only once.

As a simplified example take this XML:

<data>
	<timestamp>...</timestamp>
	<people>
		<person>
			<name>...</name>
			<age>...</age>
		</person>
		<person>
			<name>...</name>
			<age>...</age>
			<degree />
		</person>
		<person>
			<name>...</name>
			<age>...</age>
			<degree />
			<siblings>
				<brother>...</brother>
				<brother>...</brother>
				<sister>...</sister>
			</siblings>			
		</person>
	</people>
	<cities>
		<city>
			<name>...</name>
			<country>...</country>
			<continent>...</continent>
			<capital />
		</city>
		<city>
			<name>...</name>
			<country>...</country>
			<continent>...</continent>
		</city>
	</cities>
</data>
Using Python I would like to generate a view of its structure, looking something like this:

-data-
	-timestamp-
	-people-
		-person-
			-name-
			-age-
			-degree-
			-siblings-
				-brother-
				-sister-
	-cities-
		-city-
			-name-
			-country-
			-continent-
			-capital-
So, basically I am not interested in the values, or how many elements of the same type are in the XML, etc.
I only want to see which elements are in there.

I know there might be visual tools to achieve this, but I need to be able to generate such a tree view directly inside a Python script.

Thanks for any ideas.


RE: How to display XML tree structure with Python? - Larz60+ - Nov-10-2017

lxml's etree has a pretty_print option (etree.tostring(root, pretty_print=True)), see: http://lxml.de/api.html


RE: How to display XML tree structure with Python? - sonicblind - Nov-10-2017

prettyprint does not help me as it shows everything as is in the XML. That's exactly what I want to avoid. I need no duplicates and no values or attributes. Only the very basic tree structure.


RE: How to display XML tree structure with Python? - Larz60+ - Nov-10-2017

You can look here to see what's available as packages: https://pypi.python.org/pypi?%3Aaction=search&term=xml&submit=search
You may have to write it yourself if you can't find what you're looking for.
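
For anyone who does end up writing it by hand, here is a minimal sketch using only the standard library's xml.etree.ElementTree. The helper names merge_structure and render are made up for illustration, and the XML is a shortened version of the sample from the first post. Treat it as one possible approach, not the definitive one.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample XML from the first post.
XML = """\
<data>
    <timestamp>...</timestamp>
    <people>
        <person><name>...</name><age>...</age></person>
        <person><name>...</name><age>...</age><degree /></person>
    </people>
</data>
"""


def merge_structure(element, structure):
    """Record element's tag in `structure` (a nested dict), merging repeats."""
    children = structure.setdefault(element.tag, {})
    for child in element:
        merge_structure(child, children)


def render(structure, depth=0):
    """Return one '-tag-' line per unique element, tab-indented by depth."""
    lines = []
    for tag, children in structure.items():
        lines.append("\t" * depth + "-" + tag + "-")
        lines.extend(render(children, depth + 1))
    return lines


structure = {}
merge_structure(ET.fromstring(XML), structure)
tree_lines = render(structure)
print("\n".join(tree_lines))
```

Merging into a nested dict keeps every tag under its real parent, even when that tag first shows up in a later sibling (like degree in the second person here).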


RE: How to display XML tree structure with Python? - snippsat - Nov-10-2017

Just getting the names of the tags works fine in both lxml and BeautifulSoup.
Keeping the structure in the output can be a challenge,
as I do not think either pretty_print (lxml) or prettify() (BS) helps here: both output the full markup, not a tags-only outline.

Example getting tag names:
from lxml import etree
from bs4 import BeautifulSoup

xml = '''\
<data>
    <timestamp>...</timestamp>
    <people>
        <person>
            <name>...</name>
            <age>...</age>
        </person>
        <person>
            <name>...</name>
            <age>...</age>
            <degree />
        </person>
        <person>
            <name>...</name>
            <age>...</age>
            <degree />
            <siblings>
                <brother>...</brother>
                <brother>...</brother>
                <sister>...</sister>
            </siblings>
        </person>
    </people>
    <cities>
        <city>
            <name>...</name>
            <country>...</country>
            <continent>...</continent>
            <capital />
        </city>
        <city>
            <name>...</name>
            <country>...</country>
            <continent>...</continent>
        </city>
    </cities>
</data>
'''

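root = etree.fromstring(xml)
# The 'xml' parser avoids the <html>/<body> wrappers that the 'lxml' HTML parser adds
soup = BeautifulSoup(xml, 'xml')

# lxml: iterate over every element in document order
for node in root.iter('*'):
    print(node.tag)

# BS: find_all() with no arguments returns every tag
for tag in soup.find_all():
    print(tag.name)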
Output:
data
timestamp
people
person
name
age
person
name
age
degree
person
name
age
degree
siblings
brother
brother
sister
cities
city
name
country
continent
capital
city
name
country
continent
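
Since the first post asks for every element to be shown only once, a flat list like this can be de-duplicated with dict.fromkeys, which keeps the first occurrence of each key in order. This is just a sketch: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra packages, and the XML is a cut-down stand-in for the sample above.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the sample XML (an assumption for brevity).
XML = (
    "<data><people>"
    "<person><name/><age/></person>"
    "<person><name/><age/><degree/></person>"
    "</people></data>"
)

root = ET.fromstring(XML)

# dict.fromkeys keeps the first occurrence of each tag, in document order.
unique_tags = list(dict.fromkeys(node.tag for node in root.iter()))
print(unique_tags)
```

Note that this merges tags by name only, so a name under person and a name under city would collapse into a single entry. The nesting is lost.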



RE: How to display XML tree structure with Python? - wavic - Nov-10-2017

Pretty straightforward:

from lxml import etree
from collections import Counter

xml = '''\
<data>
    <timestamp>...</timestamp>
    <people>
        <person>
            <name>...</name>
            <age>...</age>
        </person>
        <person>
            <name>...</name>
            <age>...</age>
            <degree />
        </person>
        <person>
            <name>...</name>
            <age>...</age>
            <degree />
            <siblings>
                <brother>...</brother>
                <brother>...</brother>
                <sister>...</sister>
            </siblings>
        </person>
    </people>
    <cities>
        <city>
            <name>...</name>
            <country>...</country>
            <continent>...</continent>
            <capital />
        </city>
        <city>
            <name>...</name>
            <country>...</country>
            <continent>...</continent>
        </city>
    </cities>
</data>
'''

root = etree.fromstring(xml)

for tag in root.iter():
    path = tree.getpath(tag)
    path = path.replace('/', '    ')
    spaces = Counter(path)
    tag_name = path.split()[-1].split('[')[0]
    tag_name = ' ' * (spaces[' '] - 4) + tag_name
    print(tag_name)
Output:
data
    timestamp
    people
        person
            name
            age
        person
            name
            age
            degree
        person
            name
            age
            degree
            siblings
                brother
                brother
                sister
    cities
        city
            name
            country
            continent
            capital
        city
            name
            country
            continent



RE: How to display XML tree structure with Python? - wavic - Nov-15-2017

I forgot to put tree = etree.ElementTree(root) before the for loop. Without it, tree.getpath(tag) raises a NameError.


RE: How to display XML tree structure with Python? - sonicblind - Nov-15-2017

Thanks to all of you for the tips!
They helped me to achieve my goal.

wavic - My aim was to have no duplicates, so your code was almost perfect, but I reworked it a bit to also include the attributes.

Here is my final code in case it helps somebody else as well.
I will use it any time I need to see clearly the structure of any XML file, to know all tags/attributes which I need to consider.

import re, collections
from lxml import etree
 
xml = '''\
<data>
    <timestamp>not important</timestamp>
    <people>
        <person name="Blue" given="John">
            <occupation>not important</occupation>
            <age>not important</age>
        </person>
        <person name="Green" given="Peter">
            <occupation>not important</occupation>
            <age>not important</age>
            <degree />
        </person>
        <person name="Red" given="Angela" maiden="Orange">
            <occupation fulltime="yes">not important</occupation>
            <age>not important</age>
            <birthday>not important</birthday>
            <degree />
            <siblings >
                <brother attrib1="no" attrib2="yes">not important</brother>
                <brother attrib1="yes">not important</brother>
                <sister>not important</sister>
            </siblings>
        </person>
    </people>
    <cities>
        <city name="Tokyo">
            <country>not important</country>
            <continent>not important</continent>
            <capital />
        </city>
        <city name="Atlanta">
            <country>not important</country>
            <continent>not important</continent>
            <olympics count="1">
            	<year>1996</year>
            	<season>summer</season>
            </olympics>
        </city>
    </cities>
</data>
'''

xml_root = etree.fromstring(xml)
raw_tree = etree.ElementTree(xml_root)
nice_tree = collections.OrderedDict()  # index-free path -> attribute names seen

for tag in xml_root.iter():
    # getpath() returns e.g. /data/people/person[3]; drop the [n] position indexes
    path = re.sub(r'\[[0-9]+\]', '', raw_tree.getpath(tag))
    if path not in nice_tree:
        nice_tree[path] = []
    # remember each attribute name once, in first-seen order
    for attrib in tag.keys():
        if attrib not in nice_tree[path]:
            nice_tree[path].append(attrib)

for path, attribs in nice_tree.items():
    indent = path.count('/') - 1  # depth: the root path has exactly one slash
    name = path.split('/')[-1]
    print('{0}{1}: {2} [{3}]'.format('    ' * indent, indent, name, ', '.join(attribs) or '-'))
Which gives me the following result:
Output:
0: data [-]
    1: timestamp [-]
    1: people [-]
        2: person [name, given, maiden]
            3: occupation [fulltime]
            3: age [-]
            3: degree [-]
            3: birthday [-]
            3: siblings [-]
                4: brother [attrib1, attrib2]
                4: sister [-]
    1: cities [-]
        2: city [name]
            3: country [-]
            3: continent [-]
            3: capital [-]
            3: olympics [count]
                4: year [-]
                4: season [-]



RE: How to display XML tree structure with Python? - wavic - Nov-15-2017

Good! At first I thought this would be a difficult task, but it seems that XPath is of great help.
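
To illustrate the path idea without lxml: the standard library's ElementTree has no getpath(), but similar index-free paths can be built by hand while walking the tree. The generator iter_paths below is a hypothetical helper written for this sketch, and the XML is a reduced sample.

```python
import xml.etree.ElementTree as ET

# Reduced sample (an assumption for brevity).
XML = """\
<data>
    <people>
        <person><name>...</name></person>
        <person>
            <name>...</name>
            <siblings><brother>...</brother><brother>...</brother></siblings>
        </person>
    </people>
</data>
"""


def iter_paths(element, parent_path=""):
    """Yield an index-free path such as '/data/people/person' for every element."""
    path = parent_path + "/" + element.tag
    yield path
    for child in element:
        yield from iter_paths(child, path)


root = ET.fromstring(XML)

# Keep each path once, in first-seen order.
unique_paths = list(dict.fromkeys(iter_paths(root)))

for path in unique_paths:
    depth = path.count("/") - 1
    print("    " * depth + path.rsplit("/", 1)[-1])
```

One caveat with any flat, first-seen list of paths: if a deeper tag first appears in a later sibling, it is appended at the end and can be printed beneath the wrong neighbour. Grouping children under their parent in a nested structure avoids that.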


RE: How to display XML tree structure with Python? - mreshko - Aug-12-2020

Hi sonicblind.

Great code! Very useful. Thank you.

It would be great if you could add these two features to the code:

(1) show the child's number after the level, e.g.
3.0: occupation [fulltime]
3.1: age [-]
3.2: degree [-]
3.3: birthday [-]
3.4: siblings [-]

(2) show the number of identical siblings. For example, if there were, say, 100 "person" elements, it would
be displayed as
2: person [name, given, maiden] [100]

Many thanks
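
A possible sketch of both requested features, not taken from the thread: it uses the standard library's xml.etree.ElementTree instead of lxml, and the helpers new_node, collect and render are hypothetical names. It assumes "number of identical siblings" means the total number of elements found at the same position in the structure, and that the child number is the position among the unique children of the parent.

```python
import xml.etree.ElementTree as ET

# Reduced sample (an assumption for brevity).
XML = """\
<data>
    <people>
        <person name="Blue"><age>1</age></person>
        <person name="Green" given="Peter"><age>2</age><degree /></person>
        <person name="Red"><age>3</age></person>
    </people>
</data>
"""


def new_node():
    # One node per unique tag under a given parent.
    return {"count": 0, "attribs": [], "children": {}}


def collect(element, node):
    """Merge `element` into `node`, counting occurrences and attribute names."""
    node["count"] += 1
    for name in element.keys():
        if name not in node["attribs"]:
            node["attribs"].append(name)
    for child in element:
        collect(child, node["children"].setdefault(child.tag, new_node()))


def render(tag, node, level=0, index=0):
    """Return lines like '2.0: person [name, given] [3]'."""
    attribs = ", ".join(node["attribs"]) or "-"
    lines = ["{}{}.{}: {} [{}] [{}]".format(
        "    " * level, level, index, tag, attribs, node["count"])]
    for i, (child_tag, child) in enumerate(node["children"].items()):
        lines.extend(render(child_tag, child, level + 1, i))
    return lines


root = ET.fromstring(XML)
root_node = new_node()
collect(root, root_node)
lines = render(root.tag, root_node)
print("\n".join(lines))
```

The last bracket is the occurrence count, so three person elements show up as [3], and degree, which appears in only one of them, shows [1].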