Nov-15-2017, 08:34 PM
Thanks to all of you for the tips!
They helped me to achieve my goal.
wavic - My aim was to have no duplicates, so your code was almost perfect; I just reworked it a bit to include the attributes as well.
Here is my final code in case it helps somebody else as well.
I will use it whenever I need a clear view of an XML file's structure, so that I know all the tags/attributes I need to consider.
import re
import collections

from lxml import etree

xml = '''\
<data>
    <timestamp>not important</timestamp>
    <people>
        <person name="Blue" given="John">
            <occupation>not important</occupation>
            <age>not important</age>
        </person>
        <person name="Green" given="Peter">
            <occupation>not important</occupation>
            <age>not important</age>
            <degree />
        </person>
        <person name="Red" given="Angela" maiden="Orange">
            <occupation fulltime="yes">not important</occupation>
            <age>not important</age>
            <birthday>not important</birthday>
            <degree />
            <siblings>
                <brother attrib1="no" attrib2="yes">not important</brother>
                <brother attrib1="yes">not important</brother>
                <sister>not important</sister>
            </siblings>
        </person>
    </people>
    <cities>
        <city name="Tokyo">
            <country>not important</country>
            <continent>not important</continent>
            <capital />
        </city>
        <city name="Atlanta">
            <country>not important</country>
            <continent>not important</continent>
            <olympics count="1">
                <year>1996</year>
                <season>summer</season>
            </olympics>
        </city>
    </cities>
</data>
'''

xml_root = etree.fromstring(xml)
raw_tree = etree.ElementTree(xml_root)

# Maps each unique path (without positional indexes) to its attribute names.
nice_tree = collections.OrderedDict()

for tag in xml_root.iter():
    # getpath() returns e.g. /data/people/person[2]; strip the [n] parts
    # so that repeated elements collapse into a single entry.
    path = re.sub(r'\[[0-9]+\]', '', raw_tree.getpath(tag))
    if path not in nice_tree:
        nice_tree[path] = []
    if len(tag.keys()) > 0:
        nice_tree[path].extend(attrib for attrib in tag.keys()
                               if attrib not in nice_tree[path])

for path, attribs in nice_tree.items():
    # Nesting depth: the root path '/data' has one slash, so depth 0.
    indent = int(path.count('/') - 1)
    print('{0}{1}: {2} [{3}]'.format('    ' * indent, indent, path.split('/')[-1],
                                     ', '.join(attribs) if len(attribs) > 0 else '-'))

Which gives me the following result:
Output:
0: data [-]
    1: timestamp [-]
    1: people [-]
        2: person [name, given, maiden]
            3: occupation [fulltime]
            3: age [-]
            3: degree [-]
            3: birthday [-]
            3: siblings [-]
                4: brother [attrib1, attrib2]
                4: sister [-]
    1: cities [-]
        2: city [name]
            3: country [-]
            3: continent [-]
            3: capital [-]
            3: olympics [count]
                4: year [-]
                4: season [-]
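
In case somebody cannot install lxml, here is a rough sketch of the same idea using only the standard library. xml.etree.ElementTree has no getpath(), so this version builds the path by hand while walking the tree. The names summarize_structure and walk and the small sample document are my own inventions for illustration, not part of the code above, and I have not tried it on namespaced files (there the tag names would carry a {uri} prefix).

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict


def summarize_structure(xml_text):
    """Map each unique tag path to the attribute names seen on it.

    Hypothetical helper: a standard-library variant of the lxml approach.
    """
    root = ET.fromstring(xml_text)
    summary = OrderedDict()

    def walk(element, parent_path):
        # Build the path manually; repeated siblings share one path,
        # so duplicates collapse without any regex cleanup.
        path = parent_path + '/' + element.tag
        attribs = summary.setdefault(path, [])
        for name in element.keys():
            if name not in attribs:
                attribs.append(name)
        for child in element:
            walk(child, path)

    walk(root, '')
    return summary


# Small made-up sample, only to show the call.
sample = '''<data>
  <person name="Blue"><age>1</age></person>
  <person name="Red" maiden="Orange"><age>2</age><degree /></person>
</data>'''

for path, attribs in summarize_structure(sample).items():
    depth = path.count('/') - 1
    print('{0}{1}: {2} [{3}]'.format('    ' * depth, depth,
                                     path.split('/')[-1],
                                     ', '.join(attribs) or '-'))
```

The walk is depth-first in document order, so the entries should come out in the same order as with xml_root.iter() in the lxml version.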