Several xml files to dataframe - mfernandes - Sep-20-2022

I have several XML files that I want to transform into a dataframe. Each XML file should become one row. Here is an example of an XML file:

    <?xml version='1.0' encoding='UTF-8'?>
    <compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
      <uid>CRSANR5L15S2017E1N001</uid>
      <metadonnees>
        <day>04 june 2017</day>
      </metadonnees>
      <contenu>
        <point nivpoint="1" valeur_ptsodj="2" ordinal_prise="1" id_preparation="819547" ordre_absolu_seance="8" code_grammaire="TITRE_TEXTE_DISCUSSION" code_style="Titre" code_parole="" sommaire="1" id_syceron="981344" valeur="">
          <orateurs/>
          <texte>Déclaration de...</texte>
          <paragraphe valeur_ptsodj="2" ordinal_prise="1" id_preparation="819550" ordre_absolu_seance="11" id_acteur="PA345619" id_mandat="-1" id_nomination_oe="PM725692" id_nomination_op="-1" code_grammaire="DEBAT_1_10" code_style="NORMAL" code_parole="PAROLE_1_2" sommaire="1" id_syceron="981347" valeur="">
            <orateurs>
              <orateur>
                <name>M. Edouard Philippe</name>
              </orateur>
            </orateurs>
            <texte>Monsieur le président...</texte>
          </paragraphe>
        </point>
      </contenu>
    </compteRendu>

Here is my code:

    import os
    import xml.etree.ElementTree as ET
    import pandas as pd

    path = "whereIhavexmlfilessaved"

    # create a dict with first childs as key and descendants as values
    d = {'metadonnees':['day'], 'contenu':['nom','texte']}

    # initialize two lists: `cols` and `data`
    cols, data = list(), list()

    df=pd.DataFrame()

    for filename in os.listdir(path):
        if filename.endswith('.xml'):
            tree = ET.parse(path+"/"+filename)
            root = tree.getroot()

            # loop through d.item
            for k, v in d.items():
                # find child
                child = root.find(f'{{*}}{k}')
                # use iter to check each descendant (`elem`)
                for elem in child.iter():
                    # get `tag_end` for each descendant,
                    # e.g. `texte` in "{http://schemas.assemblee-nationale.fr/referentiel}texte"
                    tag_end = elem.tag.split('}')[-1]
                    # check if `tag_end` in `v(alue)`
                    if tag_end in v:
                        # add `tag_end` and `elem.text` to appropriate list
                        cols.append(tag_end)
                        data.append(elem.text)

            dt = pd.DataFrame(data)

            # helper function to "increment" duplicate col names
            def f(lst):
                d = {}
                out = []
                for i in lst:
                    if i not in d:
                        out.append(i)
                        d[i] = 2
                    else:
                        out.append(i+str(d[i]))
                        d[i] += 1
                return out

            dt.columns = f(cols)
            df.append(dt)

My code only returns an empty dataframe. The original XML files are much longer. I should have only one "day" column, but several "name" and "text" columns. Not all XML files are exactly the same: for some files the columns are day, text, name1, text1, ...; for others they are day, text, text1, name2, text2, ...

Here is an example of the dataframe that I want to obtain:

    day           text               name1             text1                  name2  text2
    04 june 2017  Déclaration de...  Edouard Philippe  Monsieur le président  John   python cool
    05 june 2017  Hello world        NaN               World now              Mary   USA country
    ...

Could anyone help me improve my code?

RE: Several xml files to dataframe - Larz60+ - Sep-20-2022

see https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html

RE: Several xml files to dataframe - mfernandes - Sep-20-2022

Thank you for your suggestion, but I already tried pd.read_xml(xml); I only get 3 columns: 'uid', 'day' and 'point'.
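
One likely reason the posted code ends with an empty dataframe is that `DataFrame.append` returned a new frame instead of modifying `df` in place (and the method was removed in pandas 2.0), so its result was discarded on every iteration. A possible way around this is to build one plain dict per file and construct the dataframe once at the end. The sketch below is only an illustration under stated assumptions: the helper names `xml_to_row`, `rows_from_dir` and `to_frame` are made up for this example, the tag names `texte` and `name` are taken from the sample file above (the posted code looks for `nom`, so the real files may differ), and the numbering rule (first text block unnumbered, later blocks suffixed 1, 2, ...) is inferred from the desired table.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_to_row(root, text_tag="texte", name_tag="name"):
    """Flatten one parsed <compteRendu> root into a dict (one future row).

    Assumption: every element under <contenu> that has a direct text_tag
    child counts as one numbered block. Requires Python 3.8+ for the
    '{*}' namespace wildcard.
    """
    row = {}
    day = root.find("{*}metadonnees/{*}day")
    row["day"] = day.text if day is not None else None

    contenu = root.find("{*}contenu")
    if contenu is None:
        return row

    i = 0
    for block in contenu.iter():  # document order
        texte = block.find(f"{{*}}{text_tag}")
        if texte is None:
            continue  # not a text-bearing block
        suffix = str(i) if i else ""  # first block is plain "text"
        name = block.find(f"{{*}}orateurs/{{*}}orateur/{{*}}{name_tag}")
        if name is not None and name.text:
            row[f"name{suffix}"] = name.text.strip()
        row[f"text{suffix}"] = "".join(texte.itertext()).strip()
        i += 1
    return row


def rows_from_dir(path):
    """Parse every *.xml file in `path` into one dict per file."""
    return [xml_to_row(ET.parse(p).getroot())
            for p in sorted(Path(path).glob("*.xml"))]


def to_frame(rows):
    """Build the dataframe in one go; keys missing from a row become NaN."""
    import pandas as pd  # imported here so the parsing helpers work without pandas
    return pd.DataFrame(rows)
```

With these helpers, `df = to_frame(rows_from_dir(path))` would give one row per file. Because each file contributes only the keys it actually has, a file without a speaker in block 1 simply has no `name1` entry and pandas fills that cell with NaN.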