Several xml files to dataframe

Several xml files to dataframe - mfernandes - Sep-20-2022

I have several XML files that I want to transform into a dataframe, with one row per XML file. Here is an example of an XML file:

<?xml version='1.0' encoding='UTF-8'?>
<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L15S2017E1N001</uid>
  <metadonnees>
    <day>04 june 2017</day>
  </metadonnees>
  <contenu>
    <point nivpoint="1" valeur_ptsodj="2" ordinal_prise="1" id_preparation="819547" ordre_absolu_seance="8" code_grammaire="TITRE_TEXTE_DISCUSSION" code_style="Titre" code_parole="" sommaire="1" id_syceron="981344" valeur="">
      <orateurs/>
      <texte>Déclaration de...</texte>
      <paragraphe valeur_ptsodj="2" ordinal_prise="1" id_preparation="819550" ordre_absolu_seance="11" id_acteur="PA345619" id_mandat="-1" id_nomination_oe="PM725692" id_nomination_op="-1" code_grammaire="DEBAT_1_10" code_style="NORMAL" code_parole="PAROLE_1_2" sommaire="1" id_syceron="981347" valeur="">
        <orateurs>
          <orateur>
            <name>M. Edouard Philippe</name>
          </orateur>
        </orateurs>
        <texte>Monsieur le président...</texte>
      </paragraphe>
    </point>
  </contenu>
</compteRendu>

Here is my code:

import os
import xml.etree.ElementTree as ET
import pandas as pd

path = "whereIhavexmlfilessaved"

# create a dict with first-level children as keys and their descendants as values
d = {'metadonnees':['day'],
     'contenu':['nom','texte']}

# initialize two lists: `cols` and `data`
cols, data = list(), list()

df=pd.DataFrame()

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        tree = ET.parse(path+"/"+filename)
        root = tree.getroot()
        # loop through d.items()
        for k, v in d.items():
            # find child
            child = root.find(f'{{*}}{k}')
            # use iter to check each descendant (`elem`)
            for elem in child.iter():
                # get `tag_end` for each descendant,
                # e.g. `texte` in "{http://schemas.assemblee-nationale.fr/referentiel}texte"
                tag_end = elem.tag.split('}')[-1]
                # check if `tag_end` in `v(alue)`
                if tag_end in v:
                    # add `tag_end` and `elem.text` to appropriate list
                    cols.append(tag_end)
                    data.append(elem.text)
                    dt = pd.DataFrame(data)
        # helper function to "increment" duplicate col names
        def f(lst):
            d = {}
            out = []
            for i in lst:
                if i not in d:
                    out.append(i)
                    d[i] = 2
                else:
                    out.append(i+str(d[i]))
                    d[i] += 1
                return out
        dt.columns = f(cols)
        df.append(dt)

My code only returns an empty dataframe. The original XML files are much longer. I should have only one "day" column, but several "name" and "text" columns. Not all XML files have exactly the same structure. For some files the columns are: day, text, name1, text1,...; for others they are: day, text, text1, name2, text2,... Here is an example of the dataframe that I want to obtain:

day           text               name1             text1                  name2  text2
04 june 2017  Déclaration de...  Edouard Philippe  Monsieur le président  John   python cool
05 june 2017  Hello world        NaN               World now              Mary   USA country
...

Could anyone help me improve my code?

RE: Several xml files to dataframe - Larz60+ - Sep-20-2022

See https://pandas.pydata.org/docs/dev/reference/api/pandas.read_xml.html

RE: Several xml files to dataframe - mfernandes - Sep-20-2022

Thank you for the suggestion, but I already tried pd.read_xml(xml) and I only get 3 columns: 'uid', 'day' and 'point'.
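
For reference, here is a minimal sketch of one way to get one row per file with ElementTree. Two things in the posted code stand out: the result of df.append(dt) is never assigned back to df (and DataFrame.append was removed in pandas 2.0), so df stays empty, and pd.DataFrame(data) on a flat list builds one column with many rows rather than one row. Since pd.read_xml is designed for shallow documents, the nested speaker names and texts are easier to collect by hand. The names NS, xml_to_row and folder_to_dataframe are hypothetical helpers, not part of any library. The sketch assumes the tags are `day`, `name` and `texte` as in the sample (the real files may use `nom`), and that columns are numbered by the position of each element that has its own `texte` child, which is only one reading of the desired table.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace URI copied from the sample file; the prefix "an" is arbitrary.
NS = {"an": "http://schemas.assemblee-nationale.fr/referentiel"}


def xml_to_row(root):
    """Flatten one <compteRendu> element into a dict (one DataFrame row)."""
    row = {"day": root.findtext("an:metadonnees/an:day", namespaces=NS)}
    contenu = root.find("an:contenu", NS)
    if contenu is None:
        return row

    block = 0  # position of the current text block; 0 gets no suffix
    for elem in contenu.iter():
        # Only elements with a direct <texte> child (point, paragraphe) count.
        texte = elem.find("an:texte", NS)
        if texte is None:
            continue
        suffix = str(block) if block else ""
        # Assumption: the speaker sits at orateurs/orateur/name, as in the sample.
        name = elem.findtext("an:orateurs/an:orateur/an:name", namespaces=NS)
        if name is not None:
            row[f"name{suffix}"] = name
        # itertext() also keeps text that follows inline markup inside <texte>.
        row[f"text{suffix}"] = "".join(texte.itertext()).strip()
        block += 1
    return row


def folder_to_dataframe(folder):
    """Parse every *.xml file in `folder` and return one row per file."""
    import pandas as pd  # imported here so xml_to_row works without pandas

    rows = [xml_to_row(ET.parse(p).getroot())
            for p in sorted(Path(folder).glob("*.xml"))]
    # A list of dicts gives one row per dict; missing keys become NaN.
    return pd.DataFrame(rows)
```

Calling folder_to_dataframe("whereIhavexmlfilessaved") would then build the frame in one step instead of appending inside the loop. For the sample file the row keys come out as day, text, name1, text1. The speaker's title ("M.") is kept as-is; stripping it would be a separate step.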