Python Forum
Several xml files to dataframe
#1
I have several XML files that I want to transform into a dataframe. Each XML file should become one row. Here is an example of an XML file:
<?xml version='1.0' encoding='UTF-8'?>
<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L15S2017E1N001</uid>
  <metadonnees>
    <day>04 june 2017</day>
  </metadonnees>
  <contenu>
    <point nivpoint="1" valeur_ptsodj="2" ordinal_prise="1" id_preparation="819547" ordre_absolu_seance="8" code_grammaire="TITRE_TEXTE_DISCUSSION" code_style="Titre" code_parole="" sommaire="1" id_syceron="981344" valeur="">
      <orateurs/>
      <texte>Déclaration de...</texte>
      <paragraphe valeur_ptsodj="2" ordinal_prise="1" id_preparation="819550" ordre_absolu_seance="11" id_acteur="PA345619" id_mandat="-1" id_nomination_oe="PM725692" id_nomination_op="-1" code_grammaire="DEBAT_1_10" code_style="NORMAL" code_parole="PAROLE_1_2" sommaire="1" id_syceron="981347" valeur="">
        <orateurs>
          <orateur>
            <name>M. Edouard Philippe</name>
          </orateur>
        </orateurs>
        <texte>Monsieur le président...</texte>
      </paragraphe>
    </point>
  </contenu>
</compteRendu>
Here is my code:
import os
import xml.etree.ElementTree as ET
import pandas as pd

path = "whereIhavexmlfilessaved"

# create a dict with first-level children as keys and wanted descendants as values
d = {'metadonnees':['day'],
     'contenu':['name','texte']}

# initialize two lists: `cols` and `data`
cols, data = list(), list()

df=pd.DataFrame()

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        tree = ET.parse(path+"/"+filename)
        root = tree.getroot()
        # loop through d.items()
        for k, v in d.items():
            # find child
            child = root.find(f'{{*}}{k}')
            # use iter to check each descendant (`elem`)
            for elem in child.iter():
                # get `tag_end` for each descendant,
                # e.g. `texte` in "{http://schemas.assemblee-nationale.fr/referentiel}texte"
                tag_end = elem.tag.split('}')[-1]
                # check if `tag_end` in `v(alue)`
                if tag_end in v:
                    # add `tag_end` and `elem.text` to appropriate list
                    cols.append(tag_end)
                    data.append(elem.text)
                    dt = pd.DataFrame(data)
        # helper function to "increment" duplicate col names
        def f(lst):
            d = {}
            out = []
            for i in lst:
                if i not in d:
                    out.append(i)
                    d[i] = 2
                else:
                    out.append(i+str(d[i]))
                    d[i] += 1
                return out
        dt.columns = f(cols)
        df.append(dt)
My code only returns an empty dataframe. The original XML files are much longer. I should get only one "day" column, but several "name" and "text" columns. Not all XML files have exactly the same structure: for some files the columns are day, text, name1, text1, ...; for others they are day, text, text1, name2, text2, ... Here is an example of the dataframe that I want to obtain:
day           text               name1             text1                  name2  text2
04 june 2017  Déclaration de...  Edouard Philippe  Monsieur le président  John   python cool
05 june 2017  Hello world        NaN               World now              Mary   USA country
...
Could anyone help me improve my code?
#2
see https://pandas.pydata.org/docs/dev/refer...d_xml.html
#3
Thank you for your suggestion, but I already tried pd.read_xml(xml) and I only get 3 columns: 'uid', 'day' and 'point'.
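For reference, a few things in the code from post #1 would explain the empty result. df.append(dt) returns a new dataframe instead of modifying df, so the result is thrown away (and DataFrame.append was removed in pandas 2.0). The cols and data lists are never reset between files, pd.DataFrame(data) builds one column rather than one row, and the return inside f sits inside the for loop, so it exits after the first item. Below is a minimal sketch of one way to get one row per file, assuming the real files use the same tag names as the sample. The names WANTED, local_name, xml_to_row and folder_to_dataframe are hypothetical helpers made up for this illustration, and the numbering follows the helper in post #1 (first occurrence keeps the bare name, later ones get 2, 3, ...), not the exact column names of the desired table.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Tags to collect, taken from the sample file in post #1 (assumption:
# the real files use the same local tag names).
WANTED = {"day", "name", "texte"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def xml_to_row(root, wanted=WANTED):
    """Flatten one parsed document into a {column: text} dict.

    The first occurrence of a tag keeps its bare name; later ones get a
    numeric suffix (texte, texte2, texte3, ...).
    """
    row, seen = {}, {}
    for elem in root.iter():  # walks the tree in document order
        name = local_name(elem.tag)
        if name not in wanted:
            continue
        seen[name] = seen.get(name, 0) + 1
        column = name if seen[name] == 1 else f"{name}{seen[name]}"
        row[column] = elem.text
    return row


def folder_to_dataframe(folder):
    """Build one dataframe row per .xml file found in `folder`."""
    import pandas as pd  # imported here so the helpers above need only the stdlib

    rows = [xml_to_row(ET.parse(p).getroot())
            for p in sorted(Path(folder).glob("*.xml"))]
    # A list of dicts gives one row per dict; keys missing in a file become NaN.
    return pd.DataFrame(rows)


# Shortened version of the sample document, used only to show the row shape.
SAMPLE = """\
<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <metadonnees><day>04 june 2017</day></metadonnees>
  <contenu>
    <point>
      <texte>Déclaration de...</texte>
      <paragraphe>
        <orateurs><orateur><name>M. Edouard Philippe</name></orateur></orateurs>
        <texte>Monsieur le président...</texte>
      </paragraphe>
    </point>
  </contenu>
</compteRendu>
"""

row = xml_to_row(ET.fromstring(SAMPLE))
print(row)
```

With the real files, folder_to_dataframe(path) would be called on the folder that holds them. Collecting plain dicts first and building the dataframe once at the end avoids appending inside the loop, and files with different numbers of "name" or "texte" elements simply leave NaN in the columns they lack.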