Python Forum
parsing local xml files to csv
#1
Hi,
I'm pretty new to Python.
The microdata service of the International Labour Organization provides DDI XML files, and I have downloaded these to my computer.
I want to parse these XML files with Python and export them to CSV or Excel.
Since the sample content is too long, I didn't copy it here.
Here is the link: https://www.ilo.org/surveydata/index.php...g/ddi/1357

How can I parse all the XML tags?
How can I parse all the attributes?
I made a lot of attempts, but none were successful.
I tried to use the namespace, but it didn't work.
The result is always None.

Some variables, like the one below, define more than one category code. Is it possible to combine these categories into a single cell when writing the output?

Example:
<var ID="V210" name="educ" files="F3" dcml="0" intrvl="discrete">
<location StartPos="25" EndPos="25" width="1" RecSegNo="1"/>
<labl>
Respondent education level
</labl>
<valrng>
<range UNITS="REAL" min="1" max="5"/>
</valrng>
<sumStat type="vald">
1000
</sumStat>
<sumStat type="invd">
0
</sumStat>
<sumStat type="min">
1
</sumStat>
<sumStat type="max">
5
</sumStat>
<catgry>
<catValu>
1
</catValu>
<labl>
completed primary or less
</labl>
<catStat type="freq">
662
</catStat>
</catgry>
<catgry>
<catValu>
2
</catValu>
<labl>
secondary
</labl>
<catStat type="freq">
291
</catStat>
</catgry>
<catgry>
<catValu>
3
</catValu>
<labl>
completed tertiary or more
</labl>
<catStat type="freq">
46
</catStat>
</catgry>
<catgry>
<catValu>
4
</catValu>
<labl>
(dk)
</labl>
<catStat type="freq">
0
</catStat>
</catgry>
<catgry>
<catValu>
5
</catValu>
<labl>
(rf)
</labl>
<catStat type="freq">
1
</catStat>
</catgry>
<varFormat type="numeric" schema="other"/>
</var>

Thank you in advance for your help.
#2
You need a parser. Python has two big ones that most people use: BeautifulSoup and lxml.
Here is a starting example that parses a couple of values.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('NPL_2008_LFS_v01_M_v01_A_ILOVAR.xml', encoding='utf-8'), 'xml')
title = soup.find('titl')
producer = soup.find('producer')
print(title.text.strip())
print(producer.attrs.get('affiliation'))
Output:
NPL_2008_LFS_v01_M International Labour Organization
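On the namespace attempts from the first post that always returned None: with the standard library's xml.etree.ElementTree (swapped in here for BeautifulSoup so the example needs no extra installs), a plain tag search misses every element when the document declares a default namespace, because each tag name is then stored together with the namespace URI. A minimal sketch, assuming Python 3.8+ for the {*} wildcard; the namespace URI in the inline sample is a placeholder, not taken from the ILO files, which may declare a different one or none at all:

```python
import xml.etree.ElementTree as ET

# Inline sample; the namespace URI is a made-up placeholder.
SAMPLE = """<codeBook xmlns="http://example.org/ddi">
  <dataDscr>
    <var ID="V210" name="educ">
      <labl>
        Respondent education level
      </labl>
    </var>
  </dataDscr>
</codeBook>"""

root = ET.fromstring(SAMPLE)

# A plain tag name does not match a namespaced element, so this is None.
missed = root.find(".//var")

# Option 1: the {*} wildcard matches the tag in any namespace (Python 3.8+).
var_any = root.find(".//{*}var")

# Option 2: map a prefix of your choice to the namespace URI.
ns = {"ddi": "http://example.org/ddi"}
var_ns = root.find(".//ddi:var", ns)

print(missed)
print(var_any.get("name"))
print(var_ns.find("ddi:labl", ns).text.strip())
```

Attributes such as name and ID are not affected by the default namespace, so get() works on them without any prefix.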
#3
Thank you so much for answering.
I reviewed the Beautiful Soup package you mentioned.
I have also tried parsing simple HTML, and XML with only a few elements.
But I have never been successful with nested tags.

How do I get the tags, elements and attributes between the <dataDscr> ... </dataDscr> tags and transfer them to CSV?

How do I find and loop over the <dataDscr><var> ... </var></dataDscr> tags?

There are too many XML files. Do I have to define all the tags and the attributes that depend on them one by one?

For example, this variable is located at the end of the XML file:
<var ID="V541" name="ilo_neet" files="F6" dcml="0" intrvl="discrete">

There are multiple categories for this variable:
<catgry>
<catValu>
1
</catValu>
<labl>
Youth not in education, employment or training
</labl>
<catStat type="freq">
4967
</catStat>
</catgry>
<catgry missing="Y">
<catValu>
Sysmiss
</catValu>
<catStat type="freq">
71241
</catStat>
</catgry>


#4
(Feb-23-2019, 03:17 PM)erdem_ustunmu Wrote: How do I get the tags, elements and attributes between the <dataDscr> ... </dataDscr> tags and transfer them to CSV?

How do I find and loop over the <dataDscr><var> ... </var></dataDscr> tags?

There are too many XML files. Do I have to define all the tags and the attributes that depend on them one by one?
Since it's a big file, start by testing how to get the data out, then think about the structure you want in the CSV.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('NPL_2008_LFS_v01_M_v01_A_ILOVAR.xml', encoding='utf-8'), 'xml')
data = soup.find('dataDscr')
So inside dataDscr there are many var tags.
find() gets the first one; find_all() gets all of them.
Look at the data in the first one.
>>> var = data.find('var')
>>> var
<var ID="V270" dcml="0" files="F6" intrvl="contin" name="PSU">
<location width="16"/>
<labl>
        PSU
      </labl>
<valrng>
<range max="1800" min="1001"/>
</valrng>
<sumStat type="vald">
        76208
      </sumStat>
<sumStat type="invd">
        0
      </sumStat>
<sumStat type="min">
        1001
      </sumStat>
<sumStat type="max">
        1800
      </sumStat>
<sumStat type="mean">
        1412.79
      </sumStat>
<sumStat type="stdev">
        231.955
      </sumStat>
<varFormat schema="other" type="numeric"/>
</var>

# All attributes
>>> var.attrs
{'ID': 'V270', 'dcml': '0', 'files': 'F6', 'intrvl': 'contin', 'name': 'PSU'}

# Get name
>>> var.attrs.get('name')
'PSU'

# All sumStat
>>> [i.text.strip() for i in var.find_all('sumStat')]
['76208', '0', '1001', '1800', '1412.79', '231.955']
>>> 
#5
Hello,
Thank you very much for your help and your efforts.
I started building something with your solutions and examples.
I'm slowly trying to adapt your example:
from bs4 import BeautifulSoup
lst = []
soup = BeautifulSoup(open('NPL_2008_LFS_v01_M_v01_A_ILOVAR.xml', encoding='utf-8'), 'xml')
title = soup.find('titl')
producer = soup.find('producer')
print(title.text.strip())
print(producer.attrs.get('affiliation'))
data = soup.find('dataDscr')
vars = data.find_all('var')
for var in vars:
    ID=var.attrs.get('ID')
    name=var.attrs.get('name')
    files=var.attrs.get('files')
    dcml=var.attrs.get('dcml')
    intrvl=var.attrs.get('intrvl')
    labl=var.find('labl').text.strip()
    sumStat=[i.text.strip() for i in var.find_all('sumStat')]
    print(title.text.strip(),producer.text.strip(),ID,name,files,dcml,intrvl,labl,sumStat)
    lst.append((title.text.strip(),producer.text.strip(),ID,name,files,dcml,intrvl,labl,sumStat))
    
I couldn't do it even though I tried a lot, so I want to ask two things here.
1. Is it possible to include the type attribute together with the text of each sumStat?
# All sumStat
>>> [i.text.strip() for i in var.find_all('sumStat')]
so that the result looks like the following, or something similar:
['vald':'76208', 'invd':'0', 'min':'1001', 'max':'1800', 'mean':'1412.79', 'stdev':'231.955']


2. How do I combine the category tags?

<catgry>
        <catValu>
          1
        </catValu>
        <labl>
          Eastern
        </labl>
        <catStat type="freq">
          16926
        </catStat>
      </catgry>
      <catgry>
        <catValu>
          2
        </catValu>
        <labl>
          Central
        </labl>
        <catStat type="freq">
          31316
        </catStat>
      </catgry>
      <catgry>
        <catValu>
          3
        </catValu>
        <labl>
          Western
        </labl>
        <catStat type="freq">
          13527
        </catStat>
      </catgry>
      <catgry>
        <catValu>
          4
        </catValu>
        <labl>
          Mid-Western
        </labl>
        <catStat type="freq">
          8060
        </catStat>
      </catgry>
      <catgry>
        <catValu>
          5
        </catValu>
        <labl>
          Far-Western
        </labl>
        <catStat type="freq">
          6379
        </catStat>
</catgry>
Like the example below:
['1-Eastern','freq':'16926'] | ['2-Central','freq':'31316'] | ['3-Western','freq':'13527'] |......

I would appreciate your help.
Best regards
#6
(Feb-24-2019, 11:46 AM)erdem_ustunmu Wrote: >>> [i.text.strip() for i in var.find_all('sumStat')]
so that the result looks like the following, or something similar:
['vald':'76208', 'invd':'0', 'min':'1001', 'max':'1800', 'mean':'1412.79', 'stdev':'231.955']

2. How do I combine the category tags?
Example:
>>> sum_stat = [i.text.strip() for i in var.find_all('sumStat')]
>>> sum_stat
['76208', '0', '1001', '1800', '1412.79', '231.955']

>>> att = [i.attrs['type'] for i in var.find_all('sumStat')]
>>> att
['vald', 'invd', 'min', 'max', 'mean', 'stdev']

>>> # Combine with zip()
>>> list(zip(att, sum_stat))
[('vald', '76208'),
 ('invd', '0'),
 ('min', '1001'),
 ('max', '1800'),
 ('mean', '1412.79'),
 ('stdev', '231.955')]

>>> dict(list(zip(att, sum_stat)))
{'vald': '76208',
 'invd': '0',
 'min': '1001',
 'max': '1800',
 'mean': '1412.79',
 'stdev': '231.955'}
erdem_ustunmu Wrote: 2. How do I combine the category tags?
Try something yourself based on the info you have gotten so far.
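The two list comprehensions plus zip() shown above can also be collapsed into a single dict comprehension. A minimal sketch, using the standard library's xml.etree.ElementTree instead of BeautifulSoup so it runs without extra installs; the inline sample is a shortened copy of the PSU variable from this thread:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <var> element shown earlier in the thread.
SAMPLE = """<var ID="V270" name="PSU">
  <sumStat type="vald">
    76208
  </sumStat>
  <sumStat type="invd">
    0
  </sumStat>
  <sumStat type="min">
    1001
  </sumStat>
</var>"""

var = ET.fromstring(SAMPLE)

# One pass: the type attribute is the key, the stripped text is the value.
# With BeautifulSoup the same idea would be:
#   {s['type']: s.text.strip() for s in var.find_all('sumStat')}
sum_stat = {s.get("type"): s.text.strip() for s in var.findall("sumStat")}
print(sum_stat)
```

Reading the key and the value from the same element in one loop means they cannot drift out of step, which two separate lists could do if one of them were filtered differently.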
#7
Thank you so much for your help, @snippsat.
I've been working on this all day and have run different experiments with what you wrote.
I managed to do some of them, the simple ones.
What you did for sumStat works great.
I tried to do something for catgry, but the result is not correct.

I always get only the last value; not all the values come through.

import itertools
from bs4 import BeautifulSoup
lst = []
soup = BeautifulSoup(open('NPL_2008_LFS_v01_M_v01_A_ILOVAR.xml', encoding='utf-8'), 'xml')
title = soup.find('titl')
producer = soup.find('producer')
#affiliation=(soup.find('producer'))['affiliation']
#print(title.text.strip())
#print(producer.attrs.get('affiliation'))
data = soup.find('dataDscr')
vars = data.find_all('var')
for var in vars:
    
    ID=var.attrs.get('ID')
    name=var.attrs.get('name')
    files=var.attrs.get('files')
    dcml=var.attrs.get('dcml')
    intrvl=var.attrs.get('intrvl')
    labl=var.find('labl').text.strip()
    sumStat=[i.text.strip() for i in var.find_all('sumStat')]
   
    VarFormat=(var.find('varFormat')).attrs.get('type')
    stdCatgry = [stdCat.text.strip() for stdCat in  var.find_all("stdCatgry")]
    
    # There is a mistake here; I will look at it after merging the categories.
    #Range_Min=var.find_all('range')
    #Range_Unit=(var.find_all('range'))['UNITS']
    #Range_Min=(var.find_all('range'))['min']
    #Range_Max=(var.find_all('range'))['max']
    #print(Range_Min)
    

    # I tried to do it as follows, but it was not successful.

    for cat in var.find_all('catgry'):
        
               
        catValu =  [ values.text.strip() for values in cat.findAll("catValu")]
        catlabl =  [ values.text.strip() for values in cat.findAll("labl")]
        data = [item for item in itertools.zip_longest(catValu, catlabl)]
        
    print(title.text.strip(),producer.text.strip(),ID,name,files,dcml,intrvl,labl,sumStat,VarFormat,data,stdCatgry)
    lst.append((title.text.strip(),producer.text.strip(),ID,name,files,dcml,intrvl,labl,sumStat,VarFormat,data,stdCatgry))
As you said, I'm trying to build on the code you've written,
but it did not succeed, as marked in the code above. It returns only the most recent value, and I couldn't merge <catValu> with <labl> (<catValu>-<labl>).
Yours sincerely
#8
Hi snippsat,
I tried to get and combine the categories following the code you wrote yesterday.

for cat in var.find_all('catgry'):
        
        cat_value=[value.text.strip() for value in cat.find_all('catValu')]
        cat_label=[value.text.strip() for value in cat.find_all('labl')]
        cat_stat=[value.text.strip() for value in cat.find_all('catStat')]
        categories=dict(list(zip(cat_value, cat_label,cat_stat)))
        print(categories)


When I tried it yesterday, it returned only the last category, not all categories.
Now I've tried to do it the way you wrote it, but it didn't work.

I wonder if I'm making a logical mistake.

The code so far:
from bs4 import BeautifulSoup
lst = []
categories = []
soup = BeautifulSoup(open('NPL_2008_LFS_v01_M_v01_A_ILOVAR.xml', encoding='utf-8'), 'xml')
title = soup.find('titl')
producer = soup.find('producer')
data = soup.find('dataDscr')
vars = data.find_all('var')
for var in vars:
    
    ID=var.attrs.get('ID')
    name=var.attrs.get('name')
    files=var.attrs.get('files')
    dcml=var.attrs.get('dcml')
    intrvl=var.attrs.get('intrvl')
    labl=var.find('labl').text.strip()
   
    sum_Stat=[i.text.strip() for i in var.find_all('sumStat')] 
    sum_Att = [i.attrs['type'] for i in var.find_all('sumStat')] 
   
    sumStat=dict(list(zip(sum_Att, sum_Stat))) 
    VarFormat=(var.find('varFormat')).attrs.get('type')
    stdCatgry = [stdCat.text.strip() for stdCat in  var.find_all("stdCatgry")]
    Range = [i.attrs for i in var.find_all('range')]
    

    
    for cat in var.find_all('catgry'):
        
        cat_value=[value.text.strip() for value in cat.find_all('catValu')]
        cat_label=[value.text.strip() for value in cat.find_all('labl')]
        cat_stat=[value.text.strip() for value in cat.find_all('catStat')]
        categories=dict(list(zip(cat_value, cat_label,cat_stat)))
        print(categories)
        

        
    print(title.text.strip(),producer.text.strip(),ID,name,files,dcml,intrvl,labl,sumStat,VarFormat,stdCatgry,Range,categories)
    lst.append((title.text.strip(),producer.text.strip(),ID,name,files,dcml,intrvl,labl,sumStat,VarFormat,stdCatgry,Range,categories))
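Two things seem to go wrong in the inner loop above. First, dict() accepts only key/value pairs, so it raises ValueError as soon as zip() yields three-item tuples. Second, categories is reassigned on every pass, so at best only the last category would survive. A minimal sketch of one way around both, using the standard library's xml.etree.ElementTree in place of BeautifulSoup so it runs without extra installs; text_of is a hypothetical helper, and the sample is the ilo_neet variable quoted earlier in the thread:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<var ID="V541" name="ilo_neet">
  <catgry>
    <catValu>1</catValu>
    <labl>Youth not in education, employment or training</labl>
    <catStat type="freq">4967</catStat>
  </catgry>
  <catgry missing="Y">
    <catValu>Sysmiss</catValu>
    <catStat type="freq">71241</catStat>
  </catgry>
</var>"""


def text_of(parent, tag):
    """Return the stripped text of the first child named tag, or '' if absent."""
    child = parent.find(tag)
    return child.text.strip() if child is not None and child.text else ""


var = ET.fromstring(SAMPLE)

categories = []  # created once per <var>, before the inner loop
for cat in var.findall("catgry"):
    # Append one record per category instead of overwriting a single name.
    categories.append({
        "value": text_of(cat, "catValu"),
        "label": text_of(cat, "labl"),  # '' when <labl> is missing
        "freq": text_of(cat, "catStat"),
    })

for row in categories:
    print(row)
```

The second category has no labl element, so its label comes back as an empty string rather than raising AttributeError on None.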
#9
Hi,
I've found a solution for the categories, like this:
    # This fragment runs inside the "for var in vars:" loop from the earlier code.
    categories = []
    for cat in var.find_all('catgry'):
        cat_value = cat.find('catValu').text.strip()
        label_tag = cat.find('labl')
        stat_tag = cat.find('catStat')
        if label_tag is None and stat_tag is None:
            cat_label = ''
        elif label_tag is None:
            # No label: fall back to "type:value" taken from catStat
            cat_label = stat_tag.attrs['type'] + ':' + stat_tag.text.strip()
        elif stat_tag is None:
            # No catStat: use the label on its own
            cat_label = label_tag.text.strip()
        else:
            cat_label = label_tag.text.strip() + ':' + stat_tag.attrs['type'] + ':' + stat_tag.text.strip()

        # Collect one "value - label" string per category of this variable
        categories.append(str(cat_value) + ' - ' + str(cat_label))
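The thread stops before the CSV step, so here is a minimal sketch of how the collected values could be written out with the standard csv module, joining all categories of a variable into one cell as asked in the first post. It uses xml.etree.ElementTree rather than BeautifulSoup so it runs without extra installs. The sample document, the column list FIELDS, the helper names text_of and var_rows, and the cell format are all assumptions for illustration, and the sketch assumes the file declares no default namespace:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample in the shape of the files discussed in this thread.
SAMPLE = """<codeBook>
  <dataDscr>
    <var ID="V1" name="region" files="F6" dcml="0" intrvl="discrete">
      <labl>Region</labl>
      <catgry>
        <catValu>1</catValu>
        <labl>Eastern</labl>
        <catStat type="freq">16926</catStat>
      </catgry>
      <catgry>
        <catValu>2</catValu>
        <labl>Central</labl>
        <catStat type="freq">31316</catStat>
      </catgry>
    </var>
  </dataDscr>
</codeBook>"""

FIELDS = ["ID", "name", "files", "dcml", "intrvl", "labl", "categories"]


def text_of(parent, tag):
    """Return the stripped text of the first child named tag, or '' if absent."""
    child = parent.find(tag)
    return child.text.strip() if child is not None and child.text else ""


def var_rows(xml_text):
    """Yield one dict per <var> element found under <dataDscr>."""
    root = ET.fromstring(xml_text)
    for var in root.iterfind(".//dataDscr/var"):
        # The first five columns come straight from the <var> attributes.
        row = {key: var.get(key, "") for key in FIELDS[:5]}
        row["labl"] = text_of(var, "labl")
        cells = []
        for cat in var.findall("catgry"):
            cells.append(
                f"{text_of(cat, 'catValu')} - {text_of(cat, 'labl')}"
                f" (freq: {text_of(cat, 'catStat')})"
            )
        # All categories of one variable end up in a single cell.
        row["categories"] = " | ".join(cells)
        yield row


# StringIO keeps the sketch self-contained; for a real file use
# open("vars.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(var_rows(SAMPLE))
print(buffer.getvalue())
```

Because var_rows takes the XML text as an argument, the same function can be called in a loop over many downloaded files and all rows written to one CSV.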