Python Forum

Full Version: Parse data from xml file
I'm trying to parse data from an XML file downloaded from https://www.treasury.gov/ofac/downloads/...idated.xml

A sample of the XML file is attached.

I tried to parse the data, but it was not successful. The output I'm getting for "firstname" is an empty list.

I would appreciate it if someone could help with this.


import xml.etree.ElementTree as ET

file = ET.parse(r'D:\path\to\file\test.xml')

for node in file.getroot():
    print(node)
    firstname = node.findall('firstName')
    print(firstname)
Output:
<Element '{http://tempuri.org/sdnList.xsd}publshInformation' at 0x000001A18F7F8098>
[]
<Element '{http://tempuri.org/sdnList.xsd}sdnEntry' at 0x000001A18F8091D8>
[]
<Element '{http://tempuri.org/sdnList.xsd}sdnEntry' at 0x000001A18F809E08>
[]
#!/usr/bin/python3
import xml.etree.ElementTree as ET

file = ET.parse(r'test.xml')

for node in file.getroot():
    print(node)
    firstname = node.findall('{http://tempuri.org/sdnList.xsd}firstName')
    print(firstname)
Output:
<Element '{http://tempuri.org/sdnList.xsd}publshInformation' at 0xb74ffd24>
[]
<Element '{http://tempuri.org/sdnList.xsd}sdnEntry' at 0xb74ffdc4>
[<Element '{http://tempuri.org/sdnList.xsd}firstName' at 0xb74ffe14>]
<Element '{http://tempuri.org/sdnList.xsd}sdnEntry' at 0xb7502554>
[<Element '{http://tempuri.org/sdnList.xsd}firstName' at 0xb75025a4>]
...
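The first attempt returned empty lists because every element in the file lives in the namespace http://tempuri.org/sdnList.xsd, so the bare name 'firstName' matches nothing. As a possible alternative to spelling out the URI in every call, findall() also accepts a prefix-to-URI mapping. This is only a minimal sketch: the inline XML is a hypothetical stand-in for test.xml built from the values shown in this thread, and the prefix name 'sdn' is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for test.xml (structure assumed from the thread's output)
SAMPLE = """\
<sdnList xmlns="http://tempuri.org/sdnList.xsd">
  <publshInformation/>
  <sdnEntry>
    <uid>9639</uid>
    <firstName>Ismail Abdul Salah</firstName>
    <lastName>HANIYA</lastName>
    <sdnType>Individual</sdnType>
  </sdnEntry>
  <sdnEntry>
    <uid>26182</uid>
    <firstName>Evren</firstName>
    <lastName>KAYAKIRAN</lastName>
    <sdnType>Individual</sdnType>
  </sdnEntry>
</sdnList>
"""

# Arbitrary prefix mapped to the namespace URI used by the file
NS = {'sdn': 'http://tempuri.org/sdnList.xsd'}

root = ET.fromstring(SAMPLE)

# Without a namespace the tag name does not match any child
unqualified = [node.findall('firstName') for node in root]

# With the prefix mapping, 'sdn:firstName' resolves to the qualified tag
qualified = [[e.text for e in node.findall('sdn:firstName', NS)] for node in root]

print(unqualified)
print(qualified)
```

The first child (publshInformation) has no firstName, which is why its list stays empty in both cases.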
(Jun-06-2019, 04:41 AM)heiner55 Wrote: [ -> ]


Thanks for the answer.

Using the suggested approach, I managed to parse some data.

import pandas as pd
import xml.etree.ElementTree as ET

file = ET.parse(r'test.xml')

# Collect one row per node, then build the DataFrame once at the end
# (DataFrame.append was removed in pandas 2.0)
Data_columns=['uid','firstName','lastName','sdnType']
rows = []

for node in file.getroot():
    uid= [uid.text for uid in node.findall('{http://tempuri.org/sdnList.xsd}uid')]
    firstname= [firstname.text for firstname in node.findall('{http://tempuri.org/sdnList.xsd}firstName')]
    lastName= [lastName.text for lastName in node.findall('{http://tempuri.org/sdnList.xsd}lastName')]
    sdnType= [sdnType.text for sdnType in node.findall('{http://tempuri.org/sdnList.xsd}sdnType')]
    rows.append([uid,firstname,lastName,sdnType])

table = pd.DataFrame(rows,columns=Data_columns)

print(table)
Output:
       uid             firstName     lastName       sdnType
0       []                    []           []            []
1   [9639]  [Ismail Abdul Salah]     [HANIYA]  [Individual]
2  [26182]               [Evren]  [KAYAKIRAN]  [Individual]
How can I get the values without the brackets?

I would appreciate it if someone could help with this.
Because findall() returns a list, so each value is a list:

uid    ==> the whole list
uid[0] ==> first element of the list
uid[1] ==> second element
(Jun-07-2019, 04:22 PM)heiner55 Wrote: [ -> ]Because it is an array:


Thanks for the answer.

I'm not sure how to get an element like uid[0], as sometimes the list is empty ([]).

I would appreciate it if someone could show how to turn the value in the list into a string.
if uid == []:
    name = "none"
else:
    name = uid[0]
or

name = uid[0] if uid != [] else "none"
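Another option that might avoid the list altogether is Element.findtext(), which returns the text of the first matching child, or a default value when nothing matches. A minimal sketch, assuming a hypothetical entry that has no firstName child; the 'sdn' prefix is again an arbitrary name.

```python
import xml.etree.ElementTree as ET

NS = {'sdn': 'http://tempuri.org/sdnList.xsd'}

# Hypothetical entry without a firstName child, for illustration only
entry = ET.fromstring(
    '<sdnEntry xmlns="http://tempuri.org/sdnList.xsd">'
    '<uid>9639</uid><lastName>HANIYA</lastName>'
    '</sdnEntry>'
)

# findtext() yields a plain string; `default` is used when no child matches
uid = entry.findtext('sdn:uid', default='', namespaces=NS)
first_name = entry.findtext('sdn:firstName', default='', namespaces=NS)

print(repr(uid), repr(first_name))
```

This only looks at the first match, so it fits fields that occur at most once per entry.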
(Jun-07-2019, 05:17 PM)heiner55 Wrote: [ -> ]uid[0] if uid != [] else "none"

Thanks for the answer

I have adjusted my code accordingly

import pandas as pd
import xml.etree.ElementTree as ET

file = ET.parse(r'test.xml')

# Collect one row per node, then build the DataFrame once at the end
# (DataFrame.append was removed in pandas 2.0)
Data_columns=['uid','firstName','lastName','sdnType']
rows = []

for node in file.getroot():
    uid= [uid.text for uid in node.findall('{http://tempuri.org/sdnList.xsd}uid')]
    firstname= [firstname.text for firstname in node.findall('{http://tempuri.org/sdnList.xsd}firstName')]
    lastName= [lastName.text for lastName in node.findall('{http://tempuri.org/sdnList.xsd}lastName')]
    sdnType= [sdnType.text for sdnType in node.findall('{http://tempuri.org/sdnList.xsd}sdnType')]
    # Take the first match of each field, or '' when nothing was found
    rows.append([uid[0] if uid else '',
                 firstname[0] if firstname else '',
                 lastName[0] if lastName else '',
                 sdnType[0] if sdnType else ''])

table = pd.DataFrame(rows,columns=Data_columns)

print(table)
Output:
     uid           firstName   lastName     sdnType
0
1   9639  Ismail Abdul Salah     HANIYA  Individual
2  26182               Evren  KAYAKIRAN  Individual
Now it looks better.
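The empty row 0 in the output comes from the first child of the root, publshInformation, which has none of these fields. One way to skip it might be to iterate over the sdnEntry elements only. The sketch below is an assumption-laden illustration: the inline XML is a hypothetical stand-in for test.xml, the 'sdn' prefix is arbitrary, and it stops at a list of dicts rather than importing pandas.

```python
import xml.etree.ElementTree as ET

NS = {'sdn': 'http://tempuri.org/sdnList.xsd'}
COLUMNS = ['uid', 'firstName', 'lastName', 'sdnType']

# Hypothetical stand-in for test.xml
SAMPLE = """\
<sdnList xmlns="http://tempuri.org/sdnList.xsd">
  <publshInformation/>
  <sdnEntry>
    <uid>9639</uid>
    <firstName>Ismail Abdul Salah</firstName>
    <lastName>HANIYA</lastName>
    <sdnType>Individual</sdnType>
  </sdnEntry>
  <sdnEntry>
    <uid>26182</uid>
    <firstName>Evren</firstName>
    <lastName>KAYAKIRAN</lastName>
    <sdnType>Individual</sdnType>
  </sdnEntry>
</sdnList>
"""

root = ET.fromstring(SAMPLE)

# Only sdnEntry children are visited, so publshInformation produces no row
rows = [
    {col: entry.findtext(f'sdn:{col}', default='', namespaces=NS) for col in COLUMNS}
    for entry in root.findall('sdn:sdnEntry', NS)
]

for row in rows:
    print(row)
```

Passing the result to pd.DataFrame(rows, columns=COLUMNS) should then give the same table without the blank first row.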
Next, I tried to parse the value in "programList/program":
<ns0:programList>
<ns0:program>FSE-IR</ns0:program>
</ns0:programList>
1. I managed to get the value using the following code, but is there a better way to do this?
1st try
for node in file.getroot():
    for programList in node.findall('{http://tempuri.org/sdnList.xsd}programList'):
        for program in programList.findall('{http://tempuri.org/sdnList.xsd}program'):
            print(program.text)
2nd try
def cleanaa(a):
    cleana = a[0] if a != [] else ''
    return cleana 

for node in file.getroot():
    programList1 = cleanaa([[program.text for program in programList.findall('{http://tempuri.org/sdnList.xsd}program')] for programList in node.findall('{http://tempuri.org/sdnList.xsd}programList')])
    print(programList1)
The second output seems more appropriate, as it creates a list and collects multiple values for each iteration if there are several (there can be at most two).
E.g.:

Output:
['UKRAINE-EO13662']
['SYRIA', 'UKRAINE-EO13662']
['UKRAINE-EO13662']
2. Since there can be one or two values, can I get them into two variables, so that the second variable is an empty string ('') when there is only one value?

I would appreciate any input on this.
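On question 1, a possibly simpler route is to let findall() walk the path itself, since it accepts 'parent/child' paths, which removes the nested loops. A minimal sketch, assuming hypothetical entries shaped like the programList snippet above; the 'sdn' prefix is an arbitrary name.

```python
import xml.etree.ElementTree as ET

NS = {'sdn': 'http://tempuri.org/sdnList.xsd'}

# Hypothetical entries; program values taken from the output shown in the thread
SAMPLE = """\
<sdnList xmlns="http://tempuri.org/sdnList.xsd">
  <sdnEntry>
    <programList>
      <program>UKRAINE-EO13662</program>
    </programList>
  </sdnEntry>
  <sdnEntry>
    <programList>
      <program>SYRIA</program>
      <program>UKRAINE-EO13662</program>
    </programList>
  </sdnEntry>
</sdnList>
"""

root = ET.fromstring(SAMPLE)

# One path expression replaces the two nested findall() loops
programs_per_entry = [
    [p.text for p in entry.findall('sdn:programList/sdn:program', NS)]
    for entry in root.findall('sdn:sdnEntry', NS)
]

for programs in programs_per_entry:
    print(programs)
```

An entry without a programList simply yields an empty list here.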
Maybe this helps:

#!/usr/bin/python3

def cleanaa(a):
    cleana = a[0] if a != [] else ''
    return cleana 

[x, *y] = ['UKRAINE-EO13662']
print(x, cleanaa(y))

[x, *y] = ['SYRIA', 'UKRAINE-EO13662']
print(x, cleanaa(y))
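Note that the starred unpacking above needs at least one element: [x, *y] = [] raises a ValueError. If an entry might have no program at all, padding the list before unpacking could be a safer variant. A minimal sketch; first_two is a hypothetical helper name, not something from the thread.

```python
def first_two(values, fill=''):
    """Return the first two items of `values`, padding with `fill` if needed."""
    padded = list(values) + [fill, fill]
    return padded[0], padded[1]


for programs in (['SYRIA', 'UKRAINE-EO13662'], ['UKRAINE-EO13662'], []):
    program1, program2 = first_two(programs)
    print(repr(program1), repr(program2))
```

Any values beyond the second are ignored, which matches the stated maximum of two programs per entry.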