Python Forum
Parse XML String in Pandas Dataframe - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Data Science (https://python-forum.io/forum-44.html)
+--- Thread: Parse XML String in Pandas Dataframe (/thread-22948.html)



Parse XML String in Pandas Dataframe - creedX - Dec-04-2019

Here is my situation:

I have a pandas dataframe that contains one column with an xml string for each row. I need to be able to parse the xml string for each row to see the data elements of the xml file. All the code I have been able to find is code to parse an actual xml file. I do not have the xml file, rather just the xml string (below is an example):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>

How could I return the 'Yes' if I wanted to see InsuredSignatureOK? My only thought was using a loop but I heard that is not the best way to go about it for large dataframes. I have never worked with xml before and am newish to python, so any help is greatly appreciated! Smile


RE: Parse XML String in Pandas Dataframe - scidam - Dec-05-2019

Not sure, that there is more efficient way to do this, rather than using a loop; First, you need to define a processor, a function which consumes an xml-string and returns a value what you want (extract some value(s) from xml-string, convert them etc.).

def xml_processor(xml_string): 
    # do processing
    return "The value what you want"  
There are different ways to write such a function. If xml-string has relatively simple structure, you can try to build a regular expression which do the work. For example, if you want to extract text within tag "InsuredSignatureOK" ('Yes' in the example above), you can define a regular expression for this. No special xml-parsing libraries will be needed in this case. However, this approach will work only in simple cases. Otherwise, you will need to use libraries for parsing xml-documents. You can use xml package -- which is the part of Python, or install lxml (for example).
Here is minimal working example:

import pandas as pd
import xml
 
df = pd.DataFrame({"yourColumn": ["""<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application> """, """<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>"""]}) 

def xml_processor(s): 
    el = xml.dom.minidom.parseString(s) 
    tag = el.getElemntByTagName("InsuredSignatureOK")[0] 
    return tag.childNodes[0].data 

df.yourColumn = df.yourColumn.apply(xml_processor)
Note, xml_processor I just wrote is very specific, and you probably will need to write your own and use
try/except blocks to handle cases when data/xml-string is corrupted (or has unexpected structure).


RE: Parse XML String in Pandas Dataframe - creedX - Dec-09-2019

Thanks for your reply! I ended up finding a simpler approach and thought I would share for anyone dealing with XML string (although it does use a loop):

import pandas as pd
import xml.etree.ElementTree as ET

#establish dataframe
df = pd.DataFrame(myTable)

for x, row in df.iterrows() :
    myroot = ET.fromstring(row['myColumn']
    for InsuredSignatureOK in myroot.iter('InsuredSignatureOK') :
        print(InsuredSignatureOK.text)