Python Forum

Full Version: Parse XML String in Pandas Dataframe
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Here is my situation:

I have a pandas dataframe that contains one column with an xml string for each row. I need to be able to parse the xml string for each row to see the data elements of the xml file. All the code I have been able to find is code to parse an actual xml file. I do not have the xml file, rather just the xml string (below is an example):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>

How could I return the 'Yes' if I wanted to see InsuredSignatureOK? My only thought was using a loop but I heard that is not the best way to go about it for large dataframes. I have never worked with xml before and am newish to python, so any help is greatly appreciated! Smile
Not sure, that there is more efficient way to do this, rather than using a loop; First, you need to define a processor, a function which consumes an xml-string and returns a value what you want (extract some value(s) from xml-string, convert them etc.).

def xml_processor(xml_string): 
    # do processing
    return "The value what you want"  
There are different ways to write such a function. If xml-string has relatively simple structure, you can try to build a regular expression which do the work. For example, if you want to extract text within tag "InsuredSignatureOK" ('Yes' in the example above), you can define a regular expression for this. No special xml-parsing libraries will be needed in this case. However, this approach will work only in simple cases. Otherwise, you will need to use libraries for parsing xml-documents. You can use xml package -- which is the part of Python, or install lxml (for example).
Here is minimal working example:

import pandas as pd
import xml
 
df = pd.DataFrame({"yourColumn": ["""<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application> """, """<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>"""]}) 

def xml_processor(s): 
    el = xml.dom.minidom.parseString(s) 
    tag = el.getElemntByTagName("InsuredSignatureOK")[0] 
    return tag.childNodes[0].data 

df.yourColumn = df.yourColumn.apply(xml_processor)
Note, xml_processor I just wrote is very specific, and you probably will need to write your own and use
try/except blocks to handle cases when data/xml-string is corrupted (or has unexpected structure).
Thanks for your reply! I ended up finding a simpler approach and thought I would share for anyone dealing with XML string (although it does use a loop):

import pandas as pd
import xml.etree.ElementTree as ET

#establish dataframe
df = pd.DataFrame(myTable)

for x, row in df.iterrows() :
    myroot = ET.fromstring(row['myColumn']
    for InsuredSignatureOK in myroot.iter('InsuredSignatureOK') :
        print(InsuredSignatureOK.text)