Bottom Page

Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
 Parse XML String in Pandas Dataframe
Here is my situation:

I have a pandas dataframe that contains one column with an xml string for each row. I need to be able to parse the xml string for each row to see the data elements of the xml file. All the code I have been able to find is code to parse an actual xml file. I do not have the xml file, rather just the xml string (below is an example):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2=""><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>

How could I return the 'Yes' if I wanted to see InsuredSignatureOK? My only thought was using a loop but I heard that is not the best way to go about it for large dataframes. I have never worked with xml before and am newish to python, so any help is greatly appreciated! Smile
Not sure, that there is more efficient way to do this, rather than using a loop; First, you need to define a processor, a function which consumes an xml-string and returns a value what you want (extract some value(s) from xml-string, convert them etc.).

def xml_processor(xml_string): 
    # do processing
    return "The value what you want"  
There are different ways to write such a function. If xml-string has relatively simple structure, you can try to build a regular expression which do the work. For example, if you want to extract text within tag "InsuredSignatureOK" ('Yes' in the example above), you can define a regular expression for this. No special xml-parsing libraries will be needed in this case. However, this approach will work only in simple cases. Otherwise, you will need to use libraries for parsing xml-documents. You can use xml package -- which is the part of Python, or install lxml (for example).
Here is minimal working example:

import pandas as pd
import xml
df = pd.DataFrame({"yourColumn": ["""<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2=""><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application> """, """<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2=""><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>"""]}) 

def xml_processor(s): 
    el = xml.dom.minidom.parseString(s) 
    tag = el.getElemntByTagName("InsuredSignatureOK")[0] 
    return tag.childNodes[0].data 

df.yourColumn = df.yourColumn.apply(xml_processor)
Note, xml_processor I just wrote is very specific, and you probably will need to write your own and use
try/except blocks to handle cases when data/xml-string is corrupted (or has unexpected structure).
Thanks for your reply! I ended up finding a simpler approach and thought I would share for anyone dealing with XML string (although it does use a loop):

import pandas as pd
import xml.etree.ElementTree as ET

#establish dataframe
df = pd.DataFrame(myTable)

for x, row in df.iterrows() :
    myroot = ET.fromstring(row['myColumn']
    for InsuredSignatureOK in myroot.iter('InsuredSignatureOK') :

Top Page

Possibly Related Threads...
Thread Author Replies Views Last Post
  Pandas dataframe to join three tables using like condition among them sandeep_ganga 0 159 Nov-29-2019, 08:30 AM
Last Post: sandeep_ganga
  Pandas Dataframe to Google Big Query Ecniv 2 730 Nov-21-2019, 02:26 PM
Last Post: Ecniv
  manipulating a dataframe - pandas nsx200 2 162 Nov-14-2019, 10:38 AM
Last Post: nsx200
  Pandas dataframe columns collapsed in Spyder when printing UniKlixX 2 139 Nov-04-2019, 07:00 AM
Last Post: UniKlixX
  pandas dataframe iloc mystery edvvardbrian 2 210 Oct-29-2019, 02:55 PM
Last Post: jefsummers
  How to add a few empty rows into a pandas dataframe python_newbie09 2 762 Sep-20-2019, 08:52 AM
Last Post: python_newbie09
  Dropping a column from pandas dataframe marco_ita 6 1,010 Sep-07-2019, 08:36 AM
Last Post: marco_ita
  created a pandas series instead of pandas DataFrame ibaad1406 6 675 Sep-06-2019, 06:23 AM
Last Post: ibaad1406
  Applying operation to a pandas multi index dataframe subgroup Nuovoq 1 401 Sep-04-2019, 10:04 PM
Last Post: Nuovoq
  Substr on Pandas Dataframe Scott 1 456 Sep-02-2019, 02:49 AM
Last Post: scidam

Forum Jump:

Users browsing this thread: 1 Guest(s)