 Parse XML String in Pandas Dataframe
#1
Here is my situation:

I have a pandas dataframe that contains one column with an xml string for each row. I need to be able to parse the xml string for each row to see the data elements of the xml file. All the code I have been able to find is code to parse an actual xml file. I do not have the xml file, rather just the xml string (below is an example):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>

How could I return the 'Yes' if I wanted to see InsuredSignatureOK? My only thought was to use a loop, but I have heard that is not the best way to go about it for large dataframes. I have never worked with XML before and am newish to Python, so any help is greatly appreciated!
#2
I'm not sure there is a more efficient way to do this than using a loop. First, you need to define a processor: a function that takes an XML string and returns the value you want (extracting some value(s) from the XML string, converting them, etc.).

def xml_processor(xml_string): 
    # do processing
    return "The value you want"
There are different ways to write such a function. If the XML string has a relatively simple structure, you can try to build a regular expression that does the work. For example, if you want to extract the text within the tag "InsuredSignatureOK" ('Yes' in the example above), you can define a regular expression for it. No special XML-parsing libraries are needed in this case. However, this approach only works in simple cases. Otherwise, you will need a library for parsing XML documents. You can use the xml package, which is part of the Python standard library, or install lxml (for example).
Here is a minimal working example:
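For the simple case, a regular-expression version might look like the sketch below. It assumes the tag carries no attributes and appears at most once per string; the function name extract_with_regex is made up for illustration.

```python
import re

# Assumption: the tag has no attributes and occurs at most once per string.
PATTERN = re.compile(r"<InsuredSignatureOK>(.*?)</InsuredSignatureOK>", re.DOTALL)

def extract_with_regex(xml_string):
    """Return the text inside <InsuredSignatureOK>, or None if the tag is absent."""
    match = PATTERN.search(xml_string)
    return match.group(1) if match else None

sample = ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
          '<ns2:application xmlns:ns2="http://www.abc.com/rules/">'
          '<InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>')

print(extract_with_regex(sample))  # prints: Yes
```

This breaks as soon as the tag gains attributes, is nested in an unexpected way, or contains escaped characters, which is why a real parser is the safer choice.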

import pandas as pd
import xml.dom.minidom
 
df = pd.DataFrame({"yourColumn": ["""<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application> """, """<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>"""]}) 

def xml_processor(s): 
    el = xml.dom.minidom.parseString(s) 
    tag = el.getElementsByTagName("InsuredSignatureOK")[0]
    return tag.childNodes[0].data 

df.yourColumn = df.yourColumn.apply(xml_processor)
Note that the xml_processor I just wrote is very specific; you will probably need to write your own and use try/except blocks to handle cases where the data/XML string is corrupted (or has an unexpected structure).
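As a rough sketch of that error handling: the version below returns None instead of raising when a string cannot be parsed or the tag is missing. The name safe_xml_processor, the tag_name parameter and the choice of None as the fallback value are assumptions for illustration, not requirements.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

def safe_xml_processor(s, tag_name="InsuredSignatureOK"):
    """Return the text of the first <tag_name> element, or None on any problem."""
    try:
        doc = xml.dom.minidom.parseString(s)
        nodes = doc.getElementsByTagName(tag_name)
        return nodes[0].firstChild.data
    except ExpatError:
        # The string is not well-formed XML.
        return None
    except (IndexError, AttributeError, TypeError):
        # Tag missing, tag empty, or the cell is not a string (e.g. None/NaN).
        return None
```

It can be passed to apply in the same way, e.g. df.yourColumn.apply(safe_xml_processor), and the rows that failed can then be found by looking for missing values.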
#3
Thanks for your reply! I ended up finding a simpler approach and thought I would share it for anyone dealing with XML strings (although it does use a loop):

import pandas as pd
import xml.etree.ElementTree as ET

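# establish dataframe (myTable is a placeholder for your own data)
df = pd.DataFrame(myTable)

# parse each row's XML string and print every InsuredSignatureOK value
for x, row in df.iterrows():
    myroot = ET.fromstring(row['myColumn'])
    for InsuredSignatureOK in myroot.iter('InsuredSignatureOK'):
        print(InsuredSignatureOK.text)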
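If the values are needed in the dataframe rather than printed, the same ElementTree idea can be combined with apply. This is only a sketch: the sample data, the helper name signature_ok and the new column name are made up for illustration.

```python
import pandas as pd
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the real table.
TEMPLATE = ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
            '<ns2:application xmlns:ns2="http://www.abc.com/rules/">'
            '<InsuredSignatureOK>{}</InsuredSignatureOK></ns2:application>')
df = pd.DataFrame({"myColumn": [TEMPLATE.format("Yes"), TEMPLATE.format("No")]})

def signature_ok(xml_string):
    """Return the text of the first InsuredSignatureOK element, or None if absent."""
    root = ET.fromstring(xml_string)
    return root.findtext(".//InsuredSignatureOK")

# Store the extracted value in a new column instead of printing it.
df["InsuredSignatureOK"] = df["myColumn"].apply(signature_ok)
print(df["InsuredSignatureOK"].tolist())  # prints: ['Yes', 'No']
```

apply still calls the function once per row, so this is mainly tidier than iterrows rather than a different algorithm.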