Bottom Page

Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
 Parse XML String in Pandas Dataframe
#1
Here is my situation:

I have a pandas dataframe that contains one column with an xml string for each row. I need to be able to parse the xml string for each row to see the data elements of the xml file. All the code I have been able to find is code to parse an actual xml file. I do not have the xml file, rather just the xml string (below is an example):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>

How could I return the 'Yes' if I wanted to see InsuredSignatureOK? My only thought was using a loop but I heard that is not the best way to go about it for large dataframes. I have never worked with xml before and am newish to python, so any help is greatly appreciated! Smile
Quote
#2
Not sure, that there is more efficient way to do this, rather than using a loop; First, you need to define a processor, a function which consumes an xml-string and returns a value what you want (extract some value(s) from xml-string, convert them etc.).

def xml_processor(xml_string): 
    # do processing
    return "The value what you want"  
There are different ways to write such a function. If xml-string has relatively simple structure, you can try to build a regular expression which do the work. For example, if you want to extract text within tag "InsuredSignatureOK" ('Yes' in the example above), you can define a regular expression for this. No special xml-parsing libraries will be needed in this case. However, this approach will work only in simple cases. Otherwise, you will need to use libraries for parsing xml-documents. You can use xml package -- which is the part of Python, or install lxml (for example).
Here is minimal working example:

import pandas as pd
import xml
 
df = pd.DataFrame({"yourColumn": ["""<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application> """, """<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>"""]}) 

def xml_processor(s): 
    el = xml.dom.minidom.parseString(s) 
    tag = el.getElemntByTagName("InsuredSignatureOK")[0] 
    return tag.childNodes[0].data 

df.yourColumn = df.yourColumn.apply(xml_processor)
Note, xml_processor I just wrote is very specific, and you probably will need to write your own and use
try/except blocks to handle cases when data/xml-string is corrupted (or has unexpected structure).
Quote
#3
Thanks for your reply! I ended up finding a simpler approach and thought I would share for anyone dealing with XML string (although it does use a loop):

import pandas as pd
import xml.etree.ElementTree as ET

#establish dataframe
df = pd.DataFrame(myTable)

for x, row in df.iterrows() :
    myroot = ET.fromstring(row['myColumn']
    for InsuredSignatureOK in myroot.iter('InsuredSignatureOK') :
        print(InsuredSignatureOK.text)
Quote

Top Page

Possibly Related Threads...
Thread Author Replies Views Last Post
  strange error from pandas dataframe djf123 1 286 Jul-27-2020, 05:25 AM
Last Post: scidam
  Pandas DataFrame not updating HelpMePlease 3 276 Jul-11-2020, 07:19 PM
Last Post: jefsummers
  Pandas DataFrame visual Truman 8 344 Jul-10-2020, 06:11 AM
Last Post: hussainmujtaba
  Pandas DataFrame and unmatched column sritsv19 0 281 Jul-07-2020, 12:52 PM
Last Post: sritsv19
  Pandas DataFrame Concatenate problems Kristenl2784 1 207 Jul-01-2020, 01:28 AM
Last Post: hussainmujtaba
  Difference of two columns in Pandas dataframe zinho 2 510 Jun-17-2020, 03:36 PM
Last Post: zinho
  error bars with dataframe and pandas Hucky 4 433 Apr-27-2020, 02:02 AM
Last Post: Hucky
  Python Pandas DataFrame Help AmericanEagle1989 1 275 Apr-12-2020, 12:37 PM
Last Post: AmericanEagle1989
  How does pyplot know what was plotted by the output of pandas.DataFrame(...).cumprod( codeowl 2 305 Mar-28-2020, 08:27 AM
Last Post: j.crater
  Ordering of pandas DataFrame new_to_python 5 362 Mar-15-2020, 06:08 PM
Last Post: new_to_python

Forum Jump:


Users browsing this thread: 1 Guest(s)