Parse XML String in Pandas Dataframe - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: Data Science (https://python-forum.io/forum-44.html) +--- Thread: Parse XML String in Pandas Dataframe (/thread-22948.html) |
Parse XML String in Pandas Dataframe - creedX - Dec-04-2019 Here is my situation: I have a pandas dataframe that contains one column with an xml string for each row. I need to be able to parse the xml string for each row to see the data elements of the xml file. All the code I have been able to find is code to parse an actual xml file. I do not have the xml file, rather just the xml string (below is an example): <?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application> How could I return the 'Yes' if I wanted to see InsuredSignatureOK? My only thought was using a loop but I heard that is not the best way to go about it for large dataframes. I have never worked with xml before and am newish to python, so any help is greatly appreciated! RE: Parse XML String in Pandas Dataframe - scidam - Dec-05-2019 Not sure, that there is more efficient way to do this, rather than using a loop; First, you need to define a processor, a function which consumes an xml-string and returns a value what you want (extract some value(s) from xml-string, convert them etc.). def xml_processor(xml_string): # do processing return "The value what you want"There are different ways to write such a function. If xml-string has relatively simple structure, you can try to build a regular expression which do the work. For example, if you want to extract text within tag "InsuredSignatureOK" ('Yes' in the example above), you can define a regular expression for this. No special xml-parsing libraries will be needed in this case. However, this approach will work only in simple cases. Otherwise, you will need to use libraries for parsing xml-documents. You can use xml package -- which is the part of Python, or install lxml (for example).Here is minimal working example: import pandas as pd import xml df = pd.DataFrame({"yourColumn": ["""<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application> """, """<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>"""]}) def xml_processor(s): el = xml.dom.minidom.parseString(s) tag = el.getElemntByTagName("InsuredSignatureOK")[0] return tag.childNodes[0].data df.yourColumn = df.yourColumn.apply(xml_processor)Note, xml_processor I just wrote is very specific, and you probably will need to write your own and usetry/except blocks to handle cases when data/xml-string is corrupted (or has unexpected structure). RE: Parse XML String in Pandas Dataframe - creedX - Dec-09-2019 Thanks for your reply! I ended up finding a simpler approach and thought I would share for anyone dealing with XML string (although it does use a loop): import pandas as pd import xml.etree.ElementTree as ET #establish dataframe df = pd.DataFrame(myTable) for x, row in df.iterrows() : myroot = ET.fromstring(row['myColumn'] for InsuredSignatureOK in myroot.iter('InsuredSignatureOK') : print(InsuredSignatureOK.text) |