Parse XML String in Pandas Dataframe

**scidam** · Dec-05-2019, 12:50 AM

I'm not sure there is a more efficient way to do this than using a loop. First, you need to define a processor: a function that consumes an XML string and returns the value you want (extracting some value(s) from the string, converting them, etc.).

def xml_processor(xml_string): 
    # do processing
    return "The value what you want"

There are different ways to write such a function. If the XML string has a relatively simple structure, you can try to build a regular expression that does the work. For example, if you want to extract the text within the tag "InsuredSignatureOK" ('Yes' in the example above), you can define a regular expression for it. No special XML-parsing libraries are needed in this case. However, this approach only works in simple cases. Otherwise, you will need a library for parsing XML documents. You can use the xml package, which is part of Python's standard library, or install lxml (for example).
Here is a minimal working example:

import pandas as pd
import xml.dom.minidom  # a bare "import xml" does not load the minidom submodule
 
df = pd.DataFrame({"yourColumn": ["""<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application> """, """<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:application xmlns:ns2="http://www.abc.com/rules/"><InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>"""]}) 

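def xml_processor(s):
    # Parse the string into a DOM document
    el = xml.dom.minidom.parseString(s)
    # Take the first element with the given tag name
    tag = el.getElementsByTagName("InsuredSignatureOK")[0]
    # Return the text stored in its first child node
    return tag.childNodes[0].data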

df.yourColumn = df.yourColumn.apply(xml_processor)
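
For the simple case mentioned earlier, the regular-expression route could look roughly like the sketch below. The pattern and the names TAG_RE and regex_processor are my own illustration, and it assumes the tag has no attributes and is not nested inside itself:

```python
import re

import pandas as pd

# Hypothetical pattern: captures whatever sits between the opening and
# closing InsuredSignatureOK tags (non-greedy, may span several lines).
TAG_RE = re.compile(r"<InsuredSignatureOK>(.*?)</InsuredSignatureOK>", re.DOTALL)


def regex_processor(xml_string):
    """Return the text inside the tag, or None if the tag is not found."""
    match = TAG_RE.search(xml_string)
    return match.group(1) if match else None


sample = pd.Series([
    "<a><InsuredSignatureOK>Yes</InsuredSignatureOK></a>",
    "<a><Other>1</Other></a>",  # tag missing -> None
])
result = sample.apply(regex_processor)
```

This never checks that the string is well-formed XML, so it is only worth using when you trust the input format.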

Note that the xml_processor I just wrote is very specific; you will probably need to write your own and use try/except blocks to handle cases where the XML string is corrupted (or has an unexpected structure).
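
A more defensive version might look something like this. It is only a sketch: the name safe_xml_processor and the tag/default parameters are made up for illustration, and returning a default value instead of raising is just one possible choice.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

import pandas as pd


def safe_xml_processor(xml_string, tag="InsuredSignatureOK", default=None):
    """Return the text of the first `tag` element, or `default` on any problem."""
    try:
        doc = xml.dom.minidom.parseString(xml_string)
    except (ExpatError, TypeError):
        # ExpatError: malformed XML; TypeError: value is not a string (e.g. None)
        return default
    nodes = doc.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        # Tag is missing, or present but empty
        return default
    return nodes[0].firstChild.data


df = pd.DataFrame({"yourColumn": [
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<ns2:application xmlns:ns2="http://www.abc.com/rules/">'
    "<InsuredSignatureOK>Yes</InsuredSignatureOK></ns2:application>",
    "<broken",        # corrupted XML
    "<a><b/></a>",    # valid XML, but the tag is absent
]})
df["signature_ok"] = df["yourColumn"].apply(safe_xml_processor)
```

Writing the result to a new column (signature_ok here, also a made-up name) keeps the original XML around, which helps when you need to inspect the rows that failed.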
