read individual nodes from an xml url using pandas

mattkaplan27 · Jul-05-2020, 12:49 AM

I am trying to read an XML file and access one specific attribute, in this case the DisplayName attribute, and use it to create a dataframe in Pandas. So far I've tried the following code:

import xml.etree.ElementTree as et
import pandas as pd

xtree = et.parse("XMLdata.xml")
xroot = xtree.getroot()

df_cols = ["DAF"]
df_rows = []
for node in xroot:
    is_DAF = node.find("DisplayName").text
    df_rows.append({"DAF": is_DAF})
out_df = pd.DataFrame(df_rows, columns=df_cols)
out_df

but I'm getting this error message:

AttributeError: 'NoneType' object has no attribute 'text'

I've tried replacing

node.find("DisplayName").text

with

node.attrib.get("DisplayName")

which doesn't return any errors but the dataframe contains "None" when the value should be "jalynne.k.archibald" as you can see from the sample XML file I attached below.

I also had another related question: currently I am doing this via an XML file stored on my computer, but how could I do this with an XML URL? like: https://s3.amazonaws.com/irs-form-990/20...public.xml

I appreciate any feedback and alternative suggestions anybody can provide. Thank you!

bowlofred · Jul-05-2020, 04:31 AM

I don't see the XML file you're trying to parse..

mattkaplan27 · Jul-05-2020, 04:43 AM

(Jul-05-2020, 04:31 AM)bowlofred Wrote: I don't see the XML file you're trying to parse..

Sorry, here is a small section of it:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>irs-form-990</Name>
<Prefix/>
<Marker/>
<MaxKeys>1000</MaxKeys>
<IsTruncated>true</IsTruncated>
<Contents>
<Key>200931393493000150_public.xml</Key>
<LastModified>2016-03-22T05:43:12.000Z</LastModified>
<ETag>"20b9640ea83d4b6838aee1e03187a15c"</ETag>
<Size>44204</Size>
<Owner>
<ID>0ae4c12b9d736edf3dd39976339f7099d887b890b4480029dd57e97b00d7070a</ID>
<DisplayName>jalynne.k.archibald</DisplayName>

bowlofred · Jul-05-2020, 05:57 AM

That's not a valid XML file. Can you provide a valid XML file that this should work on? I looked at your link to an XML in the first post, but that one doesn't have the same structure (like owner/ID/DisplayName).

mattkaplan27 · (This post was last modified: Jul-05-2020, 06:11 AM by mattkaplan27.)

(Jul-05-2020, 05:57 AM)bowlofred Wrote: That's not a valid XML file. Can you provide a valid XML file that this should work on? I looked at your link to an XML in the first post, but that one doesn't have the same structure (like owner/ID/DisplayName).

I was just using the other one as an example. Let's say I use the link I provided, because that's the one I actually want to use. How would I retrieve the value for <DonorAdvisedFundInd>? And how do I access it as a link, because I only know how to access it if it is a file stored in my computer.

Edit: I figured out how to fetch the XML from a link, so my updated code is below, but I'm still getting an error message: Errno 36 File name too long

import xml.etree.ElementTree as et
import pandas as pd
import requests

xml_data = requests.get("https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml").content

xtree = et.parse(xml_data)
xroot = xtree.getroot()

df_cols = ["DAF"]
df_rows = []
for node in xroot:
    is_DAF = node.attrib.get("DonorAdvisedFundInd")
    df_rows.append({"DAF":is_DAF})
out_df = pd.DataFrame(df_rows, columns=df_cols)
out_df

snippsat · (This post was last modified: Jul-05-2020, 10:07 PM by snippsat.)

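et.parse() expects a file name or a file object, so it is now treating the entire downloaded content as a file name. You can wrap the content in io.StringIO() and pass that file object to et.parse().
That does not work here either, because ElementTree is strict and fails with a "not well-formed" error on this XML.
As I have mentioned many times before, the XML/HTML libraries in the standard library are not so good; I never use them.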

import requests
from bs4 import BeautifulSoup

url = 'https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
donor = soup.find('DonorAdvisedFundInd')

>>> donor
<DonorAdvisedFundInd referenceDocumentId="RetDoc1040000001">0</DonorAdvisedFundInd>
>>> donor.text
'0'
>>> donor.get('referenceDocumentId')
'RetDoc1040000001'
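
For the earlier DisplayName question, a likely cause of the 'NoneType' error is the default namespace declared on the root element (xmlns="http://s3.amazonaws.com/doc/2006-03-01/"): ElementTree only matches a tag when the namespace is included in the search. Below is a minimal sketch, not a tested fix for the real files. It assumes a small, well-formed sample (the fragment posted above, with the closing tags added and the ID shortened), and the names SAMPLE, NS and display_name_rows are made up for illustration. It also uses fromstring(), which parses data already held in memory, instead of parse(), which wants a file name or file object.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the fragment from the thread, closed off so it is well-formed.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>irs-form-990</Name>
  <Contents>
    <Key>200931393493000150_public.xml</Key>
    <Owner>
      <ID>0ae4</ID>
      <DisplayName>jalynne.k.archibald</DisplayName>
    </Owner>
  </Contents>
</ListBucketResult>"""

# Map an arbitrary prefix to the document's default namespace.
NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def display_name_rows(xml_bytes):
    """Return one {"DAF": name} dict per <DisplayName> element."""
    # fromstring() takes the XML content itself; parse() would treat
    # the same bytes as a path, which leads to "File name too long".
    root = ET.fromstring(xml_bytes)
    # ".//" searches all descendants, not just direct children of the root.
    return [{"DAF": el.text} for el in root.iterfind(".//s3:DisplayName", NS)]


rows = display_name_rows(SAMPLE)
print(rows)
```

The resulting list of dicts can be handed to pd.DataFrame(rows, columns=["DAF"]) exactly as in the original code. With requests, response.content could be passed in place of SAMPLE, provided the downloaded document is well-formed.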
