Python Forum
read individual nodes from an xml url using pandas


read individual nodes from an xml url using pandas - mattkaplan27 - Jul-05-2020

I am trying to read an XML file and access one specific attribute, in this case the DisplayName attribute, and use it to create a dataframe in Pandas. So far I've tried the following code:

import xml.etree.ElementTree as et
import pandas as pd

xtree = et.parse("XMLdata.xml")
xroot = xtree.getroot()

df_cols = ["DAF"]
df_rows = []
for node in xroot:
    is_DAF = node.find("DisplayName").text
    df_rows.append({"DAF":is_DAF})
out_df = pd.DataFrame(df_rows, columns = df_cols)
out_df
but I'm getting this error message:
AttributeError: 'NoneType' object has no attribute 'text'
I've tried replacing
node.find("DisplayName").text
with
node.attrib.get("DisplayName")
which doesn't return any errors but the dataframe contains "None" when the value should be "jalynne.k.archibald" as you can see from the sample XML file I attached below.

I also have a related question: currently I am reading an XML file stored on my computer, but how could I do the same with an XML URL? For example: https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml

I appreciate any feedback and alternative suggestions anybody can provide. Thank you!


RE: read individual nodes from an xml url using pandas - bowlofred - Jul-05-2020

I don't see the XML file you're trying to parse.


RE: read individual nodes from an xml url using pandas - mattkaplan27 - Jul-05-2020

(Jul-05-2020, 04:31 AM)bowlofred Wrote: I don't see the XML file you're trying to parse.
Sorry, here is a small section of it:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>irs-form-990</Name>
<Prefix/>
<Marker/>
<MaxKeys>1000</MaxKeys>
<IsTruncated>true</IsTruncated>
<Contents>
<Key>200931393493000150_public.xml</Key>
<LastModified>2016-03-22T05:43:12.000Z</LastModified>
<ETag>"20b9640ea83d4b6838aee1e03187a15c"</ETag>
<Size>44204</Size>
<Owner>
<ID>0ae4c12b9d736edf3dd39976339f7099d887b890b4480029dd57e97b00d7070a</ID>
<DisplayName>jalynne.k.archibald</DisplayName>


RE: read individual nodes from an xml url using pandas - bowlofred - Jul-05-2020

That's not a valid XML file. Can you provide a valid XML file that this should work on? I looked at your link to an XML in the first post, but that one doesn't have the same structure (like owner/ID/DisplayName).
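
Assuming the fragment is completed with its closing tags (added by hand below, so treat them as a guess at how the real listing continues), the None result has a likely explanation: DisplayName is a child element rather than an attribute, it sits two levels below the nodes that "for node in xroot" visits, and the xmlns on the root puts every tag in a namespace. A minimal sketch with ElementTree, where the "s3" prefix is only a local alias chosen for this example:

```python
import xml.etree.ElementTree as et

# Fragment from the post; the closing tags are an assumption added by hand.
xml_text = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>irs-form-990</Name>
  <Contents>
    <Key>200931393493000150_public.xml</Key>
    <Owner>
      <ID>0ae4c12b9d736edf3dd39976339f7099d887b890b4480029dd57e97b00d7070a</ID>
      <DisplayName>jalynne.k.archibald</DisplayName>
    </Owner>
  </Contents>
</ListBucketResult>"""

root = et.fromstring(xml_text)

# A plain tag name only matches direct, un-namespaced children, so this is None.
missing = root.find("DisplayName")

# Map a local prefix to the namespace URI and search all descendants.
ns = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}
rows = [{"DisplayName": el.text} for el in root.findall(".//s3:DisplayName", ns)]
```

The rows list can then be handed to pd.DataFrame(rows) to get the dataframe.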


RE: read individual nodes from an xml url using pandas - mattkaplan27 - Jul-05-2020

(Jul-05-2020, 05:57 AM)bowlofred Wrote: That's not a valid XML file. Can you provide a valid XML file that this should work on? I looked at your link to an XML in the first post, but that one doesn't have the same structure (like owner/ID/DisplayName).
I was just using the other one as an example. Let's say I use the link I provided, because that's the one I actually want to use. How would I retrieve the value for <DonorAdvisedFundInd>? And how do I read it from a link? I only know how to read it when it is a file stored on my computer.


Edit: I actually figured out how to access a link, so my updated code is below, but I'm still getting an error message: Errno 36 File name too long
import xml.etree.ElementTree as et
import pandas as pd
import requests

xml_data = requests.get("https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml").content

xtree = et.parse(xml_data)
xroot = xtree.getroot()

df_cols = ["DAF"]
df_rows = []
for node in xroot:
    is_DAF = node.attrib.get("DonorAdvisedFundInd")
    df_rows.append({"DAF":is_DAF})
out_df = pd.DataFrame(df_rows, columns=df_cols)
out_df



RE: read individual nodes from an xml url using pandas - snippsat - Jul-05-2020

As written, et.parse() treats the whole downloaded content as a file name. You can use io.StringIO() and then pass the file object to et.parse().
That does not work here, though, because ElementTree is too strict and fails on this XML with a "not well-formed" error.
As I have mentioned many times before, the standard XML/HTML libraries are not so good; I never use them.
import requests
from bs4 import BeautifulSoup

url = 'https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml'
response = requests.get(url)
# The 'xml' parser requires the lxml package to be installed
soup = BeautifulSoup(response.content, 'xml')
donor = soup.find('DonorAdvisedFundInd')
>>> donor
<DonorAdvisedFundInd referenceDocumentId="RetDoc1040000001">0</DonorAdvisedFundInd>
>>> donor.text
'0'
>>> donor.get('referenceDocumentId')
'RetDoc1040000001'
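
For reference, the "File name too long" error comes from et.parse() expecting a path or a file object, and since response.content is bytes, io.BytesIO is the matching wrapper. The sketch below shows that in-memory route with ElementTree on a small hand-written stand-in for response.content. The sample bytes and their namespace are assumptions, not the real IRS filing, so this says nothing about whether ElementTree accepts the real file:

```python
import io
import xml.etree.ElementTree as et

# Hypothetical stand-in for response.content (bytes); not the real IRS filing.
xml_bytes = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<Return xmlns="http://www.irs.gov/efile">'
    b'<ReturnData>'
    b'<DonorAdvisedFundInd referenceDocumentId="RetDoc1040000001">0</DonorAdvisedFundInd>'
    b'</ReturnData>'
    b'</Return>'
)

# et.parse() wants a path or a file object, so wrap the bytes in BytesIO.
tree = et.parse(io.BytesIO(xml_bytes))
root = tree.getroot()

# "{*}" matches the tag in any namespace (supported since Python 3.8).
donor = root.find(".//{*}DonorAdvisedFundInd")

# One dict per row, ready to be passed to a DataFrame constructor.
rows = [{"DAF": donor.text, "referenceDocumentId": donor.get("referenceDocumentId")}]
```

Either this rows list or the values pulled out with BeautifulSoup above can be turned into a dataframe with pd.DataFrame(rows).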