Python Forum

Full Version: read individual nodes from an xml url using pandas
I am trying to read an XML file and access one specific attribute, in this case the DisplayName attribute, and use it to create a dataframe in Pandas. So far I've tried the following code:

import xml.etree.ElementTree as et
import pandas as pd

xtree = et.parse("XMLdata.xml")
xroot = xtree.getroot()

df_cols = ["DAF"]
df_rows = []
for node in xroot:
    is_DAF = node.find("DisplayName").text
    df_rows.append({"DAF":is_DAF})
out_df = pd.DataFrame(df_rows, columns = df_cols)
out_df
but I'm getting this error message:
AttributeError: 'NoneType' object has no attribute 'text'
I've tried replacing
node.find("DisplayName").text
with
node.attrib.get("DisplayName")
which doesn't return any errors but the dataframe contains "None" when the value should be "jalynne.k.archibald" as you can see from the sample XML file I attached below.

I also had another related question: currently I am doing this via an XML file stored on my computer, but how could I do this with an XML URL? like: https://s3.amazonaws.com/irs-form-990/20...public.xml

I appreciate any feedback and alternative suggestions anybody can provide. Thank you!
I don't see the XML file you're trying to parse..
(Jul-05-2020, 04:31 AM) bowlofred wrote: I don't see the XML file you're trying to parse..
Sorry, here is a small section of it:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>irs-form-990</Name>
<Prefix/>
<Marker/>
<MaxKeys>1000</MaxKeys>
<IsTruncated>true</IsTruncated>
<Contents>
<Key>200931393493000150_public.xml</Key>
<LastModified>2016-03-22T05:43:12.000Z</LastModified>
<ETag>"20b9640ea83d4b6838aee1e03187a15c"</ETag>
<Size>44204</Size>
<Owner>
<ID>0ae4c12b9d736edf3dd39976339f7099d887b890b4480029dd57e97b00d7070a</ID>
<DisplayName>jalynne.k.archibald</DisplayName>
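A note on the sample above: both attempts in the first post likely return None for the same reasons. DisplayName is a child element rather than an attribute, it sits under Contents/Owner rather than directly under the root, and every tag inherits the default namespace declared on ListBucketResult. Below is a minimal sketch of a namespace-aware lookup. The closing tags that the truncated sample lacks are filled in as an assumption, and the `s3` prefix is only a local alias chosen for this example.

```python
import xml.etree.ElementTree as et

# Completed copy of the sample; the closing tags are assumed for illustration.
SAMPLE = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>irs-form-990</Name>
  <Contents>
    <Key>200931393493000150_public.xml</Key>
    <Owner>
      <ID>0ae4c12b9d736edf3dd39976339f7099d887b890b4480029dd57e97b00d7070a</ID>
      <DisplayName>jalynne.k.archibald</DisplayName>
    </Owner>
  </Contents>
</ListBucketResult>"""

# Map a local prefix (hypothetical name "s3") to the document's default namespace.
NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

xroot = et.fromstring(SAMPLE)

# Without the namespace, the lookup finds nothing, hence the NoneType error.
print(xroot.find("DisplayName"))

# Walk Contents -> Owner and read the text of the DisplayName child element.
names = [
    owner.findtext("s3:DisplayName", namespaces=NS)
    for owner in xroot.findall("s3:Contents/s3:Owner", NS)
]
df_rows = [{"DAF": name} for name in names]
print(df_rows)
```

The resulting df_rows list can be passed to pd.DataFrame(df_rows, columns=df_cols) exactly as in the original code.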
That's not a valid XML file. Can you provide a valid XML file that this should work on? I looked at your link to an XML in the first post, but that one doesn't have the same structure (like owner/ID/DisplayName).
(Jul-05-2020, 05:57 AM) bowlofred wrote: That's not a valid XML file. Can you provide a valid XML file that this should work on? I looked at your link to an XML in the first post, but that one doesn't have the same structure (like owner/ID/DisplayName).
I was just using the other one as an example. Let's say I use the link I provided, because that's the one I actually want to use. How would I retrieve the value for <DonorAdvisedFundInd>? And how do I access it as a link, because I only know how to access it if it is a file stored in my computer.


Edit: I actually figured out how to fetch the XML from a link, so my updated code is below, but I'm still getting an error message: Errno 36 File name too long
import xml.etree.ElementTree as et
import pandas as pd
import requests

xml_data = requests.get("https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml").content

xtree = et.parse(xml_data)
xroot = xtree.getroot()

df_cols = ["DAF"]
df_rows = []
for node in xroot:
    is_DAF = node.attrib.get("DonorAdvisedFundInd")
    df_rows.append({"DAF":is_DAF})
out_df = pd.DataFrame(df_rows, columns=df_cols)
out_df
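et.parse() expects a filename or a file object, so it is now treating the whole downloaded content as a filename. You can wrap the content in io.StringIO() (or io.BytesIO(), since .content is bytes) and pass that file object to et.parse().
That does not work here either, because ElementTree is too strict and fails with a "not well-formed" error on that XML.
As I have mentioned many times before, the XML/HTML libraries in the standard library are not that good; I never use them.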
import requests
from bs4 import BeautifulSoup

url = 'https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
donor = soup.find('DonorAdvisedFundInd')
>>> donor
<DonorAdvisedFundInd referenceDocumentId="RetDoc1040000001">0</DonorAdvisedFundInd>
>>> donor.text
'0'
>>> donor.get('referenceDocumentId')
'RetDoc1040000001'
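For comparison, the file-object approach mentioned in the reply above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the bytes below are a tiny stand-in for requests.get(url).content, the namespace and wrapper tags are guesses at the real document's layout, and local_name is a hypothetical helper. Whether the real file parses cleanly this way is not verified here.

```python
import io
import xml.etree.ElementTree as et

# Stand-in for requests.get(url).content, which is a bytes object.
# The namespace and the wrapper tags are assumptions for illustration.
xml_data = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<Return xmlns="http://www.irs.gov/efile">'
    b'<ReturnData><IRS990>'
    b'<DonorAdvisedFundInd referenceDocumentId="RetDoc1040000001">0</DonorAdvisedFundInd>'
    b'</IRS990></ReturnData></Return>'
)

# et.parse() wants a filename or file object, so wrap the bytes in BytesIO
# instead of passing them directly (which triggers "File name too long").
xtree = et.parse(io.BytesIO(xml_data))
xroot = xtree.getroot()


def local_name(tag):
    """Hypothetical helper: strip a '{namespace}' prefix from a tag."""
    return tag.rsplit("}", 1)[-1]


# DonorAdvisedFundInd is a nested element, not an attribute of a top-level
# node, so search the whole tree and read its text.
df_rows = [
    {"DAF": node.text}
    for node in xroot.iter()
    if local_name(node.tag) == "DonorAdvisedFundInd"
]
print(df_rows)
```

As before, df_rows can then be handed to pd.DataFrame(df_rows, columns=df_cols). Matching on the local tag name avoids hard-coding the namespace, at the cost of ignoring it entirely.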