Python Forum
read individual nodes from an xml url using pandas
#1
I am trying to read an XML file and access one specific attribute, in this case the DisplayName attribute, and use it to create a dataframe in Pandas. So far I've tried the following code:

import pandas as pd
import xml.etree.ElementTree as et

xtree = et.parse("XMLdata.xml")
xroot = xtree.getroot()

df_cols = ["DAF"]
df_rows = []
for node in xroot:
    is_DAF = node.find("DisplayName").text
    df_rows.append({"DAF": is_DAF})
out_df = pd.DataFrame(df_rows, columns=df_cols)
out_df
but I'm getting this error message:
AttributeError: 'NoneType' object has no attribute 'text'
I've tried replacing
node.find("DisplayName").text
with
node.attrib.get("DisplayName")
which doesn't return any errors but the dataframe contains "None" when the value should be "jalynne.k.archibald" as you can see from the sample XML file I attached below.

I also had another related question: currently I am doing this via an XML file stored on my computer, but how could I do this with an XML URL? like: https://s3.amazonaws.com/irs-form-990/20...public.xml

I appreciate any feedback and alternative suggestions anybody can provide. Thank you!
Reply
#2
I don't see the XML file you're trying to parse..
Reply
#3
(Jul-05-2020, 04:31 AM)bowlofred Wrote: I don't see the XML file you're trying to parse..
Sorry, here is a small section of it:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>irs-form-990</Name>
<Prefix/>
<Marker/>
<MaxKeys>1000</MaxKeys>
<IsTruncated>true</IsTruncated>
<Contents>
<Key>200931393493000150_public.xml</Key>
<LastModified>2016-03-22T05:43:12.000Z</LastModified>
<ETag>"20b9640ea83d4b6838aee1e03187a15c"</ETag>
<Size>44204</Size>
<Owner>
<ID>0ae4c12b9d736edf3dd39976339f7099d887b890b4480029dd57e97b00d7070a</ID>
<DisplayName>jalynne.k.archibald</DisplayName>
Reply
#4
That's not a valid XML file. Can you provide a valid XML file that this should work on? I looked at your link to an XML in the first post, but that one doesn't have the same structure (like owner/ID/DisplayName).
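For a complete listing with the structure shown in the sample, the AttributeError from the first post has two likely causes: the root declares a default namespace, so the element's real tag is "{http://s3.amazonaws.com/doc/2006-03-01/}DisplayName", and DisplayName sits under Contents/Owner rather than directly under the root's children. It is also an element, not an attribute, which is why attrib.get() returns None. Here is a minimal sketch, assuming the truncated sample is closed by hand (the closing tags and the trimmed fields are my assumption, not the poster's file):

```python
import xml.etree.ElementTree as ET

# Assumption: the sample from the thread, trimmed and closed by hand
# so that it is well-formed.
SAMPLE = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>irs-form-990</Name>
<Contents>
<Key>200931393493000150_public.xml</Key>
<Owner>
<DisplayName>jalynne.k.archibald</DisplayName>
</Owner>
</Contents>
</ListBucketResult>"""

# Map a prefix of our choosing to the document's default namespace.
NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

root = ET.fromstring(SAMPLE)

# One row per <Contents> entry; findtext() returns None if a path is missing
# instead of raising AttributeError.
df_rows = []
for contents in root.findall("s3:Contents", NS):
    df_rows.append({
        "Key": contents.findtext("s3:Key", namespaces=NS),
        "DisplayName": contents.findtext("s3:Owner/s3:DisplayName", namespaces=NS),
    })

print(df_rows)
```

The resulting list of dicts can be handed to pd.DataFrame(df_rows, columns=["Key", "DisplayName"]) in the same way as in the first post.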
Reply
#5
(Jul-05-2020, 05:57 AM)bowlofred Wrote: That's not a valid XML file. Can you provide a valid XML file that this should work on? I looked at your link to an XML in the first post, but that one doesn't have the same structure (like owner/ID/DisplayName).
I was just using the other one as an example. Let's say I use the link I provided, because that's the one I actually want to use. How would I retrieve the value for <DonorAdvisedFundInd>? And how do I access it as a link, because I only know how to access it if it is a file stored in my computer.


Edit: I figured out how to access a link, and my updated code is below, but now I'm getting this error message: Errno 36 File name too long
import xml.etree.ElementTree as et
import pandas as pd
import requests

xml_data = requests.get("https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml").content

xtree = et.parse(xml_data)
xroot = xtree.getroot()

df_cols = ["DAF"]
df_rows = []
for node in xroot:
    is_DAF = node.attrib.get("DonorAdvisedFundInd")
    df_rows.append({"DAF":is_DAF})
out_df = pd.DataFrame(df_rows, columns=df_cols)
out_df
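The Errno 36 comes from et.parse(), which expects a file name or a file object; given the downloaded bytes, it tries to open them as a path. A minimal sketch of one way around it, using et.fromstring() on the bytes instead. It swaps requests for the standard library's urllib so it has no third-party dependency, and load_xml_root is a hypothetical helper name, not something from the thread:

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def load_xml_root(url):
    """Download an XML document and return its root element."""
    with urlopen(url) as response:
        xml_bytes = response.read()  # raw bytes, like requests' .content
    # fromstring() parses the data itself; parse() would treat it as a path.
    return ET.fromstring(xml_bytes)


# Usage with the URL from the post above (needs network access):
# xroot = load_xml_root(
#     "https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml"
# )
```

With requests, the equivalent would be et.fromstring(requests.get(url).content). Whether the real file then parses without complaint is not checked here.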
Reply
#6
et.parse() now treats the entire downloaded content as a file name. You could wrap it in io.StringIO() and pass that file object to et.parse() instead.
That does not work here either, because ElementTree is too strict when validating this XML and fails with a "not well-formed" error.
As I have mentioned many times before, the standard library's XML/HTML libraries are not so good, so I never use them.
import requests
from bs4 import BeautifulSoup

url = 'https://s3.amazonaws.com/irs-form-990/201903199349320465_public.xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
donor = soup.find('DonorAdvisedFundInd')
>>> donor
<DonorAdvisedFundInd referenceDocumentId="RetDoc1040000001">0</DonorAdvisedFundInd>
>>> donor.text
'0'
>>> donor.get('referenceDocumentId')
'RetDoc1040000001'
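For comparison, here is a rough standard-library sketch of the same lookup. Since response.content is bytes, the matching wrapper is io.BytesIO (io.StringIO only accepts str). The XML below is a hypothetical stand-in for the download: only the DonorAdvisedFundInd line is taken from the output above, while the namespace and the surrounding elements are assumptions. I have not checked whether the real IRS file parses cleanly with ElementTree.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for response.content; namespace and nesting are assumed.
xml_bytes = b"""<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile">
  <ReturnData>
    <IRS990>
      <DonorAdvisedFundInd referenceDocumentId="RetDoc1040000001">0</DonorAdvisedFundInd>
    </IRS990>
  </ReturnData>
</Return>"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


# Wrap the bytes in a file object so parse() does not treat them as a path.
root = ET.parse(io.BytesIO(xml_bytes)).getroot()

# iter() walks the whole tree, so the nesting depth does not matter.
df_rows = [
    {"DAF": elem.text, "referenceDocumentId": elem.get("referenceDocumentId")}
    for elem in root.iter()
    if local_name(elem.tag) == "DonorAdvisedFundInd"
]
print(df_rows)
```

Matching on the local name avoids hard-coding the namespace, and df_rows can go straight into pd.DataFrame(df_rows) as in the earlier posts.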
Reply

