Python Forum

Fetching and Parsing XML Data
Hello everyone,

I'm making this post to see whether there is a better way to do certain things in my code for fetching and parsing data from an XML API. For this process I'm using an Australian air quality XML feed: Australian XML Database for Air Quality. (The URL for the XML data is also in the code snippet below, in case you do not trust my hyperlink.)

First, here is the entire Python code snippet:
from bs4 import BeautifulSoup as soup
import requests

URL = 'https://environment.des.qld.gov.au/cgi-bin/air/xml.php?category=1&region=ALL'
raw_data = requests.get(URL, verify = True)

xml_soup = soup(raw_data.content, 'xml')

regions = []

South_East_Queensland = xml_soup.findAll('region', {'name': 'South East Queensland'})
South_West_Queensland = xml_soup.findAll('region', {'name': 'South West Queensland'})
Gladstone = xml_soup.findAll('region', {'name': 'Gladstone'})
Mackay = xml_soup.findAll('region', {'name': 'Mackay'})
Townsville = xml_soup.findAll('region', {'name': 'Townsville'})
Mount_Isa = xml_soup.findAll('region', {'name': 'Mount Isa'})

regions.append(South_East_Queensland[0])
regions.append(South_West_Queensland[0])
regions.append(Gladstone[0])
regions.append(Mackay[0])
regions.append(Townsville[0])
regions.append(Mount_Isa[0])

stations = []
for index in range(len(regions)):
    stations.append(regions[index].findAll('station'))

headers = 'Region, Station Name, Nitrogen Dioxide, Ozone, Sulfur Dioxide, Carbon Monoxide, Particle PM10, Particle PM2.5, Particles TSP, Visibility\n'

file_name = 'Air Quality.csv'

f = open(file_name, 'w')

f.write(headers)

index = 0
for region in regions:
    region_string = ''
    region_name = region['name'] + ','
    for station in stations[index]:
        station_string = ''
        station_name = station['name'] + ','
        nd = station.findAll('measurement', {'name': 'Nitrogen Dioxide'})
        nd = ',' if len(nd) == 0 else str(nd[0].text) + ','
        o = station.findAll('measurement', {'name': 'Ozone'})
        o = ',' if len(o) == 0 else str(o[0].text) + ','    
        sd = station.findAll('measurement', {'name': 'Sulfur Dioxide'})
        sd = ',' if len(sd) == 0 else str(sd[0].text) + ','    
        cm = station.findAll('measurement', {'name': 'Carbon Monoxide'})
        cm = ',' if len(cm) == 0 else str(cm[0].text) + ','    
        ppm10 = station.findAll('measurement', {'name': 'Particle PM10'})
        ppm10 = ',' if len(ppm10) == 0 else str(ppm10[0].text) + ','    
        ppm2 = station.findAll('measurement', {'name': 'Particle PM2.5'})
        ppm2 = ',' if len(ppm2) == 0 else str(ppm2[0].text) + ','    
        ptsp = station.findAll('measurement', {'name': 'Particles TSP'})
        ptsp = ',' if len(ptsp) == 0 else str(ptsp[0].text) + ','    
        v = station.findAll('measurement', {'name': 'Visibility'})
        v = '\n' if len(v) == 0 else str(v[0].text) + '\n'
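        # Write exactly one row per station. Appending to region_string with += and
        # writing it on every pass would repeat the earlier stations' rows.
        station_string = region_name + station_name + nd + o + sd + cm + ppm10 + ppm2 + ptsp + v
        f.write(station_string)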
    index += 1
f.close()
To begin, I'm not really concerned with classes and objects at this point, but if anyone has input there, I'm happy to hear it so I can improve my approach.

My biggest issue is probably with for loops and with building the lists of regional and station data.

Regarding the code snippet below: is there a better way to fetch the regional data and put it into a list, with each region in a known position so it can be accessed easily later, instead of adding each region to the list by hand? I initially thought a for loop might be useful, but I couldn't work out how to write it (I'm not that great at for loops yet, mainly because they are so different from those in other languages I've used, but I am learning!). Any input here would be great, thanks!
regions = []

South_East_Queensland = xml_soup.findAll('region', {'name': 'South East Queensland'})
South_West_Queensland = xml_soup.findAll('region', {'name': 'South West Queensland'})
Gladstone = xml_soup.findAll('region', {'name': 'Gladstone'})
Mackay = xml_soup.findAll('region', {'name': 'Mackay'})
Townsville = xml_soup.findAll('region', {'name': 'Townsville'})
Mount_Isa = xml_soup.findAll('region', {'name': 'Mount Isa'})

regions.append(South_East_Queensland[0])
regions.append(South_West_Queensland[0])
regions.append(Gladstone[0])
regions.append(Mackay[0])
regions.append(Townsville[0])
regions.append(Mount_Isa[0])

stations = []
for index in range(len(regions)):
    stations.append(regions[index].findAll('station'))
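One way the hand-built list could be replaced by a loop is sketched below. This is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so that it runs without extra installs, and SAMPLE_XML, its <airdata> root element, and the station names are made up to mirror the region > station > measurement nesting that the code above implies. The real feed may look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real feed's root element and attributes may differ.
SAMPLE_XML = """\
<airdata>
  <region name="South East Queensland">
    <station name="Station A">
      <measurement name="Ozone">0.02</measurement>
    </station>
    <station name="Station B"/>
  </region>
  <region name="Gladstone">
    <station name="Station C">
      <measurement name="Sulfur Dioxide">0.001</measurement>
    </station>
  </region>
</airdata>
"""

root = ET.fromstring(SAMPLE_XML)

# iter('region') yields every <region> element in document order,
# so no region name has to be typed out by hand.
stations_by_region = {
    region.get('name'): list(region.iter('station'))
    for region in root.iter('region')
}

# Each region's stations can now be looked up by region name.
for region_name, stations in stations_by_region.items():
    print(region_name, [station.get('name') for station in stations])
```

With BeautifulSoup, the same idea would be to call xml_soup.find_all('region') with no name filter and loop over the result, so new or renamed regions are picked up without editing the code. Keying the stations by region name also removes the need for a separate index counter.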
Lastly, I was curious whether the for loop at the end of my code, which writes the data to the .csv file, is the most efficient way to do this. To be honest, it seems a little drawn out to me, and I feel there may be a way to shorten it:

file_name = 'Air Quality.csv'

f = open(file_name, 'w')

f.write(headers)

index = 0
for region in regions:
    region_string = ''
    region_name = region['name'] + ','
    for station in stations[index]:
        station_string = ''
        station_name = station['name'] + ','
        nd = station.findAll('measurement', {'name': 'Nitrogen Dioxide'})
        nd = ',' if len(nd) == 0 else str(nd[0].text) + ','
        o = station.findAll('measurement', {'name': 'Ozone'})
        o = ',' if len(o) == 0 else str(o[0].text) + ','    
        sd = station.findAll('measurement', {'name': 'Sulfur Dioxide'})
        sd = ',' if len(sd) == 0 else str(sd[0].text) + ','    
        cm = station.findAll('measurement', {'name': 'Carbon Monoxide'})
        cm = ',' if len(cm) == 0 else str(cm[0].text) + ','    
        ppm10 = station.findAll('measurement', {'name': 'Particle PM10'})
        ppm10 = ',' if len(ppm10) == 0 else str(ppm10[0].text) + ','    
        ppm2 = station.findAll('measurement', {'name': 'Particle PM2.5'})
        ppm2 = ',' if len(ppm2) == 0 else str(ppm2[0].text) + ','    
        ptsp = station.findAll('measurement', {'name': 'Particles TSP'})
        ptsp = ',' if len(ptsp) == 0 else str(ptsp[0].text) + ','    
        v = station.findAll('measurement', {'name': 'Visibility'})
        v = '\n' if len(v) == 0 else str(v[0].text) + '\n'
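        # Write exactly one row per station. Appending to region_string with += and
        # writing it on every pass would repeat the earlier stations' rows.
        station_string = region_name + station_name + nd + o + sd + cm + ppm10 + ppm2 + ptsp + v
        f.write(station_string)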
    index += 1
f.close()
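A shorter version of the writing loop might look like the sketch below, which keeps the measurement names in one list and lets the csv module handle commas and line endings. It rests on assumptions: it uses xml.etree.ElementTree instead of BeautifulSoup so it runs on the standard library alone, and SAMPLE_XML, MEASUREMENTS, station_row, and write_rows are names made up for this illustration. The sample data only imitates the nesting implied by the code above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Column order taken from the header string in the original script.
MEASUREMENTS = ['Nitrogen Dioxide', 'Ozone', 'Sulfur Dioxide', 'Carbon Monoxide',
                'Particle PM10', 'Particle PM2.5', 'Particles TSP', 'Visibility']

# Hypothetical sample data; the real feed may be structured differently.
SAMPLE_XML = """\
<airdata>
  <region name="South East Queensland">
    <station name="Station A">
      <measurement name="Ozone">0.02</measurement>
    </station>
  </region>
  <region name="Gladstone">
    <station name="Station C">
      <measurement name="Visibility">12</measurement>
    </station>
  </region>
</airdata>
"""


def station_row(region_name, station):
    """Build one CSV row; measurements the station lacks become empty cells."""
    readings = {}
    for measurement in station.iter('measurement'):
        # setdefault keeps the first value if a name appears more than once.
        readings.setdefault(measurement.get('name'), (measurement.text or '').strip())
    return [region_name, station.get('name')] + [readings.get(name, '') for name in MEASUREMENTS]


def write_rows(root, out):
    """Write the header and one row per station to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(['Region', 'Station Name'] + MEASUREMENTS)
    for region in root.iter('region'):
        for station in region.iter('station'):
            writer.writerow(station_row(region.get('name'), station))


# Demo: write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_rows(ET.fromstring(SAMPLE_XML), buffer)
print(buffer.getvalue())
```

To write a real file, the buffer could be swapped for `with open('Air Quality.csv', 'w', newline='') as f:`, which also closes the file even if an error occurs partway through. Adding or reordering a column then only means editing the MEASUREMENTS list.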
I'm having a lot of fun learning how to scrape data from websites and from different JSON and XML APIs, but I don't currently know anyone who works in data science or does any data collection. I thought this would be a good skill to have in any scientific field, and I would love any input on how to make the code more efficient, what I'm doing wrong, and how to structure it in a more "standardized" way.

Thanks for any input/advice given,

FalseFact
Is there a web page associated with this data?
I'm thinking a Selenium scraper might be best.
How would you navigate to the data from this page: environment.des.qld.gov.au?
I'll write you a sample Selenium scraper if I know (page by page) how to get to the data (what you would click on or type).
This would allow you to use the same scraper for any of the data available.
(Apr-01-2019, 09:28 AM) Larz60+ wrote: Is there a web page associated with this data?
Larz, the data comes as an XML response from the API; the link is in the OP.
Quote: Larz, the data comes as an XML response from the API; the link is in the OP.
Yes, I know that. What I was looking for is this:
https://environment.des.qld.gov.au/air/data/search.php