Python Forum
Fetching and Parsing XML Data
#1
Hello everyone,

I'm making this post to see whether there is a better way to do certain things in my code for fetching and parsing data from an XML API. I'm using an Australian air-quality XML feed for this exercise: Australian XML Database for Air Quality. (The URL for the XML data is also in the code snippet below if you do not trust my hyperlink.)

First here is the entire python code snippet:
from bs4 import BeautifulSoup as soup
import requests

URL = 'https://environment.des.qld.gov.au/cgi-bin/air/xml.php?category=1&region=ALL'
raw_data = requests.get(URL, verify = True)

xml_soup = soup(raw_data.content, 'xml')

regions = []

South_East_Queensland = xml_soup.findAll('region', {'name': 'South East Queensland'})
South_West_Queensland = xml_soup.findAll('region', {'name': 'South West Queensland'})
Gladstone = xml_soup.findAll('region', {'name': 'Gladstone'})
Mackay = xml_soup.findAll('region', {'name': 'Mackay'})
Townsville = xml_soup.findAll('region', {'name': 'Townsville'})
Mount_Isa = xml_soup.findAll('region', {'name': 'Mount Isa'})

regions.append(South_East_Queensland[0])
regions.append(South_West_Queensland[0])
regions.append(Gladstone[0])
regions.append(Mackay[0])
regions.append(Townsville[0])
regions.append(Mount_Isa[0])

stations = []
for index in range(len(regions)):
    stations.append(regions[index].findAll('station'))

headers = 'Region, Station Name, Nitrogen Dioxide, Ozone, Sulfur Dioxide, Carbon Monoxide, Particle PM10, Particle PM2.5, Particles TSP, Visibility\n'

file_name = 'Air Quality.csv'

f = open(file_name, 'w')

f.write(headers)

index = 0
for region in regions:
    region_string = ''
    region_name = region['name'] + ','
    for station in stations[index]:
        station_string = ''
        station_name = station['name'] + ','
        nd = station.findAll('measurement', {'name': 'Nitrogen Dioxide'})
        nd = ',' if len(nd) == 0 else str(nd[0].text) + ','
        o = station.findAll('measurement', {'name': 'Ozone'})
        o = ',' if len(o) == 0 else str(o[0].text) + ','    
        sd = station.findAll('measurement', {'name': 'Sulfur Dioxide'})
        sd = ',' if len(sd) == 0 else str(sd[0].text) + ','    
        cm = station.findAll('measurement', {'name': 'Carbon Monoxide'})
        cm = ',' if len(cm) == 0 else str(cm[0].text) + ','    
        ppm10 = station.findAll('measurement', {'name': 'Particle PM10'})
        ppm10 = ',' if len(ppm10) == 0 else str(ppm10[0].text) + ','    
        ppm2 = station.findAll('measurement', {'name': 'Particle PM2.5'})
        ppm2 = ',' if len(ppm2) == 0 else str(ppm2[0].text) + ','    
        ptsp = station.findAll('measurement', {'name': 'Particles TSP'})
        ptsp = ',' if len(ptsp) == 0 else str(ptsp[0].text) + ','    
        v = station.findAll('measurement', {'name': 'Visibility'})
        v = '\n' if len(v) == 0 else str(v[0].text) + '\n'
        # Build each station's row fresh; using += here would re-write every
        # earlier station's row on each pass through the loop.
        region_string = region_name + station_name + nd + o + sd + cm + ppm10 + ppm2 + ptsp + v
        f.write(region_string)
    index += 1
f.close()
To begin, I'm not really concerned with classes and objects in my programming at this time, but if anyone has any input there I'm happy to hear it so I can improve my methodology.

My biggest issue is likely the for loops and setting up new lists for the regional and station data.

Regarding the code snippet below: is there a better way to fetch the regional data and put it into a list, with each region in its proper position so it can be easily accessed later, instead of placing each one into the list myself? I initially thought a for loop might be useful, but I couldn't work out the logic (I'm not great with for loops yet, mainly because they are so different from the other languages I've played with, but I am learning!). Any input here would be great, thanks:
regions = []

South_East_Queensland = xml_soup.findAll('region', {'name': 'South East Queensland'})
South_West_Queensland = xml_soup.findAll('region', {'name': 'South West Queensland'})
Gladstone = xml_soup.findAll('region', {'name': 'Gladstone'})
Mackay = xml_soup.findAll('region', {'name': 'Mackay'})
Townsville = xml_soup.findAll('region', {'name': 'Townsville'})
Mount_Isa = xml_soup.findAll('region', {'name': 'Mount Isa'})

regions.append(South_East_Queensland[0])
regions.append(South_West_Queensland[0])
regions.append(Gladstone[0])
regions.append(Mackay[0])
regions.append(Townsville[0])
regions.append(Mount_Isa[0])

stations = []
for index in range(len(regions)):
    stations.append(regions[index].findAll('station'))
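For what it's worth, the per-name lookups can usually be collapsed into a single `find_all` call, since every `<region>` element in this feed carries a `name` attribute. A minimal sketch of the idea, using a small inline XML sample in place of the live feed (the sample's structure is assumed from the snippet above, and `html.parser` is used here so the sketch runs without lxml; the real script would keep parsing `requests.get(URL).content` with the `'xml'` parser):

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the live feed (assumed structure).
sample_xml = """
<aqi>
  <region name="Gladstone">
    <station name="Station A"></station>
    <station name="Station B"></station>
  </region>
  <region name="Mackay">
    <station name="Station C"></station>
  </region>
</aqi>
"""

xml_soup = BeautifulSoup(sample_xml, 'html.parser')

# One find_all call picks up every <region> in document order,
# so no per-name lookups or manual appends are needed.
regions = xml_soup.find_all('region')

# Pair each region with its stations in the same pass.
stations = [region.find_all('station') for region in regions]

print([r['name'] for r in regions])   # ['Gladstone', 'Mackay']
print(len(stations[0]))               # 2
```

This keeps `regions` and `stations` index-aligned automatically, and it also keeps working if the feed adds or renames a region.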
Lastly, I was curious whether the for loop at the end of my code, which writes the data to the .csv file, is the most efficient way to do this. To be honest it seems a little drawn out to me, and I suspect there is a way to shorten the process:

file_name = 'Air Quality.csv'

f = open(file_name, 'w')

f.write(headers)

index = 0
for region in regions:
    region_string = ''
    region_name = region['name'] + ','
    for station in stations[index]:
        station_string = ''
        station_name = station['name'] + ','
        nd = station.findAll('measurement', {'name': 'Nitrogen Dioxide'})
        nd = ',' if len(nd) == 0 else str(nd[0].text) + ','
        o = station.findAll('measurement', {'name': 'Ozone'})
        o = ',' if len(o) == 0 else str(o[0].text) + ','    
        sd = station.findAll('measurement', {'name': 'Sulfur Dioxide'})
        sd = ',' if len(sd) == 0 else str(sd[0].text) + ','    
        cm = station.findAll('measurement', {'name': 'Carbon Monoxide'})
        cm = ',' if len(cm) == 0 else str(cm[0].text) + ','    
        ppm10 = station.findAll('measurement', {'name': 'Particle PM10'})
        ppm10 = ',' if len(ppm10) == 0 else str(ppm10[0].text) + ','    
        ppm2 = station.findAll('measurement', {'name': 'Particle PM2.5'})
        ppm2 = ',' if len(ppm2) == 0 else str(ppm2[0].text) + ','    
        ptsp = station.findAll('measurement', {'name': 'Particles TSP'})
        ptsp = ',' if len(ptsp) == 0 else str(ptsp[0].text) + ','    
        v = station.findAll('measurement', {'name': 'Visibility'})
        v = '\n' if len(v) == 0 else str(v[0].text) + '\n'
        # Build each station's row fresh; using += here would re-write every
        # earlier station's row on each pass through the loop.
        region_string = region_name + station_name + nd + o + sd + cm + ppm10 + ppm2 + ptsp + v
        f.write(region_string)
    index += 1
f.close()
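One common way to shorten a loop like this is to keep the pollutant names in a list and use the stdlib `csv` module, which also handles quoting if a value ever contains a comma. A sketch under the same assumptions as before (inline sample data standing in for the live feed, `html.parser` instead of lxml's `'xml'`, and measurement values taken from the element text as in the snippet above):

```python
import csv
from bs4 import BeautifulSoup

POLLUTANTS = ['Nitrogen Dioxide', 'Ozone', 'Sulfur Dioxide', 'Carbon Monoxide',
              'Particle PM10', 'Particle PM2.5', 'Particles TSP', 'Visibility']

# Inline sample standing in for the live feed (assumed structure).
sample_xml = """
<aqi>
  <region name="Gladstone">
    <station name="Station A">
      <measurement name="Ozone">0.012</measurement>
      <measurement name="Particle PM10">17.3</measurement>
    </station>
  </region>
</aqi>
"""

xml_soup = BeautifulSoup(sample_xml, 'html.parser')

# newline='' stops csv.writer from emitting blank lines on Windows.
with open('Air Quality.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Region', 'Station Name'] + POLLUTANTS)
    for region in xml_soup.find_all('region'):
        for station in region.find_all('station'):
            row = [region['name'], station['name']]
            for pollutant in POLLUTANTS:
                m = station.find('measurement', {'name': pollutant})
                # Blank cell when a station does not report this pollutant.
                row.append(m.text if m else '')
            writer.writerow(row)
```

The `with` block also closes the file for you, so the separate `f.close()` goes away, and adding or removing a pollutant becomes a one-line change to `POLLUTANTS` instead of two new lines per measurement.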
I'm having a lot of fun learning how to scrape data from websites and from different JSON and XML APIs, but I don't currently know anyone working in data science or data collection. I thought this would be a good skill to have in any scientific field, and I'd love input on how to make the code more efficient, what I'm doing wrong that could be better, and how to structure it in a way that is considered more "standardized".

Thanks for any input/advice given,

FalseFact
#2
Is there a web page associated with this data?
I'm thinking a selenium scraper might be best.
How would you navigate to the data from this page: environment.des.qld.gov.au
I'll write you a sample selenium scraper if I know (page by page) how to get to the data (what you would click on, or type).
This would allow you to use the same scraper for any of the data available.
#3
(Apr-01-2019, 09:28 AM)Larz60+ Wrote: Is there a web page associated with this data?
Larz, the data comes as an XML response from the API; the link is in the OP.
#4
Quote: Larz, the data comes as an XML response from the API; the link is in the OP.
Yes, I know that. What I was looking for is this:
https://environment.des.qld.gov.au/air/data/search.php

