 Fetching and Parsing XML Data
#1
Hello everyone,

I'm making this post to see if there is a better way to do certain things in my code for fetching and parsing data from an XML API. I'm using an Australian environmental air quality XML feed for this exercise: Australian XML Database for Air Quality. (The link to the XML data page is also in the code snippet below if you do not trust my hyperlink.)

First, here is the entire Python code snippet:
from bs4 import BeautifulSoup as soup
import requests

# Fetch the raw XML feed from the Queensland air quality API
URL = 'https://environment.des.qld.gov.au/cgi-bin/air/xml.php?category=1&region=ALL'
raw_data = requests.get(URL, verify=True)

# Parse the response as XML
xml_soup = soup(raw_data.content, 'xml')

regions = []

South_East_Queensland = xml_soup.findAll('region', {'name': 'South East Queensland'})
South_West_Queensland = xml_soup.findAll('region', {'name': 'South West Queensland'})
Gladstone = xml_soup.findAll('region', {'name': 'Gladstone'})
Mackay = xml_soup.findAll('region', {'name': 'Mackay'})
Townsville = xml_soup.findAll('region', {'name': 'Townsville'})
Mount_Isa = xml_soup.findAll('region', {'name': 'Mount Isa'})

regions.append(South_East_Queensland[0])
regions.append(South_West_Queensland[0])
regions.append(Gladstone[0])
regions.append(Mackay[0])
regions.append(Townsville[0])
regions.append(Mount_Isa[0])

# stations[i] holds the list of <station> tags for regions[i]
stations = []
for index in range(len(regions)):
    stations.append(regions[index].findAll('station'))

headers = 'Region, Station Name, Nitrogen Dioxide, Ozone, Sulfur Dioxide, Carbon Monoxide, Particle PM10, Particle PM2.5, Particles TSP, Visibility\n'

file_name = 'Air Quality.csv'

f = open(file_name, 'w')

f.write(headers)

index = 0
for region in regions:
    region_name = region['name'] + ','
    for station in stations[index]:
        station_name = station['name'] + ','
        # For each measurement, keep the reading if the station reports it, otherwise leave the column empty
        nd = station.findAll('measurement', {'name': 'Nitrogen Dioxide'})
        nd = ',' if len(nd) == 0 else str(nd[0].text) + ','
        o = station.findAll('measurement', {'name': 'Ozone'})
        o = ',' if len(o) == 0 else str(o[0].text) + ','
        sd = station.findAll('measurement', {'name': 'Sulfur Dioxide'})
        sd = ',' if len(sd) == 0 else str(sd[0].text) + ','
        cm = station.findAll('measurement', {'name': 'Carbon Monoxide'})
        cm = ',' if len(cm) == 0 else str(cm[0].text) + ','
        ppm10 = station.findAll('measurement', {'name': 'Particle PM10'})
        ppm10 = ',' if len(ppm10) == 0 else str(ppm10[0].text) + ','
        ppm2 = station.findAll('measurement', {'name': 'Particle PM2.5'})
        ppm2 = ',' if len(ppm2) == 0 else str(ppm2[0].text) + ','
        ptsp = station.findAll('measurement', {'name': 'Particles TSP'})
        ptsp = ',' if len(ptsp) == 0 else str(ptsp[0].text) + ','
        v = station.findAll('measurement', {'name': 'Visibility'})
        v = '\n' if len(v) == 0 else str(v[0].text) + '\n'
        # Write one CSV row per station
        f.write(region_name + station_name + nd + o + sd + cm + ppm10 + ppm2 + ptsp + v)
    index += 1
f.close()
To begin, I'm not really concerned with classes and objects for my programming at this time, but if anyone has any input there I am happy to hear it so I can improve my methodology.

My biggest issue is likely for loops and setting up new lists for the regional and station data.

Regarding the code snippet below, is there a better way to fetch the regional data and put it into a list, with each region in its proper place so it can be easily accessed later, instead of placing each one into the list myself? I initially thought a for loop might be useful, but I couldn't work out the logic (I'm not great with for loops yet, mainly because they are so different from the other languages I've dabbled in, but I am learning!). I've added a rough sketch of the kind of loop I was picturing after the snippet. Any input here would be great, thanks:
regions = []

South_East_Queensland = xml_soup.findAll('region', {'name': 'South East Queensland'})
South_West_Queensland = xml_soup.findAll('region', {'name': 'South West Queensland'})
Gladstone = xml_soup.findAll('region', {'name': 'Gladstone'})
Mackay = xml_soup.findAll('region', {'name': 'Mackay'})
Townsville = xml_soup.findAll('region', {'name': 'Townsville'})
Mount_Isa = xml_soup.findAll('region', {'name': 'Mount Isa'})

regions.append(South_East_Queensland[0])
regions.append(South_West_Queensland[0])
regions.append(Gladstone[0])
regions.append(Mackay[0])
regions.append(Townsville[0])
regions.append(Mount_Isa[0])

stations = []
for index in range(len(regions)):
    stations.append(regions[index].findAll('station'))
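Something along these lines is roughly the loop I was picturing, using the same hard-coded region names as above and skipping any region that happens to be missing from the feed, though I'm not sure whether it's the idiomatic way to do it:

region_names = ['South East Queensland', 'South West Queensland', 'Gladstone',
                'Mackay', 'Townsville', 'Mount Isa']

regions = []
stations = []
for name in region_names:
    # Look the region up by name and skip it if the feed does not include it
    found = xml_soup.findAll('region', {'name': name})
    if found:
        regions.append(found[0])
        stations.append(found[0].findAll('station'))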
Lastly, I was curious whether the for loop at the end of my code, which writes the data to the .csv file, is the most efficient way to do this. To be honest, it seemed a little drawn out to me, and I feel there might be a way to shorten the process (I've put a rough idea of what I mean after the snippet):

file_name = 'Air Quality.csv'

f = open(file_name, 'w')

f.write(headers)

index = 0
for region in regions:
    region_name = region['name'] + ','
    for station in stations[index]:
        station_name = station['name'] + ','
        # For each measurement, keep the reading if the station reports it, otherwise leave the column empty
        nd = station.findAll('measurement', {'name': 'Nitrogen Dioxide'})
        nd = ',' if len(nd) == 0 else str(nd[0].text) + ','
        o = station.findAll('measurement', {'name': 'Ozone'})
        o = ',' if len(o) == 0 else str(o[0].text) + ','
        sd = station.findAll('measurement', {'name': 'Sulfur Dioxide'})
        sd = ',' if len(sd) == 0 else str(sd[0].text) + ','
        cm = station.findAll('measurement', {'name': 'Carbon Monoxide'})
        cm = ',' if len(cm) == 0 else str(cm[0].text) + ','
        ppm10 = station.findAll('measurement', {'name': 'Particle PM10'})
        ppm10 = ',' if len(ppm10) == 0 else str(ppm10[0].text) + ','
        ppm2 = station.findAll('measurement', {'name': 'Particle PM2.5'})
        ppm2 = ',' if len(ppm2) == 0 else str(ppm2[0].text) + ','
        ptsp = station.findAll('measurement', {'name': 'Particles TSP'})
        ptsp = ',' if len(ptsp) == 0 else str(ptsp[0].text) + ','
        v = station.findAll('measurement', {'name': 'Visibility'})
        v = '\n' if len(v) == 0 else str(v[0].text) + '\n'
        # Write one CSV row per station
        f.write(region_name + station_name + nd + o + sd + cm + ppm10 + ppm2 + ptsp + v)
    index += 1
f.close()
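For example, I wondered whether something like the snippet below, which loops over the measurement names from my header row and uses Python's built-in csv module, would shorten it. I haven't run this against the live feed yet, so please treat it as a rough sketch (it reuses the regions and stations lists built earlier):

import csv

# Same measurement names as in the header row
pollutants = ['Nitrogen Dioxide', 'Ozone', 'Sulfur Dioxide', 'Carbon Monoxide',
              'Particle PM10', 'Particle PM2.5', 'Particles TSP', 'Visibility']

with open('Air Quality.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Region', 'Station Name'] + pollutants)
    for region, region_stations in zip(regions, stations):
        for station in region_stations:
            row = [region['name'], station['name']]
            for pollutant in pollutants:
                measurement = station.find('measurement', {'name': pollutant})
                # Leave the cell empty if the station does not report this measurement
                row.append(measurement.text if measurement else '')
            writer.writerow(row)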
I'm having a lot of fun learning how to scrape data from websites and from different JSON and XML APIs, but I don't currently know anyone working in data science or data collection. I just thought this would be a good skill to have in any scientific field, and I would love any input on how to make the code more efficient, what I'm doing wrong that could be better, and how to structure it in a way that is considered more "standardized".

Thanks for any input/advice given,

FalseFact
#2
Is there a web page associated with this data?
I'm thinking a selenium scraper might be best.
How would you navigate to the data from this page: environment.des.qld.gov.au
I'll write you a sample selenium scraper if I know (page by page) how to get to the data (what you would click on, or type).
This would allow you to use the same scraper for any of the data available.
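To give you an idea of the shape I have in mind (assuming Chrome here, and leaving the actual clicks and form fields as comments until I know the page-by-page flow):

from bs4 import BeautifulSoup
from selenium import webdriver

# Open the site in a real browser session
driver = webdriver.Chrome()
driver.get('https://environment.des.qld.gov.au')

# ... the clicks/typing needed to reach the data would go here,
# once the page-by-page navigation is known ...

# Hand the rendered page to BeautifulSoup for parsing
page_soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()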
#3
(Apr-01-2019, 09:28 AM)Larz60+ Wrote: Is there a web page associated with this data?
Larz, the data comes as an XML response from the API; the link is in the OP.
#4
Quote:Larz, the data comes as an XML response from the API; the link is in the OP.
Yes, I know that. What I was looking for is this:
https://environment.des.qld.gov.au/air/data/search.php