Python Forum

Full Version: Scraping .aspx websites
Hi All,

I'm trying to perform what I thought would be a simple web scraping task, but I'm running into an issue I can't figure out. The page is a '.aspx' page, which I suspect has something to do with it.

The task:
(i) Get the name of and link to every elementary school on this page: http://www.yrdsb.ca/Schools/Pages/default.aspx
(ii) On an individual school's page, retrieve the grades, address, phone, fax, email, and bell times. Example page for a school: http://www.yrdsb.ca/Schools/Pages/School-Profile.aspx?SchoolID=227

I am able to load the webpage into bs4, but from that point I can't get CSS selectors (or anything else I've tried) to locate the data I'm looking for in the response page.

Does anyone have any clever ideas???

from bs4 import BeautifulSoup
import urllib3

url = "http://www.yrdsb.ca/Schools/Pages/School-Profile.aspx?SchoolID=227"

# Fetch the page and hand the raw bytes to BeautifulSoup.
# Use an uppercase HTTP method ('GET'), which is what urllib3 expects.
http = urllib3.PoolManager()
r = http.request('GET', url)
soup = BeautifulSoup(r.data, 'lxml')
This is interesting. It's not an AJAX issue; all the schools are definitely listed in the first page returned, but BeautifulSoup doesn't want to find those divs. Also, there are sub-HTML pages defined within the page, hidden inside XML CDATA sections. I have a feeling BeautifulSoup is having trouble because the page is malformed in some way, but if you open it in a browser and run document.querySelectorAll("div.sch a") in the JavaScript console, you get all the schools. I'll have to play around with it later; maybe someone else can make sense of this.
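
For anyone who wants to reproduce the mismatch, here is a minimal check, assuming urllib3 and the lxml parser as in the snippet above (the results noted in the comments are what this thread reports, not guaranteed):

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
r = http.request('GET', 'http://www.yrdsb.ca/Schools/Pages/default.aspx')
soup = BeautifulSoup(r.data, 'lxml')

# The markup is clearly present in the raw bytes...
print(r.data.count(b'class="sch"'))   # non-zero: the divs are in the raw HTML
# ...yet the parsed tree doesn't expose them to CSS selection.
print(len(soup.select('div.sch')))    # reportedly 0 with this parser, per the thread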

If we can't figure out how to massage bs4 into working right, you could try writing a SAX handler and using the built-in SAX processing library (xml.sax) to find the elements you want. It also might not be a terrible idea to send this link to the BeautifulSoup developer(s), so they can see whether there's something in bs4 that can be improved.
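
If you want to try that route, Python's html.parser offers the same event-driven, SAX-style callbacks but is more forgiving of HTML that isn't well-formed XML (strict xml.sax would likely choke on this page). A rough sketch of that idea, with the div.sch / <a href> structure assumed from the browser console query above:

from html.parser import HTMLParser
import urllib3

class SchoolLinkParser(HTMLParser):
    """Collect the <a> links that appear inside <div class="sch"> blocks."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while we are inside a div.sch (counts nesting)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1          # nested div inside a div.sch
            elif "sch" in (attrs.get("class") or "").split():
                self.div_depth = 1           # entering a div.sch
        elif tag == "a" and self.div_depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1

http = urllib3.PoolManager()
r = http.request("GET", "http://www.yrdsb.ca/Schools/Pages/default.aspx")
parser = SchoolLinkParser()
parser.feed(r.data.decode("utf-8", errors="replace"))
print(len(parser.links), parser.links[:5])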
The simplest solution, without digging into the source to figure out what the JavaScript on this page does, is to use Selenium. Run it headless (no browser window loads), send the page source to BeautifulSoup, and then scraping works as expected, just like looking at the page in a browser.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

#--| Setup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'path_to_chromedriver')
#--| Parse
browser.get('http://www.yrdsb.ca/Schools/Pages/default.aspx')
soup = BeautifulSoup(browser.page_source, 'lxml')
name = soup.find('div', class_="sch")
print(name.text)
Output:
Adrienne Clarkson P.S.
@boxingowl88 There is a setup for both Chrome and Firefox in Web-scraping part-2.
I just tried parsing that website using scrapy shell.
For some reason, CSS selectors fail, but XPath works without a problem:
>>> len(response.css('div.sch'))
0
>>> len(response.xpath('//div[@class="sch"]'))
212
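
So one workaround, without reaching for scrapy or Selenium, is to drop down to lxml and run that same XPath directly. A rough sketch (the trailing <a> lookup and the href attribute are assumptions based on the div.sch a console query earlier in the thread):

import urllib3
from lxml import html

http = urllib3.PoolManager()
r = http.request('GET', 'http://www.yrdsb.ca/Schools/Pages/default.aspx')
tree = html.fromstring(r.data)

# XPath finds the school divs even where the CSS selector comes up empty.
for div in tree.xpath('//div[@class="sch"]')[:5]:
    link = div.find('a')                 # assumed: each div.sch wraps one <a>
    if link is not None:
        print(link.text_content().strip(), link.get('href'))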