Scraping .aspx websites

boxingowl88 · Oct-09-2018, 08:45 PM

Hi All,

I'm trying to perform what I thought would be a simple web scraping task, but I'm running into an issue I can't figure out. The page is an '.aspx' page, which I suspect has something to do with it.

The task:
(i) Get the name of the school + all the links for elementary schools on this webpage - http://www.yrdsb.ca/Schools/Pages/default.aspx
(ii) On an individual school's webpage, retrieve the grades, address, phone, fax, email, and bell times. Example page for a school: http://www.yrdsb.ca/Schools/Pages/School...hoolID=227

I am able to load the webpage into bs4, but from that point I am unable to use CSS selectors, or anything else I can think of, to locate the data I am looking for in the response page.

Does anyone have any clever ideas???

from bs4 import BeautifulSoup
import urllib3

url = "http://www.yrdsb.ca/Schools/Pages/School-Profile.aspx?SchoolID=227"

# Fetch the raw page and hand the response bytes to BeautifulSoup (lxml parser)
http = urllib3.PoolManager()
r = http.request('GET', url)
soup = BeautifulSoup(r.data, 'lxml')

**nilamo** · Oct-09-2018, 10:13 PM

This is interesting. It's not an AJAX issue: all the schools are definitely listed in the first page returned, but BeautifulSoup doesn't find those divs. There are also sub-HTML pages defined within the page, hidden inside XML CDATA sections. I have a feeling that BeautifulSoup is having issues because the page is malformed in some way, but if you open it in a browser and run document.querySelectorAll("div.sch a") in the JavaScript console, you get all the schools. I don't know; I'll have to play around with it later. Maybe someone else can make sense of this.

If we can't figure out how to massage bs into working right, you could try writing a SAX handler and using the built-in SAX processing library to find the elements you want. It also might not be a terrible idea to send this link to the BeautifulSoup developer(s), so they can see whether something in bs can be improved.
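A minimal sketch of the event-driven idea, for illustration only. It swaps in the standard library's html.parser.HTMLParser rather than xml.sax, since xml.sax expects well-formed XML and the page is suspected to be malformed. The SchoolLinkParser class, the sample markup, and the school names and IDs in it are all hypothetical; only the "div.sch a" structure comes from the thread.

```python
from html.parser import HTMLParser


class SchoolLinkParser(HTMLParser):
    """Collect (name, href) pairs for <a> tags nested inside <div class="sch">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # >0 while inside a matching <div class="sch">
        self.current_href = None  # href of the <a> currently being read, if any
        self.current_text = []    # text chunks seen inside that <a>
        self.schools = []         # collected (name, href) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Start counting at a div.sch, and keep counting nested divs inside it
            if self.div_depth or "sch" in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth:
            # Anchors without an href are ignored (current_href stays None)
            self.current_href = attrs.get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "a" and self.current_href is not None:
            name = "".join(self.current_text).strip()
            self.schools.append((name, self.current_href))
            self.current_href = None


# Hypothetical sample markup; the real page's structure may differ.
sample_html = """
<div class="sch"><a href="/Schools/Pages/School-Profile.aspx?SchoolID=1">Example School A</a></div>
<div class="other"><a href="/ignored">Not a school</a></div>
<div class="sch"><a href="/Schools/Pages/School-Profile.aspx?SchoolID=2">Example School B</a></div>
"""

parser = SchoolLinkParser()
parser.feed(sample_html)
parser.close()
schools = parser.schools
print(schools)
```

On the real site you would feed the decoded response body instead of sample_html. Whether this copes with the CDATA sections mentioned above is untested.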

***snippsat*** · Oct-09-2018, 11:39 PM

The simplest solution, without digging into the source to figure out what JavaScript does on this page, is to use Selenium.
You can run it headless (without opening a browser window) and send the page source to BS; scraping then works as expected from looking at the page in a browser.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

#--| Setup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
# Selenium 4 style; older releases took chrome_options= and executable_path= directly
service = Service(executable_path=r'path_to_chromedriver')
browser = webdriver.Chrome(service=service, options=chrome_options)
#--| Parse
browser.get('http://www.yrdsb.ca/Schools/Pages/default.aspx')
soup = BeautifulSoup(browser.page_source, 'lxml')
name = soup.find('div', class_="sch")  # first school entry on the page
print(name.text)
browser.quit()

Output:
Adrienne Clarkson P.S.

@boxingowl88, there is setup for both Chrome and Firefox in Web-scraping part-2.

***stranac*** · Oct-10-2018, 05:35 PM

I just tried parsing that website using scrapy shell.
For some reason, css selectors fail, but xpath works without a problem:

>>> len(response.css('div.sch'))
0
>>> len(response.xpath('//div[@class="sch"]'))
212
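The same XPath can be tried outside scrapy with lxml, which the original post already uses as the bs4 parser. This is only a sketch on hypothetical sample markup (the school names and IDs are made up), so it shows the technique, not a confirmed result on the real page.

```python
from lxml import html

# Hypothetical sample markup; the real page's structure may differ.
sample_html = """
<html><body>
<div class="sch"><a href="/Schools/Pages/School-Profile.aspx?SchoolID=1">Example School A</a></div>
<div class="other"><a href="/ignored">Not a school</a></div>
<div class="sch"><a href="/Schools/Pages/School-Profile.aspx?SchoolID=2">Example School B</a></div>
</body></html>
"""

tree = html.fromstring(sample_html)

# @class="sch" is an exact match on the attribute value, so a div with
# class="sch extra" would NOT be selected by this expression.
schools = [
    (a.text_content().strip(), a.get("href"))
    for a in tree.xpath('//div[@class="sch"]//a')
]
print(schools)
```

For the real site, pass the downloaded page bytes (for example r.data from the urllib3 request in the first post) to html.fromstring instead of sample_html.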
