Python Forum
Scraping .aspx websites
#1
Hi All,

I'm trying to perform what I thought would be a simple web scraping task, but I've run into an issue I can't figure out. The page is an '.aspx' page, which I suspect has something to do with it.

The task:
(i) Get the name of the school + all the links for elementary schools on this webpage - http://www.yrdsb.ca/Schools/Pages/default.aspx
(ii) On an individual school's webpage, retrieve the grades, address, phone, fax, email, and bell times. Example page for a school: http://www.yrdsb.ca/Schools/Pages/School...hoolID=227

I can load the webpage into bs4, but from that point on I can't locate the data I'm looking for in the response page, whether with CSS selectors or anything else I can think of.

Does anyone have any clever ideas???

from bs4 import BeautifulSoup
import urllib3

url = "http://www.yrdsb.ca/Schools/Pages/School-Profile.aspx?SchoolID=227"

# Fetch the raw page bytes and hand them to BeautifulSoup with the lxml parser
http = urllib3.PoolManager()
r = http.request('GET', url)
soup = BeautifulSoup(r.data, 'lxml')
#2
This is interesting. It's not an AJAX issue: all the schools are definitely listed in the first page returned, but BS doesn't want to find those divs. There are also sub-HTML pages defined within the page, hidden inside XML CDATA sections. I have a feeling that Beautiful Soup is having trouble because the page is malformed in some way, but if you open it in a browser and run document.querySelectorAll("div.sch a") in the JavaScript console, you get all the schools. I don't know; I'll have to play around with it later. Maybe someone else can make sense of this.

If we can't figure out how to massage BS into working right, you could try writing a SAX handler and using the built-in SAX processing library to find the elements you want. It also might not be a terrible idea to send this link to the BeautifulSoup developer(s), so they can see whether something in BS can be improved.
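A rough sketch of that event-driven idea follows. Note the swap: instead of xml.sax (which expects well-formed XML and may reject a real-world HTML page), it uses the standard library's html.parser.HTMLParser, which offers similar start-tag/end-tag/data callbacks. The class name SchoolLinkParser and the sample markup are made up for illustration; the only thing taken from the thread is that the school links sit in <a> tags inside <div class="sch">. It has not been run against the live page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.yrdsb.ca/Schools/Pages/default.aspx"


class SchoolLinkParser(HTMLParser):
    """Collect (name, url) pairs for <a> tags nested inside <div class="sch">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.schools = []     # finished (name, absolute_url) pairs
        self._div_depth = 0   # > 0 while inside a div.sch (counts nested divs too)
        self._href = None     # href of the <a> currently open, if any
        self._text = []       # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self._div_depth or "sch" in classes:
                self._div_depth += 1
        elif tag == "a" and self._div_depth and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            self.schools.append((name, urljoin(self.base_url, self._href)))
            self._href = None
        elif tag == "div" and self._div_depth:
            self._div_depth -= 1


# Made-up sample markup; the school name and ID are placeholders, not real data.
SAMPLE = """
<div class="other"><a href="/skip.aspx">Not a school</a></div>
<div class="sch"><a href="School-Profile.aspx?SchoolID=1">Example P.S.</a></div>
"""

parser = SchoolLinkParser(BASE_URL)
parser.feed(SAMPLE)
parser.close()
print(parser.schools)
```

To try it on the real page, you would feed it the decoded response body (for example r.data.decode() from the urllib3 request in the first post) instead of SAMPLE. Whether the live markup nests cleanly enough for the depth counter to work is an open question.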
#3
The simplest solution, without digging into the source to figure out what the JavaScript does on this page, is to use Selenium.
You can run it headless (without opening a browser window) and pass the page source to BS; scraping then works as expected, matching what you see when viewing the page in a browser.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

#--| Setup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--log-level=3')
# Selenium 4 style; older releases took chrome_options= and executable_path= directly
browser = webdriver.Chrome(service=Service(executable_path=r'path_to_chromedriver'), options=chrome_options)
#--| Parse
browser.get('http://www.yrdsb.ca/Schools/Pages/default.aspx')
soup = BeautifulSoup(browser.page_source, 'lxml')
browser.quit()  # close the headless browser once the source is captured
# First <div class="sch"> on the page
name = soup.find('div', class_="sch")
print(name.text)
Output:
Adrienne Clarkson P.S.
@boxingowl88 has a setup for both Chrome and Firefox in Web-scraping part-2.
#4
I just tried parsing that website using scrapy shell.
For some reason, CSS selectors fail, but XPath works without a problem:
>>> len(response.css('div.sch'))
0
>>> len(response.xpath('//div[@class="sch"]'))
212
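For anyone who wants to see the same XPath idea outside the Scrapy shell, here is a minimal standard-library sketch. It runs on a small, made-up, well-formed snippet (the school names and IDs are placeholders), and it uses xml.etree.ElementTree, which supports only a limited XPath subset and cannot parse messy real-world HTML. For the live page you would most likely need an HTML-aware parser such as lxml.html or Scrapy itself; that part is an assumption and is not shown here.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample that mirrors the div.sch > a shape mentioned in this thread.
SAMPLE = """
<html><body>
  <div class="sch"><a href="School-Profile.aspx?SchoolID=1">Example P.S.</a></div>
  <div class="sch"><a href="School-Profile.aspx?SchoolID=2">Sample P.S.</a></div>
  <div class="other"><a href="about.aspx">About</a></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Same predicate as the Scrapy query above, written as a relative path
# because ElementTree does not accept absolute paths on an element.
school_divs = root.findall(".//div[@class='sch']")

# Pull the link text and href out of each matching div.
links = [(a.text, a.get("href")) for a in root.findall(".//div[@class='sch']/a")]

print(len(school_divs))
print(links)
```

Like the Scrapy query, the [@class='sch'] predicate is an exact string match on the attribute, so a div with class="sch featured" would not be picked up.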