 Scraping .aspx page
#1
How the heck do you scrape an .aspx page??
I tried to get the page with requests and it seems to get stuck downloading,
or it's trying to download all the links automatically.

I have zero experience with this type of web page.

Thanks again Microsoft!
#2
Hm! Webkit?
#3
Can you post a link?
#4
Here's the site for California Public data Catalog: http://publicpay.ca.gov/Reports/RawExport.aspx

I see some info on Scrapy being able to scrape ASP.Net stuff, but very little.
I'd rather use beautifulsoup or lxml if possible.

One thing I noticed makes me think there's an easy method (or at least a method) to convert it to HTML:
right-clicking on the page in Firefox and selecting "View Page Source" immediately brings up the page as HTML.

Haven't determined if that's useful or not yet.
#5
from bs4 import BeautifulSoup
import requests

url = 'http://publicpay.ca.gov/Reports/RawExport.aspx'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
# All the download links sit inside the main content column
col = soup.find('div', class_="column_main")
col_all = col.find_all('a')
for link in col_all:
    print(link.get('href'))
Output:
/RawExport/2015_CaliforniaStateUniversity.zip
/RawExport/2015_City.zip
/RawExport/2015_CommunityCollegeDistrict.zip
/RawExport/2015_County.zip
/RawExport/2015_FairsExpos.zip
/RawExport/2015_First5.zip
/RawExport/2015_K12Education.zip
/RawExport/2015_SpecialDistrict.zip
/RawExport/2015_StateDepartment.zip
/RawExport/2015_SuperiorCourt.zip
/RawExport/2015_UniversityOfCalifornia.zip
/RawExport/2014_CaliforniaStateUniversity.zip
/RawExport/2014_City.zip
..............
The full URL for each file is http://publicpay.ca.gov + the link printed here.
Now you can choose a download method, e.g. urlretrieve(), or write the response to a file opened in 'wb' mode with Requests.
For larger files, downloading in chunks can be useful.
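# file_url and path are placeholders for one file's URL and its local name
r = requests.get(file_url, stream=True)  # stream=True avoids loading the whole file into memory
with open(path, 'wb') as f:
    for chunk in r.iter_content(1024):
        f.write(chunk)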
#6
Snippsat,

I was thinking you were going to be the savior on this one!

Thanks a lot, you gave me a bonus ... more than I expected!
#7
.aspx is just HTML that has C# on the backend (...or Visual Basic, if whoever wrote the site hates themselves). If the data is on the page, it should be easy to do. If it's NOT, and instead is something like a search form that loads results, then things get more difficult. ASP.NET (or at least older versions of it) uses something called a "viewstate", which is a hidden field in forms that keeps track of the state of server-side variables. It's a trash way of doing things, and most people just used cookies/sessions anyway, but a lot of things snuck into the viewstate if you didn't pay close attention.

So if you need to get data, sometimes you have to request the base page, scrape it for no reason other than to grab the viewstate value, and THEN request the actual page, supplying the viewstate you scraped. (...and then use the new viewstate value from that response, in case the results are paginated...)
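
A minimal sketch of that "grab the viewstate, then send it back" step, assuming the usual Web Forms hidden field names (__VIEWSTATE, __EVENTVALIDATION). The sample page, the Search.aspx action and the txtQuery field are invented for illustration; a real form will have its own names. It uses the standard library's html.parser so it runs without extra installs, but BeautifulSoup would do the same job.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attr = dict(attrs)
        if (attr.get('type') or '').lower() == 'hidden' and attr.get('name'):
            self.fields[attr['name']] = attr.get('value') or ''


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def build_postback(html, extra):
    """Merge the page's hidden state with the values we want to submit."""
    data = hidden_fields(html)
    data.update(extra)
    return urlencode(data)


# Made-up page fragment standing in for the first GET response
sample = '''
<form method="post" action="Search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="abc123" />
  <input type="text" name="txtQuery" />
</form>
'''

print(hidden_fields(sample))
print(build_postback(sample, {'txtQuery': 'city'}))
```

The encoded string is what you would send as the body of the POST. For paginated results, run hidden_fields() again on each response and reuse the fresh values for the next request.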
#8
As in my 1st post of this thread:

Thanks again Microsoft!
#9
I actually use ASP.NET at work every day. It's pretty good at what it does. It's just that the older versions of it... did some odd things. It was very clear that they were trying hard to make websites feel like desktop applications, with button event handlers and whatnot.
#10
I think it looks great.
... I don't have to like its structure.
Maybe someday I'll love it.

Maybe some day hell will freeze over.
Stranger things have happened

I didn't like indentation when I started using Python.
Now I don't even think about it.

Another file format that drove me crazy was Lotus 1-2-3.