Python Forum

Full Version: Scraping .aspx page
Pages: 1 2 3
How the heck do you scrape an .aspx page??
I try to get the page with requests and it seems to get stuck downloading,
or it's trying to download all the links automatically.

I have zero experience with this type of web page,

Thanks again Microsoft!
Hm! Webkit?
Can you post a link?
Here's the site for the California public pay data catalog: http://publicpay.ca.gov/Reports/RawExport.aspx

I see some info on Scrapy being able to scrape ASP.Net stuff, but very little.
I'd rather use beautifulsoup or lxml if possible.

One thing I noticed, that makes me think there's an easy method (or at least a method) to convert to html
is that right clicking on the page while in Firefox, and selecting page source immediately brings up the page in html.

Haven't determined if that's useful or not yet.
from bs4 import BeautifulSoup
import requests

url = 'http://publicpay.ca.gov/Reports/RawExport.aspx'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
col = soup.find('div', class_="column_main")
col_all = col.find_all('a')
for link in col_all:
    print(link.get('href'))
Output:
/RawExport/2015_CaliforniaStateUniversity.zip /RawExport/2015_City.zip /RawExport/2015_CommunityCollegeDistrict.zip /RawExport/2015_County.zip /RawExport/2015_FairsExpos.zip /RawExport/2015_First5.zip /RawExport/2015_K12Education.zip /RawExport/2015_SpecialDistrict.zip /RawExport/2015_StateDepartment.zip /RawExport/2015_SuperiorCourt.zip /RawExport/2015_UniversityOfCalifornia.zip /RawExport/2014_CaliforniaStateUniversity.zip /RawExport/2014_City.zip ..............
The base URL for all of them is http://publicpay.ca.gov plus the link extracted here.
Then you can choose a download method, e.g. urllib.request.urlretrieve(), or write the response in 'wb' mode with Requests.
For larger files, downloading in chunks can be useful.
import requests

r = requests.get(url, stream=True)  # stream=True avoids loading the whole file into memory
with open(path, 'wb') as f:
    for chunk in r.iter_content(1024):
        f.write(chunk)
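Putting the pieces together: the scraped hrefs are relative, so they need to be joined onto the base URL before downloading. A small sketch using only the stdlib (the two example hrefs are taken from the output above; urljoin and os.path.basename do the joining and filename extraction):

```python
import os
from urllib.parse import urljoin

base = 'http://publicpay.ca.gov'
# a couple of the relative hrefs scraped above
hrefs = ['/RawExport/2015_City.zip', '/RawExport/2015_County.zip']

# absolute download URLs and local filenames
full_urls = [urljoin(base, h) for h in hrefs]
filenames = [os.path.basename(h) for h in hrefs]
print(full_urls[0])  # http://publicpay.ca.gov/RawExport/2015_City.zip
print(filenames[0])  # 2015_City.zip
```

Each (full_url, filename) pair can then be fed to the chunked download loop above.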
Snippsat,

I was thinking you were going to be the savior on this one!

Thanks a lot, you gave me a bonus ... more than I expected!
.aspx is just html that has c# on the backend (...or visual basic, if whoever wrote the site hates themselves). If the data is on the page, it should be easy to do. If it's NOT, and instead is something like a search form to load results, then things get more difficult. ASP (or at least older versions of it) use something called a "viewstate", which is a hidden field in forms to keep track of the state of server-side variables. It's a trash way of doing things, and most people just used cookies/sessions anyway, but a lot of things snuck into the viewstate if you didn't pay too close attention.

So if you need to get data, sometimes you have to request the base page, scrape it for no other reason than to grab the viewstate value, and THEN request the actual page, supplying the viewstate you scraped. (...and then use the new viewstate value, in case the results are paginated...)
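A minimal sketch of that first step, using the same BeautifulSoup setup as earlier in the thread. The form HTML here is a made-up stand-in for a real ASP.NET page (the field names __VIEWSTATE and __EVENTVALIDATION are the real ASP.NET ones; the values and the searchBox field are invented):

```python
from bs4 import BeautifulSoup

# hypothetical ASP.NET form markup; real pages embed hidden state fields like these
html = '''
<form method="post" action="Search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTM4NzY1..." />
  <input type="hidden" name="__EVENTVALIDATION" value="AbCdEf..." />
  <input type="text" name="searchBox" />
</form>
'''

soup = BeautifulSoup(html, 'html.parser')
# collect every hidden field so it can be echoed back in the POST
state = {inp['name']: inp['value']
         for inp in soup.find_all('input', type='hidden')}

# then POST the state along with your own form values, e.g. with requests:
# requests.post(url, data={**state, 'searchBox': 'teachers'})
print(sorted(state))  # ['__EVENTVALIDATION', '__VIEWSTATE']
```

If the results are paginated, repeat the scrape on each response to pick up the fresh viewstate before the next POST.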
As in my 1st post of this thread:

Thanks again Microsoft!
I actually use asp.net at work every day. It's pretty good at what it does. It's just the older versions of it... did some odd things. It was very clear that they were trying hard to make websites feel like desktop applications, with button event handlers and whatnot.
I think it looks great.
... I don't have to like its structure.
Maybe someday I'll love it.

Maybe some day hell will freeze over.
Stranger things have happened

I didn't like indentation when I started using python.
Now I don't even think about it

Another file format that drove me crazy was Lotus-123