Python Forum

Full Version: Download a link that re-directs to a login page
Been struggling with this one for a while so hoping someone can give me a few ideas.

I've written a script to download links from an email, and it works pretty well most of the time. Most of the script just parses the email to harvest the links, then downloads them using wget:

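import re
import wget  # third-party package (pip install wget)

# searchtext (the email body) and path (the download folder) are set earlier in the script.
# Raw string so the backslashes reach the regex engine; captures https links followed by '>'.
link = r'(https://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)>'
pc = re.findall(link, searchtext)

for url in pc:
    wget.download(url, path)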
So far, so good.
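For anyone following along, here is a small sketch of how that pattern behaves. The sample text and the variable names are made up for illustration; only the pattern comes from the script. Note that it captures a link only when it is immediately followed by `>`, as in `<https://...>`.

```python
import re

# Same pattern as in the script, written as a raw string.
LINK_PATTERN = r'(https://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)>'

# Hypothetical email text; real messages will look different.
sample_text = "Files: <https://example.com/a?b=1&c=2> and <https://example.org/x%20y>"

# With a single capture group, findall returns the captured link strings.
links = re.findall(LINK_PATTERN, sample_text)
for link in links:
    print(link)
```

The `[$-_@.&+]` class is a character range from `$` to `_`, so it also matches `<` and `>`. Links that sit back to back with no whitespace between them would probably be captured as one string.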

Recently, the website changed where the links point, and the new location requires authentication. An example is a long link that now redirects to a login page.

If I run the script now, it produces a long traceback that ends with:

Error:
raise HTTPError(req.full_url, code,
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found
So I tried it in my browser and it now redirects to this page requesting authentication.
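One way to see the same thing from Python is to list every hop a request went through. The sketch below assumes a response object shaped like a `requests.Response` (with `.history`, `.status_code` and `.url`). The helper name `redirect_chain` and the example URLs are made up, and stand-in objects are used so it runs without network access.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return (status_code, url) pairs for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Stand-in objects (hypothetical URLs). With requests installed you would
# pass the result of session.get(url) instead.
first_hop = SimpleNamespace(status_code=302, url="https://example.com/file", history=[])
final = SimpleNamespace(status_code=200, url="https://example.com/login", history=[first_hop])

for status, url in redirect_chain(final):
    print(status, url)
```

If the last URL in the chain is the login page, the server did not accept the request as authenticated.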

Inspecting the form shows a few fields called session[email] and session[password], and once you click login, it posts this info before redirecting to a landing page of sorts for the project.
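Login forms often carry hidden inputs (a CSRF token, for example) that the server expects back along with the email and password. Whether this particular form has any is an assumption, since only the two visible fields are mentioned above. Here is a minimal sketch for collecting every input from a form using only the standard library. The sample markup and the `authenticity_token` field are hypothetical.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs of every <input> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are stored as empty strings.
            self.fields[name] = attributes.get("value") or ""


# Hypothetical login form; the real page's markup and field names may differ.
sample_html = """
<form action="/sessions" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="email" name="session[email]">
  <input type="password" name="session[password]">
</form>
"""

collector = InputCollector()
collector.feed(sample_html)

# Start from whatever the form contains, then fill in the credentials.
form_data = collector.fields
form_data["session[email]"] = "(email address here)"
form_data["session[password]"] = "(password here)"
print(form_data)
```

In a real run the HTML would come from a GET of the login page made with the same session that later sends the POST, so any cookies set by that page are sent back too.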

I've tried logging in first using requests:

import requests
s = requests.Session()
data = {"session[email]":"(email address here)", "session[password]":"(password here)"}
url = "https://login.procore.com/sessions"
r = s.post(url, data=data)
When I check r, I get a 200 response. So I send a second request, a GET for the file, but that one returns a 401:

import requests
s = requests.Session()
data = {"session[email]":"(email address here)", "session[password]":"(password here)"}
url = "https://login.procore.com/sessions"
r = s.post(url, data=data)
getfile="https://app.procore.com/783343/project/submittal_logs/document_downloader?attachment_id=2534930332&item_id=25772901&item_type=SubmittalLog&project_id=783343"
r1 = s.get(getfile)
That returns a 401 error. I also tried the wget method after signing in, but it still returns 302.
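Two things may be worth checking here. First, a 200 on the POST does not by itself prove the login worked: requests follows redirects, and a rejected login often just re-renders the login page with a 200. Second, wget.download makes its own separate request and does not use the cookies held by the requests session, so the file has to be fetched through the session itself. Below is a rough sketch of both checks. The helper names are made up, the assumption that a successful login leaves the login.procore.com host is only a guess, and stand-in objects replace real responses so it runs offline.

```python
import os
import tempfile
from types import SimpleNamespace
from urllib.parse import urlsplit


def still_on_login_host(response, login_host="login.procore.com"):
    """Return True if the final URL of `response` is on the login host."""
    return urlsplit(response.url).hostname == login_host


def save_body(response, path, chunk_size=8192):
    """Write a streamed response body to `path` and return the byte count."""
    written = 0
    with open(path, "wb") as handle:
        for chunk in response.iter_content(chunk_size=chunk_size):
            handle.write(chunk)
            written += len(chunk)
    return written


# Stand-ins for r = s.post(url, data=data) and
# r1 = s.get(getfile, stream=True) on the same session.
fake_login = SimpleNamespace(url="https://login.procore.com/sessions")
fake_file = SimpleNamespace(
    url="https://app.procore.com/",
    iter_content=lambda chunk_size: iter([b"abc", b"de"]),
)

print(still_on_login_host(fake_login))
target = os.path.join(tempfile.mkdtemp(), "download.bin")
print(save_body(fake_file, target))
```

If the check says the session is still on the login host after the POST, the 401 on the file request is expected, and the login request itself is the part to fix.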

So I feel like I'm either over-complicating it or I've driven past the point and totally missed something so obvious I will bang my head against the desk for a half hour.

So if anyone has any advice on this, it would be greatly appreciated. And if you've gotten this far, thanks for reading through this novel!
Did you know you can use cookies with requests by sending them along with the headers of the request? This is what I do when I'm mass scraping sites that require a captcha from users who aren't logged in.
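A minimal sketch of that idea, assuming you copy the Cookie header from a logged-in browser session (for example from the developer tools). The cookie names and values here are invented, and the helper name is hypothetical.

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(header_value):
    """Turn a 'name=value; name2=value2' Cookie header string into a dict."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


# Hypothetical cookie string copied from the browser.
cookies = cookie_header_to_dict("session_id=abc123; remember=1")
print(cookies)

# With requests installed, the dict can be passed per request:
#   requests.get(url, cookies=cookies)
# or stored on a session so every later request sends it:
#   session.cookies.update(cookies)
```

Cookies copied this way expire whenever the site ends the browser session, so this is more of a quick test than a long-term fix.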