[SOLVED] requests returning HTTP 404 when I follow a link after I do a POST - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: [SOLVED] requests returning HTTP 404 when I follow a link after I do a POST (/thread-904.html)
[SOLVED] requests returning HTTP 404 when I follow a link after I do a POST - JChris - Nov-13-2016

I get one free ebook a day from Packt Publishing through their "Free Learning - Free Technology Ebooks" promo, and I'm trying to automate the process. I do a POST against their root path to log in, then a GET on the promo URL, and use BeautifulSoup 4 to get the HREF of the "claim your free ebook" link. Now I'm stuck. Here's the code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

USERNAME = '[email protected]'
PASSWORD = 'secret'
BASE_URL = 'https://www.packtpub.com'
PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'

session = requests.session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}

session.post(BASE_URL, {"username": USERNAME, "password": PASSWORD}, headers=headers)

response = session.get(PROMO_URL, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']

print(current_offer_href)
print(session.get(current_offer_href, headers=headers))

current_offer_href holds the correct value: if you go to the site today (13/NOV/2016) and inspect the button, you will find it. In this case it's https://www.packtpub.com/freelearning-claim/17276/21478. But a GET against current_offer_href returns <Response [404]>. What I should be getting is a redirect to https://www.packtpub.com/account/my-ebooks, because that's what happens when I click the button manually on the site. What's wrong here?
RE: requests returning HTTP 404 when I follow a link after I do a POST - Larz60+ - Nov-13-2016

Please post (cut & paste) the error traceback. Thank you.

RE: requests returning HTTP 404 when I follow a link after I do a POST - JChris - Nov-13-2016

(Nov-13-2016, 03:32 PM)Larz60+ Wrote: Please post (cut & paste) the error traceback

There's no error, just:

D:\...\python.exe D:/.../main.py
https://www.packtpub.com/freelearning-claim/17276/21478
<Response [404]>

Process finished with exit code 0

RE: requests returning HTTP 404 when I follow a link after I do a POST - micseydel - Nov-13-2016

They're probably blocking bots.

RE: requests returning HTTP 404 when I follow a link after I do a POST - Ofnuts - Nov-13-2016

(Nov-13-2016, 05:02 PM)micseydel Wrote: They're probably blocking bots.

Yes... First try to use a plausible Referer in your headers, then a known User-Agent, then watch your cookies.

RE: requests returning HTTP 404 when I follow a link after I do a POST - Blue Dog - Nov-13-2016

Use browser headers to make the web site think you are a browser.

Headers

We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request. Some websites [1] dislike being browsed by programs, or send different versions to different browsers [2]. By default urllib identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header [3]. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [4].
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()

RE: requests returning HTTP 404 when I follow a link after I do a POST - JChris - Nov-13-2016

I found the problem: the site wasn't blocking me, I was just not logged in. I'm trying to do a POST against "https://www.packtpub.com" since I can't find any /login or /signin path, but it isn't working as I wanted. To log in manually, one visits the root site, clicks "Log in", and a bar slides down from the top with the login fields. They don't seem to have a dedicated login page, so how can I log in using POST in this case?

[Image: E2xi9dt.png]
[Image: 2vrKXvP.png]

RE: requests returning HTTP 404 when I follow a link after I do a POST - Ofnuts - Nov-14-2016

Have you tried a URL such as http://user:password@host/? By the way, the site should have answered with a 403, not a 404.
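[Editor's note] Ofnuts's user:password@host suggestion amounts to HTTP Basic authentication. As a minimal sketch (the host and credentials below are placeholders, and, as the thread later shows, Packt's login is a regular HTML form rather than Basic auth, so this would not have worked here), this is how requests turns such credentials into an Authorization header without sending anything over the network:

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

# Build (but do not send) a request carrying Basic-auth credentials --
# the same thing a user:password@host URL expresses.
req = requests.Request('GET', 'https://host.example/some/path',
                       auth=HTTPBasicAuth('user', 'password'))
prepared = req.prepare()

# The credentials end up base64-encoded in the Authorization header.
expected = 'Basic ' + base64.b64encode(b'user:password').decode('ascii')
print(prepared.headers['Authorization'] == expected)
```

If the server expected Basic auth, passing auth=('user', 'password') to session.get() would be the idiomatic requests spelling; a 404 instead of a 401/403 is a hint the problem is something else entirely.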
RE: requests returning HTTP 404 when I follow a link after I do a POST - buran - Nov-14-2016

Please note that there are 2 more hidden fields that you must supply as parameters to the POST request: form_id and form_build_id. Also, the username field is actually 'email', not 'username'.

import requests
from bs4 import BeautifulSoup

USERNAME = '[email protected]'
PASSWORD = 'mypassword'
FORM_BUILD_ID = 'form-c4aeea083a82fdae7d43562ee8cafeb7'
FORM_ID = 'packt_user_login_form'
BASE_URL = 'https://www.packtpub.com'
PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'

session = requests.session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}

session.post(BASE_URL, {"email": USERNAME, "password": PASSWORD,
                        'form_build_id': FORM_BUILD_ID, 'form_id': FORM_ID},
             headers=headers)

response = session.get(PROMO_URL, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
print(current_offer_href)
print(session.get(current_offer_href, headers=headers))

my_account_url = BASE_URL + '/account'  # https://www.packtpub.com/account
response = session.get(my_account_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('div', class_='menu-account').find('h1').text)

I'm not sure if form_build_id changes over time.
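[Editor's note] Since form_build_id may well change between page loads, one way to avoid hard-coding it is to scrape every hidden input from the login form on each run. A minimal sketch with BeautifulSoup (the HTML snippet below is a made-up stand-in for Packt's real login form, not its actual markup):

```python
from bs4 import BeautifulSoup

# Stand-in for the shape of markup a Drupal-style login form uses; the
# real form_build_id is generated by the site and can differ per page load.
html = '''
<form id="packt-user-login-form">
  <input type="hidden" name="form_build_id" value="form-c4aeea083a82fdae7d43562ee8cafeb7">
  <input type="hidden" name="form_id" value="packt_user_login_form">
  <input type="text" name="email">
  <input type="password" name="password">
</form>
'''

soup = BeautifulSoup(html, 'html.parser')
# Collect every hidden input into a dict ready to merge into the POST data.
hidden_fields = {inp['name']: inp.get('value', '')
                 for inp in soup.find_all('input', type='hidden')}
print(hidden_fields)
```

The resulting dict can then be merged into the credentials, e.g. payload = {'email': USERNAME, 'password': PASSWORD, **hidden_fields}, so any hidden field the form grows later is picked up automatically.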
RE: requests returning HTTP 404 when I follow a link after I do a POST - JChris - Nov-14-2016

(Nov-14-2016, 11:06 AM)buran Wrote: Please note that there are 2 more hidden fields that you must supply as parameters to the POST request: form_id and form_build_id

Thank you, I really hadn't spotted the form_build_id and form_id fields. It's now fully working. Like you, I don't know for sure whether those fields change over time, but it appears they do, because mine is different from yours. I'm fetching them on each call, so it doesn't really matter. My code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

USERNAME = '[email protected]'
PASSWORD = 'secret'
BASE_URL = 'https://www.packtpub.com'
PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}
session = requests.session()

root_page = session.get(BASE_URL, headers=headers)
soup = BeautifulSoup(root_page.text, 'html.parser')
FORM_BUILD_ID = soup.find("input", {"name": "form_build_id"})['value']
FORM_ID = soup.find("input", {"id": "edit-packt-user-login-form"})['value']

session.post(BASE_URL, {"email": USERNAME, "password": PASSWORD,
                        'form_build_id': FORM_BUILD_ID, 'form_id': FORM_ID},
             headers=headers)

promo_page = session.get(PROMO_URL, headers=headers)
soup = BeautifulSoup(promo_page.text, 'html.parser')
current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
print(session.get(current_offer_href, headers=headers))

Output:
<Response [200]>
Process finished with exit code 0
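[Editor's note] One fragility in the working version above: if the site ever renames or drops a hidden field, soup.find() returns None and the ['value'] subscript raises a TypeError. A small guard makes the failure readable; this sketch deliberately uses a made-up page that lacks the field to show the failure path:

```python
from bs4 import BeautifulSoup

# A hypothetical page that is missing the expected hidden field,
# e.g. after a site redesign.
html = '<form><input type="text" name="email"></form>'
soup = BeautifulSoup(html, 'html.parser')

build_id_input = soup.find('input', {'name': 'form_build_id'})
if build_id_input is None:
    # Fail with a clear message instead of a TypeError on ['value'].
    message = 'form_build_id not found; the login form layout may have changed'
else:
    message = build_id_input['value']
print(message)
```

The same check applies to the free-ebook div on the promo page, which likewise disappears if the promo ends or the markup changes.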