Python Forum
[SOLVED] requests returning HTTP 404 when I follow a link after I do a POST - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: [SOLVED] requests returning HTTP 404 when I follow a link after I do a POST (/thread-904.html)



[SOLVED] requests returning HTTP 404 when I follow a link after I do a POST - JChris - Nov-13-2016

I get one free ebook a day from Packt Publishing with their "Free Learning - Free Technology Ebooks" promo, and I'm trying to automate the process. I do a POST against their root path to log in; after that, I do a GET on the promo URL and use BeautifulSoup 4 to get the href of the "claim your free ebook" link, and now I'm stuck. Here's the code:


    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    import requests
    from bs4 import BeautifulSoup
    
    USERNAME = '[email protected]'
    PASSWORD = 'secret'
    BASE_URL = 'https://www.packtpub.com'
    PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'
    
    session = requests.session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}
    session.post(BASE_URL, {"username": USERNAME, "password": PASSWORD}, headers=headers)
    
    response = session.get(PROMO_URL, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
    print(current_offer_href)
    print(session.get(current_offer_href, headers=headers))
The current_offer_href holds the correct value; if you go to the site today (13/NOV/2016) and inspect the button, you will find it. In this case it holds https://www.packtpub.com/freelearning-claim/17276/21478. If I do a GET against current_offer_href, I receive <Response [404]>. What I should be getting is a redirect to https://www.packtpub.com/account/my-ebooks, because that's what happens when I click the button manually on the site. What's wrong here?
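For debugging this kind of thing: requests follows redirects by default, and the response's history records any hops that were followed. A small sketch (the helper name is my own) of how to tell a followed redirect apart from a flat rejection:

```python
def explain_response(resp):
    """Summarize redirect hops and the final status of a requests response."""
    hops = [h.status_code for h in resp.history]  # intermediate 3xx responses
    return {"redirects": hops, "final": resp.status_code}

# e.g. resp = session.get(current_offer_href, headers=headers)
# A claim that worked should look like {"redirects": [302], "final": 200};
# a bare 404 with an empty history means the server refused the request outright.
```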


RE: requests returning HTTP 404 when I follow a link after I do a POST - Larz60+ - Nov-13-2016

Please post (cut & paste) error traceback
Thank you


RE: requests returning HTTP 404 when I follow a link after I do a POST - JChris - Nov-13-2016

(Nov-13-2016, 03:32 PM)Larz60+ Wrote: Please post (cut & paste) error traceback
Thank you

There's no error, just:

D:\...\python.exe D:/.../main.py
https://www.packtpub.com/freelearning-claim/17276/21478
<Response [404]>

Process finished with exit code 0



RE: requests returning HTTP 404 when I follow a link after I do a POST - micseydel - Nov-13-2016

They're probably blocking bots.


RE: requests returning HTTP 404 when I follow a link after I do a POST - Ofnuts - Nov-13-2016

(Nov-13-2016, 05:02 PM)micseydel Wrote: They're probably blocking bots.

Yes... First try using a plausible Referer in your headers, then a known User-Agent, then watch your cookies.
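In requests terms, that advice looks something like this (the header values are placeholders; use whatever your real browser sends):

```python
import requests

session = requests.session()
# Headers set here are sent on every request made through this session.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36',
    'Referer': 'https://www.packtpub.com/packt/offers/free-learning',
})
# Cookies from earlier responses are stored and re-sent automatically by the
# session object, which covers the "watch your cookies" part.
```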


RE: requests returning HTTP 404 when I follow a link after I do a POST - Blue Dog - Nov-13-2016

Use browser headers to make the website think you are a browser.


Headers

We'll discuss here one particular HTTP header, to illustrate how to add
headers to your HTTP request.

Some websites dislike being browsed by programs, or send different
versions to different browsers. By default urllib identifies itself as
Python-urllib/x.y (where x and y are the major and minor version
numbers of the Python release, e.g. Python-urllib/2.5), which may
confuse the site, or just plain not work. The way a browser identifies
itself is through the User-Agent header. When you create a Request
object you can pass a dictionary of headers in. The following example
makes the same request as above, but identifies itself as a version of
Internet Explorer.



import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()



RE: requests returning HTTP 404 when I follow a link after I do a POST - JChris - Nov-13-2016

I found the problem: the site wasn't blocking me, I just wasn't logged in. I'm trying to do a POST against "https://www.packtpub.com" because I can't find any /login or /signin path, but it isn't working as I expected. To log in manually, one needs to visit the root site and click "Log in", and a bar slides down from the top with the fields. They don't seem to have a dedicated login page, so how can I log in using POST in this case?

[Image: E2xi9dt.png]

[Image: 2vrKXvP.png]
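One way to find out what the POST needs, even without a dedicated login page, is to parse the login <form> and list every <input> it contains, hidden ones included. A sketch using BeautifulSoup on a made-up form (the field names and values here are illustrative, not copied from Packt's page):

```python
from bs4 import BeautifulSoup

# A made-up login form, shaped like the one the "Log in" bar submits.
html = '''
<form id="packt-user-login-form" method="post">
  <input type="text" name="email">
  <input type="password" name="password">
  <input type="hidden" name="form_build_id" value="form-abc123">
  <input type="hidden" name="form_id" value="packt_user_login_form">
</form>
'''

soup = BeautifulSoup(html, 'html.parser')
form = soup.find('form', id='packt-user-login-form')
# Map every input's name to its value; the hidden fields carry the tokens the
# server expects to see back in the POST payload alongside email/password.
fields = {tag['name']: tag.get('value', '') for tag in form.find_all('input')}
print(fields)
```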


RE: requests returning HTTP 404 when I follow a link after I do a POST - Ofnuts - Nov-14-2016

Have you tried a URL such as:
http://user:password@host/
?

Btw, the site should have answered with a 403, not a 404.
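For completeness: http://user:password@host/ is HTTP Basic auth, which requests spells as requests.get(url, auth=('user', 'password')). Since Packt uses a regular HTML login form rather than Basic auth, this probably doesn't apply here, but the header it produces is easy to see:

```python
import base64

# The Authorization header for http://user:password@host/ is just
# base64("user:password") with a "Basic " prefix.
token = base64.b64encode(b'user:password').decode('ascii')
auth_header = 'Basic ' + token
print(auth_header)
```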


RE: requests returning HTTP 404 when I follow a link after I do a POST - buran - Nov-14-2016

Please note that there are 2 more hidden fields that you must supply as parameters to the POST request: form_id and form_build_id.
Also, the username field is actually 'email', not 'username'.

import requests
from bs4 import BeautifulSoup
 
USERNAME = '[email protected]'
PASSWORD = 'mypassword'
FORM_BUILD_ID='form-c4aeea083a82fdae7d43562ee8cafeb7'
FORM_ID = 'packt_user_login_form'

BASE_URL = 'https://www.packtpub.com'
PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'
 
session = requests.session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}
session.post(BASE_URL, {"email": USERNAME, "password": PASSWORD, 'form_build_id': FORM_BUILD_ID, 'form_id': FORM_ID}, headers=headers)
 
response = session.get(PROMO_URL, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
print(current_offer_href)
print(session.get(current_offer_href, headers=headers))
my_account_url = BASE_URL+ '/account' #https://www.packtpub.com/account
response = session.get(my_account_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('div', class_='menu-account').find('h1').text)
Output:
https://www.packtpub.com/freelearning-claim/8294/21478
<Response [200]>
Your Name
I'm not sure if form_build_id changes over time.


RE: requests returning HTTP 404 when I follow a link after I do a POST - JChris - Nov-14-2016

(Nov-14-2016, 11:06 AM)buran Wrote: Please note that there are 2 more hidden fields that you must supply as parameters to the POST request: form_id and form_build_id.
Also, the username field is actually 'email', not 'username'.


Thank you. I really didn't spot the form_build_id and form_id fields. It's now fully working. Just like you, I don't know whether those fields change over time, but it appears they do, because mine is different from yours. I'm fetching them on each run, so it doesn't really matter. My code:


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

USERNAME = '[email protected]'
PASSWORD = 'secret'

BASE_URL = 'https://www.packtpub.com'
PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}

session = requests.session()
root_page = session.get(BASE_URL, headers=headers)
soup = BeautifulSoup(root_page.text, 'html.parser')
FORM_BUILD_ID = soup.find("input", {"name": "form_build_id"})['value']
FORM_ID = soup.find("input", {"id": "edit-packt-user-login-form"})['value']

session.post(BASE_URL, {"email": USERNAME, "password": PASSWORD, 'form_build_id': FORM_BUILD_ID, 'form_id': FORM_ID}, headers=headers)
promo_page = session.get(PROMO_URL, headers=headers)
soup = BeautifulSoup(promo_page.text, 'html.parser')
current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
print(session.get(current_offer_href, headers=headers))
Output:

<Response [200]>

Process finished with exit code 0