
 [SOLVED] requests returning HTTP 404 when I follow a link after I do a POST
#1
I get one free ebook a day from Packt Publishing with their "Free Learning - Free Technology Ebooks" promo, and I'm trying to automate the process. I do a POST against their root path to log in; after that I do a GET on the promo URL and use BeautifulSoup 4 to get the href of the "claim your free ebook" link, and now I'm stuck. Here's the code:


    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    import requests
    from bs4 import BeautifulSoup
    
    USERNAME = 'name@email.com'
    PASSWORD = 'secret'
    BASE_URL = 'https://www.packtpub.com'
    PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'
    
    session = requests.session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}
    session.post(BASE_URL, {"username": USERNAME, "password": PASSWORD}, headers=headers)
    
    response = session.get(PROMO_URL, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
    print(current_offer_href)
    print(session.get(current_offer_href, headers=headers))
The current_offer_href variable holds the correct value; if you go to the site today (13/NOV/2016) and inspect the button you will find it. In this case it holds https://www.packtpub.com/freelearning-claim/17276/21478. If I do a GET against current_offer_href I receive <Response [404]>. What I should be getting is a redirect to https://www.packtpub.com/account/my-ebooks, because that's what happens if I click the button manually on the site. What's wrong here?
#2
Please post (cut & paste) error traceback
Thank you
#3
(Nov-13-2016, 03:32 PM)Larz60+ Wrote: Please post (cut & paste) error traceback
Thank you

There's no error, just:

D:\...\python.exe D:/.../main.py
https://www.packtpub.com/freelearning-claim/17276/21478
<Response [404]>

Process finished with exit code 0

#4
They're probably blocking bots.
#5
(Nov-13-2016, 05:02 PM)micseydel Wrote: They're probably blocking bots.

Yes... First try using a plausible Referer in your headers, then a known User-Agent, then watch your cookies.
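For instance, a minimal sketch of the first two suggestions with a requests Session (the User-Agent string and Referer values below are only placeholders; preparing the request without sending it lets you inspect what would actually go over the wire):

```python
import requests

# Sketch: give the session browser-like headers before making any request.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36',
    'Referer': 'https://www.packtpub.com/',
})

# Prepare (but do not send) a request to see the merged headers.
prepared = session.prepare_request(
    requests.Request('GET', 'https://www.packtpub.com/packt/offers/free-learning'))
print(prepared.headers['Referer'])
print(prepared.headers['User-Agent'])
```

The cookie part comes for free: a `requests.Session` stores any cookies the server sets and sends them back on subsequent requests.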
#6
Use browser headers to make the web site think you are a browser.


Headers

We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.

Some websites dislike being browsed by programs, or send different versions to different browsers. By default urllib identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer.



import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()


#7
I found the problem: the site wasn't blocking me, I just wasn't logged in. I'm doing the POST against "https://www.packtpub.com" because I can't find any /login or /signin path, but it isn't working as I wanted. To log in manually you visit the root site, click "Log in", and a bar drops down from the top with the login fields. They don't seem to have a dedicated login page, so how can I log in with a POST in this case?

#8
Have you tried a URL such as http://user:password@host/ ?

Btw, the site should have answered with a 403, not a 404.
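For what it's worth, the user:password@host URL form is just HTTP Basic auth, and the Authorization header it implies can be computed with the standard library. A sketch (Packt's login is a normal HTML form, so Basic auth is unlikely to work there, but it shows what that URL syntax actually sends):

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the Authorization header that http://user:password@host/ implies."""
    token = base64.b64encode(f'{user}:{password}'.encode('utf-8')).decode('ascii')
    return 'Basic ' + token

print(basic_auth_header('user', 'password'))  # Basic dXNlcjpwYXNzd29yZA==
```

With requests you would not embed credentials in the URL at all; you would pass `auth=('user', 'password')` to the request and let the library build this header.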
#9
Please note that there are 2 more hidden fields that you must supply as parameters in the POST request - form_id and form_build_id.
Also, the username field is actually 'email', not 'username'.

import requests
from bs4 import BeautifulSoup
 
USERNAME = 'me@email.com'
PASSWORD = 'mypassword'
FORM_BUILD_ID='form-c4aeea083a82fdae7d43562ee8cafeb7'
FORM_ID = 'packt_user_login_form'

BASE_URL = 'https://www.packtpub.com'
PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'
 
session = requests.session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}
session.post(BASE_URL, {"email": USERNAME, "password": PASSWORD, 'form_build_id':FORM_BUILD_ID, 'form_id':FORM_ID}, headers=headers)
 
response = session.get(PROMO_URL, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
print(current_offer_href)
print(session.get(current_offer_href, headers=headers))
my_account_url = BASE_URL+ '/account' #https://www.packtpub.com/account
response = session.get(my_account_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('div', class_='menu-account').find('h1').text)
Output:
https://www.packtpub.com/freelearning-claim/8294/21478
<Response [200]>
Your Name
I'm not sure if form_build_id changes over time.
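Since form_build_id may change, one way to stay safe is to scrape both hidden fields from the login page on every run instead of hard-coding them. A standard-library sketch (the HTML snippet below is a made-up stand-in for the real page you would fetch with session.get(BASE_URL).text):

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collects name -> value for every <input type="hidden"> in a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden' and 'name' in attrs:
            self.fields[attrs['name']] = attrs.get('value', '')

# Stand-in for the login page HTML
sample = ('<form><input type="hidden" name="form_build_id" '
          'value="form-c4aeea083a82fdae7d43562ee8cafeb7"/>'
          '<input type="hidden" name="form_id" value="packt_user_login_form"/></form>')

collector = HiddenInputCollector()
collector.feed(sample)
print(collector.fields['form_build_id'])  # form-c4aeea083a82fdae7d43562ee8cafeb7
print(collector.fields['form_id'])        # packt_user_login_form
```

The collected dict can then be merged into the POST payload alongside the email and password fields.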
#10
(Nov-14-2016, 11:06 AM)buran Wrote: Please, note that there are 2 more hidden fields that you must supply as parameters to the POST request - form_id and form_build_id
Also username field is actually 'email', not 'username'
[...]

Thank you. I really didn't spot the form_build_id and form_id fields. It's now fully working. Like you, I don't know for sure whether those fields change over time, but it appears they do, because mine is different from yours. I'm fetching them on each run anyway, so it doesn't really matter. My code:


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

USERNAME = 'user@email.com'
PASSWORD = 'secret'

BASE_URL = 'https://www.packtpub.com'
PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}

session = requests.session()
root_page = session.get(BASE_URL, headers=headers)
soup = BeautifulSoup(root_page.text, 'html.parser')
FORM_BUILD_ID = soup.find("input", {"name": "form_build_id"})['value']
FORM_ID = soup.find("input", {"id": "edit-packt-user-login-form"})['value']

session.post(BASE_URL, {"email": USERNAME, "password": PASSWORD, 'form_build_id': FORM_BUILD_ID, 'form_id': FORM_ID}, headers=headers)
promo_page = session.get(PROMO_URL, headers=headers)
soup = BeautifulSoup(promo_page.text, 'html.parser')
current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
print(session.get(current_offer_href, headers=headers))
Output:

<Response [200]>

Process finished with exit code 0
