Python Forum
[SOLVED] requests returning HTTP 404 when I follow a link after I do a POST
#1
I get one free ebook a day from Packt Publishing with their "Free Learning - Free Technology Ebooks" promo, and I'm trying to automate the process. I do a POST against their root path to log in, then a GET on the promo URL, and use BeautifulSoup 4 to get the href of the "claim your free ebook" link. That's where I'm stuck. Here's the code:


    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    import requests
    from bs4 import BeautifulSoup
    
    USERNAME = '[email protected]'
    PASSWORD = 'secret'
    BASE_URL = 'https://www.packtpub.com'
    PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'
    
    session = requests.session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}
    session.post(BASE_URL, {"username": USERNAME, "password": PASSWORD}, headers=headers)
    
    response = session.get(PROMO_URL, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
    print(current_offer_href)
    print(session.get(current_offer_href, headers=headers))
current_offer_href holds the correct value: if you go to the site today (13/NOV/2016) and inspect the button, you will find it. In this case it's https://www.packtpub.com/freelearning-claim/17276/21478. If I do a GET against current_offer_href I receive <Response [404]>. What I should be getting is a redirect to https://www.packtpub.com/account/my-ebooks, because that's what happens when I click the button manually on the site. What's wrong here?
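
For what it's worth, the login POST itself can be inspected like this (a sketch reusing the names from the code above):

login = session.post(BASE_URL, {"username": USERNAME, "password": PASSWORD}, headers=headers)
print(login.status_code)            # 200 does not necessarily mean "logged in"
print(login.history)                # any redirects that happened on the way
print(session.cookies.get_dict())   # session cookies received so far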
#2
Please post (cut & paste) the error traceback.
Thank you
#3
(Nov-13-2016, 03:32 PM)Larz60+ Wrote: Please post (cut & paste) the error traceback.
Thank you

There's no error, just:

D:\...\python.exe D:/.../main.py
https://www.packtpub.com/freelearning-claim/17276/21478
<Response [404]>

Process finished with exit code 0
#4
They're probably blocking bots.
#5
(Nov-13-2016, 05:02 PM)micseydel Wrote: They're probably blocking bots.

Yes... First try to use a plausible HTTP Referer in your headers, then a known User-Agent, then watch your cookies.
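
For example, a minimal sketch (the Referer value is just a plausible guess, not a known requirement of the site):

import requests

session = requests.session()
headers = {
    # a real browser's User-Agent string
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36',
    # pretend we arrived from the promo page itself
    'Referer': 'https://www.packtpub.com/packt/offers/free-learning',
}
response = session.get('https://www.packtpub.com', headers=headers)
# inspect which cookies the server set; a session cookie is usually
# required before any stateful action like claiming an ebook
print(session.cookies.get_dict())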
#6
Use a browser's headers to make the web site think you are a browser. From the urllib HOWTO in the Python docs:


Headers

We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.

Some websites dislike being browsed by programs, or send different versions to different browsers. By default urllib identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer.



import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
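
For comparison, the same request with the requests library that the OP is already using (a sketch only, same placeholder URL and values as above):

import requests

url = 'http://www.someserver.com/cgi-bin/register.cgi'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}

# requests url-encodes the form data and sets the Content-Type for us
response = requests.post(url, data=values, headers=headers)
the_page = response.text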
#7
I found the problem: the site wasn't blocking me, I was just not logged in. I'm trying to POST against "https://www.packtpub.com", as I can't find any /login or /signin path, but it isn't working as I wanted. To log in manually, you visit the root site, click "Log in", and a bar slides down from the top with the fields. They don't seem to have a dedicated login page, so how can I log in using POST in this case?

[Image: E2xi9dt.png]

[Image: 2vrKXvP.png]
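
One way to see what that login bar actually submits is to dump every form on the root page, hidden inputs included (a minimal sketch; it assumes the form is present in the page HTML rather than injected by JavaScript):

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.packtpub.com',
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.text, 'html.parser')

# print each form's action/id and the name/value of every input,
# including the hidden fields a browser would submit silently
for form in soup.find_all('form'):
    print(form.get('action'), form.get('id'))
    for field in form.find_all('input'):
        print('   ', field.get('name'), '=', field.get('value'))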
#8
Have you tried a URL such as http://user:password@host/ ?

Btw, the site should have answered with a 403, not a 404.
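
In requests, credentials in the URL like that mean HTTP Basic auth. A minimal sketch (placeholder credentials; this only works if the server actually supports Basic auth, which a site with an HTML login form usually does not):

import requests

# equivalent of http://user:password@host/ -- the credentials are
# sent in an Authorization header rather than embedded in the URL
response = requests.get('https://www.packtpub.com',
                        auth=('user@example.com', 'secret'))
print(response.status_code)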
#9
Please note that there are two more hidden fields that you must supply as parameters to the POST request - form_id and form_build_id.
Also, the username field is actually 'email', not 'username'.

import requests
from bs4 import BeautifulSoup

USERNAME = '[email protected]'
PASSWORD = 'mypassword'
FORM_BUILD_ID = 'form-c4aeea083a82fdae7d43562ee8cafeb7'
FORM_ID = 'packt_user_login_form'

BASE_URL = 'https://www.packtpub.com'
PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'

session = requests.session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}
# log in, supplying the two hidden form fields as well
session.post(BASE_URL, {"email": USERNAME, "password": PASSWORD, 'form_build_id': FORM_BUILD_ID, 'form_id': FORM_ID}, headers=headers)

response = session.get(PROMO_URL, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
print(current_offer_href)
print(session.get(current_offer_href, headers=headers))
my_account_url = BASE_URL + '/account'  # https://www.packtpub.com/account
response = session.get(my_account_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('div', class_='menu-account').find('h1').text)
Output:
https://www.packtpub.com/freelearning-claim/8294/21478
<Response [200]>
Your Name
I'm not sure if form_build_id changes over time.
#10
(Nov-14-2016, 11:06 AM)buran Wrote: Please note that there are two more hidden fields that you must supply as parameters to the POST request - form_id and form_build_id. Also, the username field is actually 'email', not 'username'. [...]

Thank you. I really didn't spot the form_build_id and form_id fields. It's now fully working. Like you, I don't know whether those fields change over time, but it appears they do, because mine is different from yours. I'm fetching them on each run, so it doesn't really matter. My code:


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

USERNAME = '[email protected]'
PASSWORD = 'secret'

BASE_URL = 'https://www.packtpub.com'
PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}

session = requests.session()

# scrape the hidden form fields from the root page on every run,
# since form_build_id appears to change over time
root_page = session.get(BASE_URL, headers=headers)
soup = BeautifulSoup(root_page.text, 'html.parser')
FORM_BUILD_ID = soup.find("input", {"name": "form_build_id"})['value']
FORM_ID = soup.find("input", {"id": "edit-packt-user-login-form"})['value']

# log in, then claim the current free ebook
session.post(BASE_URL, {"email": USERNAME, "password": PASSWORD, 'form_build_id': FORM_BUILD_ID, 'form_id': FORM_ID}, headers=headers)
promo_page = session.get(PROMO_URL, headers=headers)
soup = BeautifulSoup(promo_page.text, 'html.parser')
current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
print(session.get(current_offer_href, headers=headers))
Output:
<Response [200]>

Process finished with exit code 0

