
 [SOLVED] requests returning HTTP 404 when I follow a link after I do a POST
#1
I get one free ebook a day from Packt Publishing with their "Free Learning - Free Technology Ebooks" promo, and I'm trying to automate the process. I do a POST against their root path to log in; after that I do a GET on the promo URL and use BeautifulSoup 4 to get the href of the "claim your free ebook" link, and now I'm stuck. Here's the code:


    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    import requests
    from bs4 import BeautifulSoup
    
    USERNAME = 'name@email.com'
    PASSWORD = 'secret'
    BASE_URL = 'https://www.packtpub.com'
    PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'
    
    session = requests.session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}
    session.post(BASE_URL, {"username": USERNAME, "password": PASSWORD}, headers=headers)
    
    response = session.get(PROMO_URL, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
    print(current_offer_href)
    print(session.get(current_offer_href, headers=headers))
The current_offer_href variable holds the correct value; if you go to the site today (13/NOV/2016) and inspect the button you will find it. In this case it holds https://www.packtpub.com/freelearning-claim/17276/21478. If I do a GET against current_offer_href I receive <Response [404]>. What I should be getting is a redirect to https://www.packtpub.com/account/my-ebooks, because that's what happens if I click the button manually on the site. What's wrong here?
#2
Please post (cut & paste) error traceback
Thank you
#3
(Nov-13-2016, 03:32 PM)Larz60+ Wrote: Please post (cut & paste) error traceback
Thank you

There's no error, just:

D:\...\python.exe D:/.../main.py
https://www.packtpub.com/freelearning-claim/17276/21478
<Response [404]>

Process finished with exit code 0

#4
They're probably blocking bots.
#5
(Nov-13-2016, 05:02 PM)micseydel Wrote: They're probably blocking bots.

Yes... First try using a plausible Referer in your headers, then a known User-Agent, then watch your cookies.
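For instance, a minimal sketch of the first two suggestions with a requests Session (the User-Agent string and Referer values below are only placeholders; preparing the request without sending it lets you inspect what would actually go over the wire):

```python
import requests

# Sketch: give the session browser-like headers before making any request.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36',
    'Referer': 'https://www.packtpub.com/',
})

# Prepare (but do not send) a request to see the merged headers.
prepared = session.prepare_request(
    requests.Request('GET', 'https://www.packtpub.com/packt/offers/free-learning'))
print(prepared.headers['Referer'])
print(prepared.headers['User-Agent'])
```

The cookie part comes for free: a `requests.Session` stores any cookies the server sets and sends them back on subsequent requests.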
#6
Use browser headers to make the web site think you are a browser.


Headers

We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.

Some websites dislike being browsed by programs, or send different versions to different browsers. By default urllib identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer.



import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()


#7
I found the problem: the site wasn't blocking me, I just wasn't logged in. I'm doing the POST against "https://www.packtpub.com" because I can't find any /login or /signin path, but it isn't working as I wanted. To log in manually you visit the root site, click "Log in", and a bar drops down from the top with the login fields. They don't seem to have a dedicated login page, so how can I log in with a POST in this case?

#8
Have you tried a URL such as http://user:password@host/ ?

Btw, the site should have answered with a 403, not a 404.
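For what it's worth, the user:password@host URL form is just HTTP Basic auth, and the Authorization header it implies can be computed with the standard library. A sketch (Packt's login is a normal HTML form, so Basic auth is unlikely to work there, but it shows what that URL syntax actually sends):

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the Authorization header that http://user:password@host/ implies."""
    token = base64.b64encode(f'{user}:{password}'.encode('utf-8')).decode('ascii')
    return 'Basic ' + token

print(basic_auth_header('user', 'password'))  # Basic dXNlcjpwYXNzd29yZA==
```

With requests you would not embed credentials in the URL at all; you would pass `auth=('user', 'password')` to the request and let the library build this header.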
#9
Please note that there are 2 more hidden fields that you must supply as parameters in the POST request - form_id and form_build_id.
Also, the username field is actually 'email', not 'username'.

import requests
from bs4 import BeautifulSoup
 
USERNAME = 'me@email.com'
PASSWORD = 'mypassword'
FORM_BUILD_ID='form-c4aeea083a82fdae7d43562ee8cafeb7'
FORM_ID = 'packt_user_login_form'

BASE_URL = 'https://www.packtpub.com'
PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'
 
session = requests.session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}
session.post(BASE_URL, {"email": USERNAME, "password": PASSWORD, 'form_build_id':FORM_BUILD_ID, 'form_id':FORM_ID}, headers=headers)
 
response = session.get(PROMO_URL, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
print(current_offer_href)
print(session.get(current_offer_href, headers=headers))
my_account_url = BASE_URL+ '/account' #https://www.packtpub.com/account
response = session.get(my_account_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('div', class_='menu-account').find('h1').text)
Output:
https://www.packtpub.com/freelearning-claim/8294/21478
<Response [200]>
Your Name
I'm not sure if form_build_id changes over time.
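Since form_build_id may change, one way to stay safe is to scrape both hidden fields from the login page on every run instead of hard-coding them. A standard-library sketch (the HTML snippet below is a made-up stand-in for the real page you would fetch with session.get(BASE_URL).text):

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collects name -> value for every <input type="hidden"> in a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden' and 'name' in attrs:
            self.fields[attrs['name']] = attrs.get('value', '')

# Stand-in for the login page HTML
sample = ('<form><input type="hidden" name="form_build_id" '
          'value="form-c4aeea083a82fdae7d43562ee8cafeb7"/>'
          '<input type="hidden" name="form_id" value="packt_user_login_form"/></form>')

collector = HiddenInputCollector()
collector.feed(sample)
print(collector.fields['form_build_id'])  # form-c4aeea083a82fdae7d43562ee8cafeb7
print(collector.fields['form_id'])        # packt_user_login_form
```

The collected dict can then be merged into the POST payload alongside the email and password fields.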
#10
(Nov-14-2016, 11:06 AM)buran Wrote: Please, note that there are 2 more hidden fields that you must supply as parameters to the POST request - form_id and form_build_id
Also username field is actually 'email', not 'username'
[...]

Thank you. I really didn't spot the form_build_id and form_id fields. It's now fully working. Like you, I don't know for sure whether those fields change over time, but it appears they do, because mine is different from yours. I'm fetching them on each run anyway, so it doesn't really matter. My code:


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

USERNAME = 'user@email.com'
PASSWORD = 'secret'

BASE_URL = 'https://www.packtpub.com'
PROMO_URL = 'https://www.packtpub.com/packt/offers/free-learning'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2810.1 Safari/537.36'}

session = requests.session()
root_page = session.get(BASE_URL, headers=headers)
soup = BeautifulSoup(root_page.text, 'html.parser')
FORM_BUILD_ID = soup.find("input", {"name": "form_build_id"})['value']
FORM_ID = soup.find("input", {"id": "edit-packt-user-login-form"})['value']

session.post(BASE_URL, {"email": USERNAME, "password": PASSWORD, 'form_build_id': FORM_BUILD_ID, 'form_id': FORM_ID}, headers=headers)
promo_page = session.get(PROMO_URL, headers=headers)
soup = BeautifulSoup(promo_page.text, 'html.parser')
current_offer_href = BASE_URL + soup.find("div", {"class": "free-ebook"}).a['href']
print(session.get(current_offer_href, headers=headers))
Output:

<Response [200]>

Process finished with exit code 0
