Python Forum
how to add a login to a bs4 parser-script
Dear Python experts,

first of all, I hope you are all well.

I want to scrape a website that requires a login with a password first. How can I start scraping it with Python using the beautifulsoup4 library?
Below is what I do at the moment:

import requests
from bs4 import BeautifulSoup as BS

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # this page needs a 'User-Agent' header

url = 'https://wordpress.org//{}/'

for page in range(1, 3):
    print('\n--- PAGE:', page, '---\n')
    
    # read page with list of posts
    r = session.get(url.format(page))
But what should I do to log in to the WordPress support forums?
Note that my parser job requires a login.
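
One way this might look with the requests session from the script above, as a minimal sketch only: fetch the login page, copy the form's hidden fields, add the credentials, and POST them back with the same session. The login URL and the field names `log` and `pwd` are assumptions based on the stock WordPress login form and are not confirmed for this site; `collect_form_fields` and `login` are hypothetical helper names.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: URL and field names follow the stock WordPress login form.
# Check the real form in the browser's developer tools before relying on them.
LOGIN_URL = "https://login.wordpress.org/wp-login.php"


def collect_form_fields(html, form_selector="form"):
    """Return the named <input> fields of the first matching form as a dict."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.select_one(form_selector)
    if form is None:
        return {}
    fields = {}
    for tag in form.find_all("input"):
        name = tag.get("name")
        if name:
            fields[name] = tag.get("value", "")
    return fields


def login(session, username, password, login_url=LOGIN_URL):
    """Submit the login form through the given session and return the response."""
    # Fetch the login page first so hidden fields and cookies are picked up.
    page = session.get(login_url)
    page.raise_for_status()
    payload = collect_form_fields(page.text)
    payload["log"] = username   # assumed name of the user field
    payload["pwd"] = password   # assumed name of the password field
    response = session.post(login_url, data=payload,
                            headers={"Referer": login_url})
    response.raise_for_status()
    return response


# Usage (not executed here, it needs real credentials):
#   session = requests.Session()
#   session.headers.update({'User-Agent': 'Mozilla/5.0'})
#   login(session, 'yourusername', 'yourpassword')
#   r = session.get(url.format(page))   # same session, now with login cookies
```

A 200 status alone does not prove the login worked, so it is probably worth checking the response for something only a logged-in user sees, such as a logout link.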

I found some options and had a closer look at them; I have added them here.

The first of several methods looks like this:


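from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 replacement for urllib2

# fetch the page and parse it (no login involved yet)
response = urlopen("http://www.python.org")
content = response.read()
soup = BeautifulSoup(content, "html.parser")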
How should the code be changed to accommodate login? Assume that the website I want to scrape is a forum that requires login. An example is http://forum.arduino.cc/index.php
Or should I use mechanize:

import mechanize
import http.cookiejar  # Python 3 name of cookielib

# keep the cookies between requests so the login survives
cj = http.cookiejar.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")

# fill in and submit the first form on the page
br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password.'
br.submit()
print(br.response().read())
Besides this, we can also go this way:

# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar

def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    #   adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will need to tweak couple things maybe regarding "token" logic
    #   (can be _token or token or _token_ or secret ... etc)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers={"Content-Type":"application/x-www-form-urlencoded",
    "User-agent":"Mozilla/5.0 Chrome/81.0.4044.92",    # Chrome 80+ as per web search
    "Host":base_url,
    "Origin":https_base_url,
    "Referer":https_base_url}

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get login page and parse out the token
    #       (using : urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page, we look for token eg. on my page it was something like this:
    #    <input type="hidden" name="_token" value="random1234567890qwertzstring">
    #       this can probably be done better with regex and similar
    #       but I'm newb, so bear with me
    html = contents.decode("utf-8")
    # text just before start and just after end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'
    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)
    # and text between them is our token, store it for second step of actual login
    token = html[start_index:end_index]


    # ... and so forth (the rest of the login logic is omitted here)

scraper_login()
See more here: https://stackoverflow.com/questions/2310...utifulsoup
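
The standard-library script above stops after the token has been extracted. As a rough sketch of how the remaining step might look: send the token together with the credentials as a form-encoded POST, then look for the check string in the answer. The field names `username`, `password` and `_token` are assumptions, and `extract_token`, `build_login_request` and `login_succeeded` are hypothetical helper names. Note that `str.find` returns -1 when the marker is missing, so the sketch checks for that instead of slicing blindly.

```python
import urllib.parse
import urllib.request


def extract_token(html, field_name="_token"):
    """Return the value of the hidden input called field_name, or None."""
    mark_start = '<input type="hidden" name="{}" value="'.format(field_name)
    start = html.find(mark_start)
    if start == -1:          # marker not on the page
        return None
    start += len(mark_start)
    end = html.find('"', start)
    if end == -1:
        return None
    return html[start:end]


def build_login_request(authentication_url, username, password, token, headers):
    """Build the form-encoded POST request for the login form."""
    # Assumption: the form fields are called _token, username and password.
    payload = urllib.parse.urlencode({
        "_token": token,
        "username": username,
        "password": password,
    }).encode("utf-8")
    return urllib.request.Request(authentication_url, data=payload,
                                  headers=headers, method="POST")


def login_succeeded(opener, request, check_string="Logout"):
    """Send the request and report whether check_string is in the answer."""
    with opener.open(request) as response:
        body = response.read().decode("utf-8", errors="replace")
    return check_string in body
```

Inside `scraper_login()` this would be called with the `opener`, `headers`, `authentication_url` and `token` that the script has already set up, so the cookies from the first request are sent along with the login.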

But there is an even simpler way:

a method that gets us there without Selenium, mechanize, or other third-party tools, although it is only semi-automated. When we log in to a site in the normal way, we identify ourselves with our credentials, and that identity is then reused for every further interaction. It is stored in cookies and headers and stays valid for a limited period of time.

What we need to do is send the same cookies and headers with our HTTP requests, and we'll be in.

To replicate that, follow these steps:

In the browser, open the developer tools.
Go to the site and log in.
After the login, go to the Network tab and refresh the page.
At this point we should see a list of requests, the top one being the actual site. That request is our focus, because it contains the identity data that we can use with Python and BeautifulSoup to scrape the site. Right-click that request (the top one), hover over Copy, and choose Copy as cURL ...
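
The steps above could be carried over to Python roughly like this, as a sketch under assumptions: the copied cURL command contains a `cookie:` header, whose value is split into name/value pairs and loaded into a requests session together with the other headers. `parse_cookie_header` and `session_from_browser` are hypothetical helper names, and the cookie names in the usage comment are placeholders.

```python
import requests


def parse_cookie_header(value):
    """Split a 'name=value; name2=value2' cookie header into a dict."""
    cookies = {}
    for part in value.split(";"):
        name, sep, val = part.strip().partition("=")
        if sep:                      # skip fragments without '='
            cookies[name] = val
    return cookies


def session_from_browser(cookies, headers):
    """Build a session that reuses cookies and headers copied from the browser."""
    session = requests.Session()
    session.headers.update(headers)
    session.cookies.update(cookies)
    return session


# Usage (placeholder values, copy the real ones from the cURL command):
#   cookies = parse_cookie_header("some_login_cookie=...; other_cookie=...")
#   session = session_from_browser(cookies, {"User-Agent": "Mozilla/5.0"})
#   r = session.get(url.format(page))
#   soup = BeautifulSoup(r.text, "html.parser")
```

Since the copied cookies expire after a while, this has to be repeated by hand whenever the site logs the session out.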



What do you suggest here?

I look forward to hearing from you.
WordPress super toolkits: a. http://wpgear.org/ and b. https://github.com/miziomon/awesome-wordpress :: Awesome WordPress: a curated list of amazingly awesome WordPress resources, and awesome Python things: https://github.com/vinta/awesome-python
Messages In This Thread
how to add a login to a bs4 parser-script - by apollo - Jun-17-2020, 04:51 PM
