Python Forum
how to add a login to a bs4 parser-script
#1
Dear Python experts,

first of all, I hope you are all doing well.

I want to scrape a website that requires a login with a password first. How can I start scraping it with Python using the beautifulsoup4 library?
Below is what I do at the moment:

import requests
from bs4 import BeautifulSoup as BS

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # this page needs a 'User-Agent' header

url = 'https://wordpress.org//{}/'

for page in range(1, 3):
    print('\n--- PAGE:', page, '---\n')
    
    # read page with list of posts
    r = session.get(url.format(page))
But what should I do to log in to the WordPress support forums?
Note that my parser job requires a login.

I found some options and had a closer look at them. I have added them here.

The first of several methods goes like this:


from bs4 import BeautifulSoup    
import urllib2 
url = urllib2.urlopen("http://www.python.org")    
content = url.read()    
soup = BeautifulSoup(content)
How should the code be changed to accommodate a login? Assume that the website I want to scrape is a forum that requires login. An example is http://forum.arduino.cc/index.php
Or should I use mechanize?

import mechanize
from bs4 import BeautifulSoup
import urllib2 
import cookielib

cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")

br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password.'
br.submit()
print br.response().read()
Besides this, we can also go this way:

# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar

def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    #   adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will need to tweak couple things maybe regarding "token" logic
    #   (can be _token or token or _token_ or secret ... etc)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers={"Content-Type":"application/x-www-form-urlencoded",
    "User-agent":"Mozilla/5.0 Chrome/81.0.4044.92",    # Chrome 80+ as per web search
    "Host":base_url,
    "Origin":https_base_url,
    "Referer":https_base_url}

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get login page and parse out the token
    #       (using : urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page, we look for token eg. on my page it was something like this:
    #    <input type="hidden" name="_token" value="random1234567890qwertzstring">
    #       this can probably be done better with regex and similar
    #       but I'm newb, so bear with me
    html = contents.decode("utf-8")
    # text just before start and just after end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'
    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)
    # and text between them is our token, store it for second step of actual login
    token = html[start_index:end_index]


and so forth ...

scraper_login()
see more here https://stackoverflow.com/questions/2310...utifulsoup
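
The script above cuts the token out of the page with str.find and notes that this "can probably be done better". A minimal sketch of one alternative, staying within the standard library, is to let html.parser collect every hidden input. The class name HiddenInputCollector, the helper extract_hidden_fields and the sample markup are assumptions made for illustration; the field name _token is taken from the comment in the script.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            # a missing value attribute is stored as an empty string
            self.fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of all hidden form fields found in the given HTML text."""
    collector = HiddenInputCollector()
    collector.feed(html)
    collector.close()
    return collector.fields


# hypothetical login page fragment, modelled on the comment in the script above
sample = (
    '<form action="/login">'
    '<input type="hidden" name="_token" value="random1234567890qwertzstring">'
    '<input type="text" name="username">'
    '</form>'
)
token = extract_hidden_fields(sample)["_token"]
print(token)
```

Unlike the find-based version, this does not depend on the exact attribute order or spacing inside the input tag, and it returns every hidden field at once, which helps when a form carries more than one token.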

But there is an even simpler way:

a method that gets us there without Selenium, mechanize, or other third-party tools, although it is only semi-automated. Basically, when we log in to a site in the normal way, we identify ourselves uniquely with our credentials. That identity is stored in cookies and headers for a limited period of time and is reused for every later interaction.

What we need to do is send the same cookies and headers with our HTTP requests, and we'll be in.

To replicate that, follow these steps:

1. In the browser, open the developer tools.
2. Go to the site and log in.
3. After the login, go to the network tab and refresh the page.
4. At this point, we should see a list of requests, the top one being the actual site. That one is our focus, because it contains the identity data we can use from Python and BeautifulSoup to scrape the site: right-click the site request (the top one), hover over copy, and then choose copy as cURL ...
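
The copied cURL command contains, among other things, a Cookie header and a User-Agent header. A rough sketch of how those two values could be reused with requests follows. The function names and the placeholder cookie names are hypothetical; which cookies a given site really needs has to be read from the copied command.

```python
def parse_cookie_header(cookie_header):
    """Turn the value of a Cookie request header ('a=1; b=2') into a dict."""
    cookies = {}
    for part in cookie_header.split(";"):
        # split on the first '=' only, cookie values may contain '=' themselves
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def fetch_with_browser_identity(url, cookie_header, user_agent):
    """Fetch url with the cookies and User-Agent copied from the browser."""
    # third-party package, imported here so the parsing helper has no dependencies
    import requests

    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    session.cookies.update(parse_cookie_header(cookie_header))
    return session.get(url)


# placeholder header value, not a real session
print(parse_cookie_header("sessionid=PASTE_VALUE; theme=dark"))
```

Calling fetch_with_browser_identity with the forum URL and the two pasted values returns a normal requests response whose text can be handed to BeautifulSoup. The copied cookies expire at some point, so the copy step has to be repeated now and then.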



What do you suggest here?

I look forward to hearing from you.
Wordpress - super toolkits a. http://wpgear.org/ :: und b. https://github.com/miziomon/awesome-wordpress :: Awesome WordPress: A curated list of amazingly awesome WordPress resources and awesome python things https://github.com/vinta/awesome-python
#2
Using Selenium is usually the simplest way.
urllib, mechanize and cookiejar are older tools that I don't use anymore.
Requests has taken over their task in a better way.

It's important to inspect the website to see what is going on when you try to log in.
Requests also supports some commonly used methods under Authentication.

As an example, this is how I would log in to this site.
import requests
from bs4 import BeautifulSoup
 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
 
params = {
    "username": "your_username",
    "password": "xxxxxxx",
    "remember": "yes",
    "submit": "Login",
    "action": "do_login",
}
 
with requests.Session() as s:
    s.post('https://python-forum.io/member.php?action=login', headers=headers, data=params)  # form fields go in the POST body
    # logged in! session cookies saved for future requests
    response = s.get('https://python-forum.io/index.php')
    # cookies sent automatically!
    soup = BeautifulSoup(response.content, 'lxml')
    welcome = soup.find('span', class_="welcome").text
    print(welcome)
Output:
Welcome back, snippsat. You last visited: Today, 01:40 PM Log Out
I could not have known these params without inspecting the login form first.
#3
Hello dear snippsat,

first of all, many thanks for the reply. I am glad to hear from you.

I want to log in to WordPress, more precisely to the support forums: https://login.wordpress.org/?locale=en_US

Compare the login link and the markup of the login page:

<a class="ab-item" href="https://login.wordpress.org/?locale=en_US">Log In</a>


https://login.wordpress.org/?locale=en_US
<body class="wp-core-ui login js route-root">
<script type="text/javascript">document.body.className = document.body.className.replace('no-js','js');</script>
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-P24PF4B" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
	
<div id="login">
	<h1><a href="https://wordpress.org/" title="WordPress.org" tabindex="-1">WordPress.org Login</a></h1>
<p class="intro">Log in to your WordPress.org account to contribute to WordPress, get help in the support forum, or rate and review themes and plugins.</p>


		<form name="loginform" id="loginform" action="https://login.wordpress.org/wp-login.php" method="post" data-submit-ready="true">
			
			<p class="login-username">
				<label for="user_login">Username or Email Address</label>
				<input type="text" name="log" id="user_login" class="input" value="" size="20">
			</p>
			<p class="login-password">
				<label for="user_pass">Password</label>
				<input type="password" name="pwd" id="user_pass" class="input" value="" size="20">
			</p>
			
			<p class="login-remember"><label><input name="rememberme" type="checkbox" id="rememberme" value="forever"> Remember Me</label></p>
			<p class="login-submit">
				<input type="submit" name="wp-submit" id="wp-submit" class="button button-primary" value="Log In">
				<input type="hidden" name="redirect_to" value="https://wordpress.org/support/plugin/wp-job-manager/">
			</p>
			
		<input type="hidden" name="_reCaptcha_v3_token" value="03AGdBq25itmMwr7dEGxc4MkXQ5bm55D9x2OHMwxe7r5Vn8L7Mjwi4l4WC3MdBJ86HKzKf3x33be1BsN3ZlnCWEXJaPXLhbIxQk2SUhpidOwIqU0eNK-dWYqFvNfFdherkBIJvvem8j7P6gdO7Z-A11vd8JUrcgPi16N2ZQXo2fCIP8gDxxlm-Uc81-wq9e2a_ovTPFz3V85-vQL0mDrLc_pdWUvNOW2HAmgbIz01TzGxanypi9ouSxdexqttMipcXO1_VxZpdsaRgOfGUHs7v79xctNQn396J9eeL7sktFQzq-2rLofxqGoR6b1NGJh9uO_By6dnfsuNAPE99PaMaL9T8H_8PvhdBxpUlJBg8wITG7_cKNhHB1zqZFFVVSsdXwLmN8Xiz-CBWA9BgL1Nk0QeXeTtTA0i14d903JYEoha3ZDTpIZKLBZR2mTYofxK76eETgTLUqO2L"></form>
<p id="nav">
	<a href="https://login.wordpress.org/lostpassword" title="Password Lost and Found">Lost password?</a> &nbsp; • &nbsp;
	<a href="https://login.wordpress.org/register" title="Create an account">Create an account</a>
</p>

<script type="text/javascript">
setTimeout( function() {
	try {
		d = document.getElementById( 'user_login' );
		d.focus();
		d.select();
	} catch( e ){}
}, 200 );
</script>


	</div>

	<div class="language-switcher">
		<form id="language-switcher" action="" method="GET">
							<input type="hidden" name="redirect_to" value="https://wordpress.org/support/plugin/wp-job-manager/">
						<label for="language-switcher-locales">
				<span aria-hidden="true" class="dashicons dashicons-translation"></span>
				<span class="screen-reader-text">Select the language:</span>
			</label>
			<select id="language-switcher-locales" name="locale">
				<option value="fa_AF">(فارسی (افغانستان</option><option value="gax">Afaan Oromoo</option><option value="af">Afrikaans</option><option value="so_SO">Afsoomaali</option><option value="arg">Aragonés</option><option value="frp">Arpitan</option><option value="ast">Asturianu</option><option value="ibo">Asụsụ Igbo</option><option value="az_TR">Azərbaycan Türkcəsi</option><option value="az">Azərbaycan dili</option><option value="id_ID">Bahasa Indonesia</option><option value="ms_MY">Bahasa Melayu</option><option value="jv_ID">Basa Jawa</option><option value="su_ID">Basa Sunda</option><option value="bs_BA">Bosanski</option><option value="bre">Brezhoneg</option><option value="ca">Català</option><option value="bal">Català (Balear)</option><option value="ceb">Cebuano</option><option value="sna">ChiShona</option><option value="pcd">Ch’ti</option><option value="co">Corsu</option><option value="me_ME">Crnogorski jezik</option><option value="cy">Cymraeg</option><option value="da_DK">Dansk</option><option value="de_DE">Deutsch</option><option value="de_CH">Deutsch (Schweiz)</option><option value="de_CH_informal">Deutsch (Schweiz, Du)</option><option value="de_DE_formal">Deutsch (Sie)</option><option value="de_AT">Deutsch (Österreich)</option><option value="dsb">Dolnoserbšćina</option><option value="et">Eesti</option><option value="en_US" selected="selected">English</option><option value="en_AU">English (Australia)</option><option value="en_CA">English (Canada)</option><option value="en_NZ">English (New Zealand)</option><option value="art_xpirate">English (Pirate)</option><option value="en_ZA">English (South Africa)</option><option value="en_GB">English (UK)</option><option value="es_ES">Español</option><option value="es_AR">Español de Argentina</option><option value="es_CL">Español de Chile</option><option value="es_CO">Español de Colombia</option><option value="es_CR">Español de Costa Rica</option><option value="es_GT">Español de Guatemala</option><option value="es_HN">Español 
de Honduras</option><option value="es_MX">Español de México</option><option value="es_PE">Español de Perú</option><option value="es_PR">Español de Puerto Rico</option><option value="es_DO">Español de República Dominicana</option><option value="es_UY">Español de Uruguay</option><option value="es_VE">Español de Venezuela</option><option value="eo">Esperanto</option><option value="eu">Euskara</option><option value="ewe">Eʋegbe</option><option value="fr_FR">Français</option><option value="fr_BE">Français de Belgique</option><option value="fr_CA">Français du Canada</option><option value="fur">Friulian</option><option value="fy">Frysk</option><option value="fo">Føroyskt</option><option value="ga">Gaelige</option><option value="gl_ES">Galego</option><option value="gd">Gàidhlig</option><option value="hau">Harshen Hausa</option><option value="hsb">Hornjoserbšćina</option><option value="hr">Hrvatski</option><option value="ido">Ido</option><option value="kin">Ikinyarwanda</option><option value="it_IT">Italiano</option><option value="kal">Kalaallisut</option><option value="cor">Kernewek</option><option value="sw">Kiswahili</option><option value="mfe">Kreol Morisien</option><option value="hat">Kreyol ayisyen</option><option value="kmr">Kurdî</option><option value="lv">Latviešu valoda</option><option value="lt_LT">Lietuvių kalba</option><option value="li">Limburgs</option><option value="lmo">Lombardo</option><option value="lb_LU">Lëtzebuergesch</option><option value="lij">Lìgure</option><option value="hu_HU">Magyar</option><option value="mg_MG">Malagasy</option><option value="mlt">Malti</option><option value="nl_NL">Nederlands</option><option value="nl_BE">Nederlands (België)</option><option value="nl_NL_formal">Nederlands (Formeel)</option><option value="lin">Ngala</option><option value="pcm">Nigerian Pidgin</option><option value="nb_NO">Norsk bokmål</option><option value="nn_NO">Norsk nynorsk</option><option value="oci">Occitan</option><option 
value="lug">Oluganda</option><option value="uz_UZ">O‘zbekcha</option><option value="pap_AW">Papiamento</option><option value="pap_CW">Papiamentu</option><option value="pl_PL">Polski</option><option value="pt_PT">Português</option><option value="pt_PT_ao90">Português (AO90)</option><option value="pt_AO">Português de Angola</option><option value="pt_BR">Português do Brasil</option><option value="fuc">Pulaar</option><option value="sq_XK">Për Kosovën Shqip</option><option value="kaa">Qaraqalpaq tili</option><option value="tah">Reo Tahiti</option><option value="ro_RO">Română</option><option value="roh">Rumantsch</option><option value="rhg">Ruáinga</option><option value="srd">Sardu</option><option value="sq">Shqip</option><option value="ssw">SiSwati</option><option value="scn">Sicilianu</option><option value="sk_SK">Slovenčina</option><option value="sl_SI">Slovenščina</option><option value="fi">Suomi</option><option value="sv_SE">Svenska</option><option value="syr">Syriac</option><option value="tl">Tagalog</option><option value="kab">Taqbaylit</option><option value="mri">Te Reo Māori</option><option value="vi">Tiếng Việt</option><option value="twd">Twents</option><option value="tuk">Türkmençe</option><option value="tr_TR">Türkçe</option><option value="wol">Wolof</option><option value="yor">Yorùbá</option><option value="xho">isiXhosa</option><option value="zul">isiZulu</option><option value="is_IS">Íslenska</option><option value="cs_CZ">Čeština</option><option value="szl">Ślōnskŏ gŏdka</option><option value="el">Ελληνικά</option><option value="bel">Беларуская мова</option><option value="bg_BG">Български</option><option value="os">Ирон</option><option value="kir">Кыргызча</option><option value="mk_MK">Македонски јазик</option><option value="mn">Монгол</option><option value="ru_RU">Русский</option><option value="sah">Сахалыы</option><option value="sr_RS">Српски језик</option><option value="tt_RU">Татар теле</option><option value="tg">Тоҷикӣ</option><option 
value="uk">Українська</option><option value="kk">Қазақ тілі</option><option value="hy">Հայերեն</option><option value="he_IL">עִבְרִית</option><option value="ug_CN">ئۇيغۇرچە</option><option value="ur">اردو</option><option value="arq">الدارجة الجزايرية</option><option value="ar">العربية</option><option value="ary">العربية المغربية</option><option value="bcc">بلوچی مکرانی</option><option value="skr">سرائیکی</option><option value="snd">سنڌي</option><option value="fa_IR">فارسی</option><option value="ckb">كوردی‎</option><option value="haz">هزاره گی</option><option value="ps">پښتو</option><option value="azb">گؤنئی آذربایجان</option><option value="dv">ދިވެހި</option><option value="nqo">ߒߞߏ</option><option value="ne_NP">नेपाली</option><option value="brx">बोडो‎</option><option value="sa_IN">भारतम्</option><option value="bho">भोजपुरी</option><option value="mr">मराठी</option><option value="mai">मैथिली</option><option value="hi_IN">हिन्दी</option><option value="as">অসমীয়া</option><option value="bn_BD">বাংলা</option><option value="bn_IN">বাংলা (ভারত)</option><option value="pa_IN">ਪੰਜਾਬੀ</option><option value="gu">ગુજરાતી</option><option value="ory">ଓଡ଼ିଆ</option><option value="ta_IN">தமிழ்</option><option value="ta_LK">தமிழ்</option><option value="te">తెలుగు</option><option value="kn">ಕನ್ನಡ</option><option value="ml_IN">മലയാളം</option><option value="si_LK">සිංහල</option><option value="th">ไทย</option><option value="lo">ພາສາລາວ</option><option value="bo">བོད་ཡིག</option><option value="dzo">རྫོང་ཁ</option><option value="my_MM">ဗမာစာ</option><option value="ka_GE">ქართული</option><option value="tir">ትግርኛ</option><option value="am">አማርኛ</option><option value="km">ភាសាខ្មែរ</option><option value="tzm">ⵜⴰⵎⴰⵣⵉⵖⵜ</option><option value="zh_SG">中文</option><option value="ja">日本語</option><option value="zh_CN">简体中文</option><option value="zh_TW">繁體中文</option><option value="zh_HK">香港中文版	</option><option value="ko_KR">한국어</option><option value="art_xemoji">??? (Emoji)</option>			</select>
		</form>
	</div>
	<script>
		var switcherForm  = document.getElementById( 'language-switcher' );
		var localesSelect = document.getElementById( 'language-switcher-locales' );
		localesSelect.addEventListener( 'change', function() {
			switcherForm.submit()
		} );
	</script>

<div><div class="grecaptcha-badge" data-style="bottomright" style="width: 256px; height: 60px; display: block; transition: right 0.3s ease 0s; position: fixed; bottom: 14px; right: -186px; box-shadow: gray 0px 0px 5px; border-radius: 2px; overflow: hidden;"><div class="grecaptcha-logo"><iframe src="https://www.google.com/recaptcha/api2/anchor?ar=1&amp;k=6LckXrgUAAAAANrzcMN7iy_WxvmMcseaaRW-YFts&amp;co=aHR0cHM6Ly9sb2dpbi53b3JkcHJlc3Mub3JnOjQ0Mw..&amp;hl=de&amp;v=oqtdXEs9TE9ZUAIhXNz5JBt_&amp;size=invisible&amp;cb=hj49xha7yz3z" width="256" height="60" role="presentation" name="a-1ctg1295s9ze" frameborder="0" scrolling="no" sandbox="allow-forms allow-popups allow-same-origin allow-scripts allow-top-navigation allow-modals allow-popups-to-escape-sandbox"></iframe></div><div class="grecaptcha-error"></div><textarea id="g-recaptcha-response-100000" name="g-recaptcha-response" class="g-recaptcha-response" style="width: 250px; height: 40px; border: 1px solid rgb(193, 193, 193); margin: 10px 25px; padding: 0px; resize: none; display: none;"></textarea></div><iframe style="display: none;"></iframe></div></body>
Dear snippsat, I will now try to apply these findings to the code you gave:

import requests
from bs4 import BeautifulSoup
  
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
  
params = {
    "username": "your_username",
    "password": "xxxxxxx",
    "remember": "yes",
    "submit": "Login",
    "action": "do_login",
}
  
with requests.Session() as s:
    s.post('https://python-forum.io/member.php?action=login', headers=headers, params=params)
    # logged in! session cookies saved for future requests
    response = s.get('https://login.wordpress.org/?locale=en_US')
    # cookies sent automatically!
    soup = BeautifulSoup(response.content, 'lxml')
    welcome = soup.find('span', class_="welcome").text
    print(welcome)
This would also be an option:

wp_login = 'http://ip/wordpress/wp-login.php'
wp_admin = 'http://ip/wordpress/wp-admin/'
username = 'admin'
password = 'admin'

with requests.Session() as s:
    headers1 = { 'Cookie':'wordpress_test_cookie=WP Cookie check' }
    datas={ 
        'log':username, 'pwd':password, 'wp-submit':'Log In', 
        'redirect_to':wp_admin, 'testcookie':'1'  
    }
    s.post(wp_login, headers=headers1, data=datas)
    resp = s.get(wp_admin)
    print(resp.text)
#4
It is best to try Selenium for that login, as the form uses a Google reCAPTCHA v3 token when logging in.
If the site thinks it is dealing with a real browser, the login may get through.
Here is a test setup you can look at.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
#options.add_argument('--disable-gpu')
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
browser.get("https://login.wordpress.org/?locale=en_US")
time.sleep(2)
user_name = browser.find_element_by_css_selector('#user_login')
user_name.send_keys("Test_user")
password = browser.find_element_by_css_selector('#user_pass')
password.send_keys("123456")
time.sleep(5)
submit = browser.find_elements_by_css_selector('#wp-submit')[0]
submit.click()

# Example send page source to BeautifulSoup or selenium for parse
soup = BeautifulSoup(browser.page_source, 'lxml')
use_bs4 = soup.find('title')
print(use_bs4.text)
#print('*' * 25)
#use_sel = browser.find_elements_by_css_selector('div > div._1vC4OE')
#print(use_sel[0].text)
With this, the script tries to log in as Test_user. I of course have no valid user/pass, so I could not test that a valid user works.
#5
Hello dear snippsat, good day,

first of all, many thanks for the help and for the ideas on using Selenium for the login process. This is very helpful.

Now I can go on and combine this with the text-mining steps I have in mind:

import requests
from bs4 import BeautifulSoup as BS

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # this page needs a 'User-Agent' header

url = 'https://wordpress.org/plugins/html5-responsive-faq/{}/'

for page in range(1, 3):
    print('\n--- PAGE:', page, '---\n')
    
    # read page with list of posts
    r = session.get(url.format(page))

    soup = BS(r.text, 'html.parser')
    
    all_uls = soup.find('li', class_="bbp-body").find_all('ul')
    
    for number, ul in enumerate(all_uls, 1):
        
        print('\n--- post:', number, '---\n')
        
        a = ul.find('a')
        if a:
            post_url = a['href']
            post_title = a.text
            
            print('text:', post_url)
            print('href:', post_title)
            print('---------')
            
            # read page with post content
            r = session.get(post_url)
            
            sub_soup = BS(r.text, 'html.parser')
            
            post_content = sub_soup.find(class_='bbp-topic-content').get_text(strip=True, separator='\n')
            print(post_content)
Many thanks for your kind help. It is much appreciated, you have helped me a lot.

Have a great day :)

Yours, apollo
#6
Hello dear snippsat,

after a closer look at the prerequisites I finally got to test the script. Note: I had to install wheel first.

Here is the full story. After installing wheel I got back the following:

Windows PowerShell
PS C:\WINDOWS\system32> pip install -U selenium
Requirement already up-to-date: selenium in c:\program files\python37\lib\site-packages (3.141.0)
Requirement already satisfied, skipping upgrade: urllib3 in c:\program files\python37\lib\site-packages (from selenium) (1.25.8)
Could not build wheels for selenium, since package 'wheel' is not installed.
Could not build wheels for urllib3, since package 'wheel' is not installed.
WARNING: You are using pip version 20.1; however, version 20.1.1 is available.
You should consider upgrading via the 'c:\program files\python37\python.exe -m pip install --upgrade pip' command.
PS C:\WINDOWS\system32> pip install wheel
Collecting wheel
  Downloading wheel-0.34.2-py2.py3-none-any.whl (26 kB)
Installing collected packages: wheel
Successfully installed wheel-0.34.2
WARNING: You are using pip version 20.1; however, version 20.1.1 is available.
You should consider upgrading via the 'c:\program files\python37\python.exe -m pip install --upgrade pip' command.
PS C:\WINDOWS\system32>
Then I test-ran the Selenium code from above and got back the following:


Traceback (most recent call last):
  File "C:\Users\Kasper\AppData\Local\Temp\atom_script_tempfiles\23970790-b56b-11ea-bc7c-ab8702b78510", line 12, in <module>
    browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
  File "C:\Program Files\Python37\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 73, in __init__
    self.service.start()
  File "C:\Program Files\Python37\lib\site-packages\selenium\webdriver\common\service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver.exe' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
[Finished in 0.746s]
i guess that i now have to set the paths of the chromedriver in the code.
#7
(Jun-23-2020, 04:08 PM)apollo Wrote: i guess that i now have to set the paths of the chromedriver in the code.
Yes, you can of course not use my path ;)
chromedriver.exe needs to be in a folder that is listed in the PATH environment variable.
Alternatively, putting chromedriver.exe in the same folder as the script you run (and passing no path) also works.
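
Both locations mentioned above can be checked from Python before Selenium is started. The following is only a sketch; the helper name locate_driver and its parameters are made up for this example.

```python
import os
import shutil


def locate_driver(name="chromedriver", script_dir="."):
    """Return the path of the driver executable, or None if it is not found.

    The folders listed in PATH are searched first, then script_dir.
    """
    found = shutil.which(name)
    if found:
        return found
    for candidate in (name, name + ".exe"):
        path = os.path.join(script_dir, candidate)
        if os.path.isfile(path):
            return os.path.abspath(path)
    return None


driver_path = locate_driver()
if driver_path is None:
    print("chromedriver not found: add its folder to PATH or put it next to the script")
else:
    print("using", driver_path)
```

With the Selenium version used in this thread, the result could then be passed on as webdriver.Chrome(executable_path=driver_path, options=options).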
#8
Hello dear snippsat,

many, many thanks! You're just awesome :)

By the way, you're a very cool Pythonista ;)


First of all, I am almost there. I have fixed the code according to your advice and tips. That was great. Now the login works like a charm.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
#options.add_argument('--disable-gpu')
browser = webdriver.Chrome(executable_path=r'C:\chrome\chromedriver.exe', options=options)
#--| Parse or automation
browser.get("https://login.wordpress.org/?locale=en_US")
time.sleep(2)
user_name = browser.find_element_by_css_selector('#user_login')
user_name.send_keys("the username ")
password = browser.find_element_by_css_selector('#user_pass')
password.send_keys("the pass")
time.sleep(5)
submit = browser.find_elements_by_css_selector('#wp-submit')[0]
submit.click()

# Example send page source to BeautifulSoup or selenium for parse
soup = BeautifulSoup(browser.page_source, 'lxml')
use_bs4 = soup.find('title')
print(use_bs4.text)
#print('*' * 25)
#use_sel = browser.find_elements_by_css_selector('div > div._1vC4OE')
#print(use_sel[0].text)
But when I add the parser part, for example like this, I run into issues.


import requests
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
#options.add_argument('--disable-gpu')
browser = webdriver.Chrome(executable_path=r'C:\chrome\chromedriver.exe', options=options)
#--| Parse or automation
browser.get("https://login.wordpress.org/?locale=en_US")
time.sleep(2)
user_name = browser.find_element_by_css_selector('#user_login')
user_name.send_keys("the user name ")
password = browser.find_element_by_css_selector('#user_pass')
password.send_keys("the pass")
time.sleep(5)
submit = browser.find_elements_by_css_selector('#wp-submit')[0]
submit.click()

# Example send page source to BeautifulSoup or selenium for parse
soup = BeautifulSoup(browser.page_source, 'lxml')
use_bs4 = soup.find('title')
print(use_bs4.text)
#print('*' * 25)
#use_sel = browser.find_elements_by_css_selector('div > div._1vC4OE')
#print(use_sel[0].text)
##session.headers.update({'User-Agent': 'Mozilla/5.0'}) # this page needs header 'User-Agent`


session = requests.Session()
url = 'https://wordpress.org/plugins/html5-responsive-faq/{}/'

for page in range(1, 3):
    print('\n--- PAGE:', page, '---\n')

    # read page with list of posts
    r = session.get(url.format(page))

    soup = BS(r.text, 'html.parser')

    all_uls = soup.find('li', class_="bbp-body").find_all('ul')

    for number, ul in enumerate(all_uls, 1):

        print('\n--- post:', number, '---\n')

        a = ul.find('a')
        if a:
            post_url = a['href']
            post_title = a.text

            print('text:', post_url)
            print('href:', post_title)
            print('---------')

            # read page with post content
            r = session.get(post_url)

            sub_soup = BS(r.text, 'html.parser')

            post_content = sub_soup.find(class_='bbp-topic-content').get_text(strip=True, separator='\n')
            print(post_content)
Then I get back this log:

Blog Tool, Publishing Platform, and CMS � WordPress.org
--- PAGE: 1 ---
Traceback (most recent call last):
  File "C:\Users\Kasper\Documents\_f_s_j\_mk_\_dev_\bs\___wp_forums_login_and_parsing.py", line 47, in <module>
    all_uls = soup.find('li', class_="bbp-body").find_all('ul')
AttributeError: 'NoneType' object has no attribute 'find_all'
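
A possible reading of this traceback: soup.find('li', class_="bbp-body") returned None, so the page that requests fetched contains no such element. Two things differ from the standalone parser shown further down. The URL pattern here is /plugins/.../{}/ instead of /support/plugin/.../page/{}/, and the fresh requests.Session() knows nothing about the login that happened inside the Selenium browser. For the second point, one common approach is to copy the browser cookies into the session after submit.click(). The sketch below assumes that; the function names are hypothetical, while get_cookies() and execute_script() are existing WebDriver methods.

```python
def selenium_cookies_to_dict(selenium_cookies):
    """Convert the list of cookie dicts from browser.get_cookies() to {name: value}.

    Domain and path information is dropped in this simple form.
    """
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def share_login(browser, session):
    """Copy cookies and User-Agent of a logged-in browser onto a requests.Session."""
    session.cookies.update(selenium_cookies_to_dict(browser.get_cookies()))
    user_agent = browser.execute_script("return navigator.userAgent")
    session.headers.update({"User-Agent": user_agent})
    return session


# shape of the data Selenium returns, with made-up values
example = [{"name": "sid", "value": "abc", "domain": ".example.com", "path": "/"}]
print(selenium_cookies_to_dict(example))
```

In the combined script this would be called as share_login(browser, session) right after the login and before the for loop. It is also worth checking the result of soup.find(...) for None before calling find_all on it, so that a missing element produces a clear message instead of an AttributeError. Another option is to skip requests entirely and keep using browser.get(url) with browser.page_source, as the login part already does.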
Well, I guess this is only a minor issue. I have to look into the problems mentioned in the log.

Below is the parser part as a standalone script, without login. (Note: without login I only get a limited amount of data, not the full set, so I need the login part as well.) You have helped me a lot with that. The login part works perfectly. The combination does not work at the moment, but we're almost there :)

The parser part as a standalone script:

#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup as BS

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # this page needs a 'User-Agent' header

url = 'https://wordpress.org/support/plugin/advanced-gutenberg/page/{}/'

for page in range(1, 3):
    print('\n--- PAGE:', page, '---\n')

    # read page with list of posts
    r = session.get(url.format(page))

    soup = BS(r.text, 'html.parser')

    all_uls = soup.find('li', class_="bbp-body").find_all('ul')

    for number, ul in enumerate(all_uls, 1):

        print('\n--- post:', number, '---\n')

        a = ul.find('a')
        if a:
            post_url = a['href']
            post_title = a.text

            print('text:', post_url)
            print('href:', post_title)
            print('---------')

            # read page with post content
            r = session.get(post_url)

            sub_soup = BS(r.text, 'html.parser')

            post_content = sub_soup.find(class_='bbp-topic-content').get_text(strip=True, separator='\n')
            print(post_content)
This gives me back the following interesting results ;)

--- post: 1 ---

text: https://wordpress.org/support/topic/advanced-button-with-icon/
href: Advanced Button with Icon?
---------
is it not possible to create a button with a font awesome icon to left / right?

--- post: 2 ---

text: https://wordpress.org/support/topic/expand-collapse-block/
href: Expand / Collapse block?
---------
At the very bottom I have an expandable requirements.
Do you have a better block? I would like to use one of yours if poss.
The page I need help with:

--- post: 3 ---

text: https://wordpress.org/support/topic/login-form-not-formatting-correctly/
href: Login Form Not Formatting Correctly
---------
Getting some weird formatting with the email & password fields running on outside the form.
Tried on two different sites.
Thanks

..... [,,,,,] ....

--- post: 22 ---

text: https://wordpress.org/support/topic/settings-import-export-2/
href: Settings Import & Export
---------
Traceback (most recent call last):
  File "C:\Users\Kasper\Documents\_f_s_j\_mk_\_dev_\bs\____wp_forum_parser_without_login.py", line 43, in <module>
    print(post_content)
  File "C:\Program Files\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f642' in position 95: character maps to <undefined>
[Finished in 14.129s]
So, in conclusion: we're almost there, only minor corrections are needed. Later tonight I will have a closer look.
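
The `UnicodeEncodeError` in the traceback above comes from `print()`: the console encoding here is cp1252, which has no mapping for the emoji `'\U0001f642'` contained in the post text. One possible workaround, sketched under the assumption that losing such characters in the console output is acceptable, is to replace unencodable characters before printing; `printable` is a hypothetical helper name:

```python
import sys


def printable(text, encoding=None):
    """Return `text` with characters the target encoding cannot represent replaced by '?'."""
    encoding = encoding or sys.stdout.encoding or 'utf-8'
    return text.encode(encoding, errors='replace').decode(encoding)


# cp1252 cannot encode the emoji, so it is replaced instead of raising an error.
print(printable('Thanks \U0001f642', encoding='cp1252'))
```

In the script this would be `print(printable(post_content))`. On Python 3.7 and later, calling `sys.stdout.reconfigure(encoding='utf-8')` once at the start is another option; whether the emoji is then displayed correctly depends on the terminal.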


If you have any hints, I will be happy to hear from you.

Regards,

apollo ;)
Wordpress - super toolkits a. http://wpgear.org/ :: und b. https://github.com/miziomon/awesome-wordpress :: Awesome WordPress: A curated list of amazingly awesome WordPress resources and awesome python things https://github.com/vinta/awesome-python
#9
You cannot create a new session with Requests; you must work with the session that Selenium creates.
Delete all code from line 35 to the end and do a simpler test first to make sure the login works.
Find content that is only visible when logged in, and try to parse it.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
browser.get("https://login.wordpress.org/?locale=en_US")
time.sleep(2)
user_name = browser.find_element_by_css_selector('#user_login')
user_name.send_keys("Test_user")
password = browser.find_element_by_css_selector('#user_pass')
password.send_keys("123456")
time.sleep(3)
submit = browser.find_elements_by_css_selector('#wp-submit')[0]
submit.click()

# Example send page source to BeautifulSoup or selenium for parse
soup = BeautifulSoup(browser.page_source, 'lxml')
use_bs4 = soup.find('title')
print(use_bs4.text)

# Example if i use this forum as example
post_thread = browser.find_elements_by_xpath('//*[@id="panel"]/div[2]/div/ul[1]/li[1]/a')
print(post_thread[0].text)
Output:
Welcome back, snippsat. You last visited: Today, 07:50 PM Log Out User CP
The output shown here is something I could only get if the login works.
You must find your own content to check for, e.g. with XPath, CSS selectors, find, etc.
You cannot use the last lines of my code as an example,
and of course I cannot look at the content after login, as I am not a member of the WordPress forum.
#10
Hi there, dear snippsat, many, many thanks for all you did!


I am currently trying to get ahead with the script. Note: this works perfectly. You can check it with the following combination:


I have created a test account for demo-testing this:

login: pluginfan
pass: testpasswd123
The issues with the session are quite strange. I am musing about an appropriate solution:


Well, that said, I think we could potentially log in with Selenium and then, once that's complete, pass the session cookie to the parser, add it to the Requests session, and parse that way. What do you say, how do you like this idea?
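
A minimal sketch of that idea, assuming the Selenium login has succeeded and that the site accepts the same cookies from a second HTTP client (not verified here for wordpress.org). Selenium's `driver.get_cookies()` returns a list of dicts with `name` and `value` keys; `cookies_for_requests` is a hypothetical helper that turns such a list into a plain mapping:

```python
def cookies_for_requests(selenium_cookies):
    """Map Selenium cookie dicts (as returned by driver.get_cookies()) to {name: value}."""
    return {cookie['name']: cookie['value'] for cookie in selenium_cookies}


# Possible use after the Selenium login (browser, session and url as in the scripts above):
#
#     session = requests.Session()
#     session.headers.update({'User-Agent': 'Mozilla/5.0'})
#     session.cookies.update(cookies_for_requests(browser.get_cookies()))
#     r = session.get(url.format(page))
```

This sketch drops the domain and path of each cookie, so every cookie would be sent to every host the session talks to. If that matters, `session.cookies.set(name, value, domain=..., path=...)` keeps that information.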


By the way: the parser code above yields conversations from the WP forums, which I would like to save in a CSV file.
There are smart ways to have the "results" contain

author:
text:
url: (if one is given in the thread)
etc.

We can do this with the Requests library or with urllib. With Requests, what I am interested in is how to do the CSV writing, so that the values are stored in columns (from A to D, for example).
I saw that there are a number of threads on this topic, but none of the solutions I have read through worked for this specific situation.


result_stats[query] = soup.find(id="the wordpress-comunication-data").string

with open ('the wordpress-comunication-data.csv', 'w', newline='') as fout:
    cw = csv.writer(fout)
    for q in .....:
        cw.writerow([q, result_stats[q]])
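
For the CSV part, a minimal sketch with the standard library's `csv.DictWriter`, assuming each parsed post has already been collected as a dict. The column names `author`, `text` and `url` follow the list above; `write_posts`, `FIELDS` and the sample row are hypothetical:

```python
import csv
import io

FIELDS = ['author', 'text', 'url']


def write_posts(posts, fout):
    """Write a header row and one CSV row per post dict; missing keys become empty cells."""
    writer = csv.DictWriter(fout, fieldnames=FIELDS)
    writer.writeheader()
    for post in posts:
        writer.writerow({key: post.get(key, '') for key in FIELDS})


# Demo with an in-memory file and made-up data. For a real file, use
# open('posts.csv', 'w', newline='', encoding='utf-8') instead of io.StringIO().
buffer = io.StringIO()
write_posts([{'author': 'someone', 'text': 'hello, world', 'url': ''}], buffer)
print(buffer.getvalue())
```

Each dict key ends up in its own column, and text that contains commas is quoted by the writer automatically.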
But besides this CSV export, the first and most important thing is to get the

a. login part and the
b. parser part

working and up and running as a single script that works with one session.

I am working on this solution :)



Update: by the way, I have seen some people who ran into very similar issues:

Selenium login looks like it works but then BeautifulSoup output shows login page

https://stackoverflow.com/questions/5238...n-pag?rq=1


question:
Quote: I'm trying to write a script in Python to grab all of the rosters in my fantasy football league, but you have to login to ESPN first. The code I have is below. It looks like it's working when it runs -- i.e., I see the login page come up, I see it login, and the page closes. Then when I print the soup I don't see any team rosters. I saved the soup output as an html file to see what it is and it's just the page redirecting me to login again. Do I load the page through BS4 before I try to login?


answer:
Quote:Requests you're executing via Selenium in Browser has nothing common with request you're making via urllib. Just pass username/password to your HTTP-request to request data as authorized user (no Selenium required) or use pure Selenium job (note that Selenium has enough built-in methods for page scraping) – Andersson Sep 18 '18 at 10:20


To be more specific, cookies are not shared between Selenium and urllib2 so when you make the request using urllib2 the webserver won't be able to detect your previous login. As others have stated just stick with Selenium for all HTTP requests and you should be OK

answer 2:
Quote:You are using selenium to login and then using urllib2 to open the URL which uses another session to goto the site. Get the source from selenium webdriver and then use it with BeautifulSoup and it should work.



answer 3:
Quote: Try this instead of urllib2:
driver.get("http://games.espn.com/ffl/leaguerosters?leagueId=11111")
# query the website and return the html to the variable 'page'
page = driver.page_source
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
Well, I guess I need to dig deeper into all that stuff :)