Login and access website - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Login and access website (/thread-8065.html)
Login and access website - mariolopes - Feb-05-2018

Hi. I want to log in to this website: https://www.cpcdi.pt/Account/Login. I have the credentials and I tried the following code:

```python
import requests
import sys
import urllib.request
import re

URL = 'https://www.cpcdi.pt/Account/Login'

def main():
    # Start a session so we can have persistent cookies
    #session = requests.session(config={'verbose': sys.stderr})

    # This is the form data that the page sends when logging in
    login_data = {
        'CodCliente': 'mycode',
        'UserName': 'myuser',
        'Password': 'mypass',
        'submit': 'submit',
    }

    # Authenticate
    r = session.post(URL, data=login_data)

    # Try accessing a page that requires you to be logged in
    URL = 'https://www.cpcdi.pt/Produtos/Referencia?referencia=8745B006AA'
    r = session.get(URL)
```

I got no error on this code, but I'm not sure that it works. I need to download pictures from an address like URL='https://www.cpcdi.pt/Produtos/Referencia?referencia=8745B006AA'. For that I tried this code:

```python
req = urllib.request.Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
htmltext = urllib.request.urlopen(req).read()
if htmltext is None:
    print("nada")
else:
    regex = '<img src="/(.+?)"'
    pattern = re.compile(regex)
    imagem = pattern.findall(str(htmltext))
    print(imagem[0])
    # download the image
    import urllib.request
    urllib.request.urlretrieve(URL + imagem[0], "local-filename.jpg")
```

But no luck. Any help on this matter? Thank you.

RE: Login and access website - metulburr - Feb-05-2018

If you get an IndexError on the first element, then the list is empty, which means imagem = [] and your pattern is not matching. Don't use regex to parse HTML; use BeautifulSoup, as that is what it was made for.
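A minimal sketch of the BeautifulSoup approach suggested above. The HTML snippet and image path here are placeholders standing in for the real product page, whose markup I haven't inspected:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Placeholder HTML standing in for the product page's real markup
html = '<html><body><img src="/Imagens/8745B006AA.jpg" alt="produto"></body></html>'

soup = BeautifulSoup(html, 'html.parser')
base = 'https://www.cpcdi.pt'
# Collect every <img> src, resolved against the site root
imagens = [urljoin(base, img['src']) for img in soup.find_all('img', src=True)]
print(imagens)
```

Unlike the regex above, find_all('img', src=True) copes with attribute order, quoting style, and whitespace variations in the markup.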
https://python-forum.io/Thread-Web-Scraping-part-1
https://python-forum.io/Thread-Web-scraping-part-2

You can download the image via requests:

```python
import shutil
import requests

url = 'http://example.com/img.png'
response = requests.get(url, stream=True)
with open('img.png', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
```

RE: Login and access website - mariolopes - Feb-05-2018

Thank you for your help, but I think the problem is that I can't log in with Python. My code:

```python
import requests
import sys
import urllib.request
import re
import shutil

URL = 'https://www.cpcdi.pt/Account/Login'

def main():
    # Start a session so we can have persistent cookies
    #session = requests.session(config={'verbose': sys.stderr})

    # This is the form data that the page sends when logging in
    login_data = {
        'CodCliente': 'xxx',
        'UserName': 'xx',
        'Password': 'xxxx',
        'submit': 'submit',
    }

    # Authenticate
    r = session.post(URL, data=login_data)

    # Try accessing a page that requires you to be logged in
    URL = 'https://www.cpcdi.pt/Produtos/Referencia?referencia=8745B006AA'
    r = session.get(URL)
    print(r.text)
```

The last instruction returns nothing, so I can't get past the login. I have the credentials and I can access the site with a browser, but not with Python. I think the submit button has no name and I don't know how to handle that. Is this the best way to log in with Python? Regards

RE: Login and access website - metulburr - Feb-06-2018

Looking at the code, the blue "Entrar" button executes JavaScript, which means you are going to need Selenium to click the button. The tutorial links I gave above for scraping websites have Selenium examples in them too.

RE: Login and access website - snippsat - Feb-06-2018

Here is a setup you can look at. If you are new to this, always start with a webdriver that lets you see what's going on, like Chrome; you can go headless later. I use CSS selectors like #UserName to find the user-name field. The script fills out all the fields and pushes the log-in button, so if everything were correct I would be logged in.
```python
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Activate PhantomJS (headless) and deactivate Chrome to not load a browser window
#browser = webdriver.PhantomJS()
browser = webdriver.Chrome()
url = 'https://www.cpcdi.pt/Account/Login'
browser.get(url)
user_name = browser.find_element_by_css_selector('#CodCliente')
user_name.send_keys("Foo")
password = browser.find_element_by_css_selector('#UserName')
password.send_keys("Bar")
password = browser.find_element_by_css_selector('#Password')
password.send_keys("xxxxxxxxx")
time.sleep(5)
submit = browser.find_elements_by_css_selector('button.btn')
submit[0].click()
time.sleep(5)
# Give the source code to BeautifulSoup
soup = BeautifulSoup(browser.page_source, 'lxml')
title = soup.select('head > title')
print(title[0].text)
```

RE: Login and access website - mariolopes - Feb-06-2018

Great help, many thanks for that. But there is some strange behaviour with my code. Please look at it:

```python
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from urllib.request import urlopen

# Activate PhantomJS (headless) and deactivate Chrome to not load a browser window
#browser = webdriver.PhantomJS()
browser = webdriver.Firefox()
url = 'https://www.cpcdi.pt/Account/Login'
browser.get(url)
user_name = browser.find_element_by_css_selector('#CodCliente')
user_name.send_keys("111")
password = browser.find_element_by_css_selector('#UserName')
password.send_keys("1222")
password = browser.find_element_by_css_selector('#Password')
password.send_keys("11222")
time.sleep(5)
submit = browser.find_elements_by_css_selector('button.btn')
submit[0].click()
time.sleep(5)
# Give the source code to BeautifulSoup
goUrl = "https://www.cpcdi.pt/Produtos/Referencia?referencia=8745B006AA"
browser.get(goUrl)
soup = BeautifulSoup(urlopen(goUrl), 'lxml')
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])
```

This works fine, and the browser goes to the link, but I get the links from the first page, not from the page where the browser is.
What am I doing wrong?

RE: Login and access website - metulburr - Feb-06-2018

You don't need urlopen() if you're using requests or Selenium, especially when you are using Selenium precisely to get past JavaScript that urllib cannot. You are basically opening a new page from Python, separate from Selenium, and passing that to bs4 instead of the source from Selenium.

Quote:
soup = BeautifulSoup(urlopen(goUrl), 'lxml')

Do this instead:

soup = BeautifulSoup(browser.page_source, 'lxml')

PS: you might need some form of delay between browser.get() and reading the page source, but I'm unsure until you check it out.

RE: Login and access website - mariolopes - Feb-07-2018

Simply perfect. Thank you all.
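On the delay question in the PS: a fixed time.sleep is either too short or wastes time. Selenium ships WebDriverWait for polling until a condition holds; the stdlib-only sketch below illustrates the same poll-until-ready idea without needing a browser (wait_for and the toy condition are illustrative helpers, not Selenium API):

```python
import time

def wait_for(condition, timeout=10.0, poll=0.5):
    # Poll `condition` until it returns a truthy value or the timeout expires
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1f s" % timeout)

# Toy condition standing in for "the page has finished rendering"
start = time.monotonic()
page_ready = lambda: time.monotonic() - start > 0.2
print(wait_for(page_ready, timeout=2.0, poll=0.05))  # prints True once ready
```

With real Selenium code, the equivalent would be a WebDriverWait on the element you need rather than sleeping for a fixed five seconds.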