Login and access website - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Login and access website (/thread-8065.html)
Login and access website - mariolopes - Feb-05-2018

Hi. I want to log in to this website: https://www.cpcdi.pt/Account/Login. I have the credentials and I tried the following code:

```python
import requests
import sys
import urllib.request
import re

URL = 'https://www.cpcdi.pt/Account/Login'

def main():
    # Start a session so we can have persistent cookies
    #session = requests.session(config={'verbose': sys.stderr})

    # This is the form data that the page sends when logging in
    login_data = {
        'CodCliente': 'mycode',
        'UserName': 'myuser',
        'Password': 'mypass',
        'submit': 'submit',
    }

    # Authenticate
    r = session.post(URL, data=login_data)

    # Try accessing a page that requires you to be logged in
    URL = 'https://www.cpcdi.pt/Produtos/Referencia?referencia=8745B006AA'
    r = session.get(URL)
```

I got no error on this code, but I'm not sure that it works. I need to download pictures from an address like URL='https://www.cpcdi.pt/Produtos/Referencia?referencia=8745B006AA'. For that I tried this code:

```python
req = urllib.request.Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
htmltext = urllib.request.urlopen(req).read()
if htmltext is None:
    print("nada")
else:
    regex = '<img src="/(.+?)"'
    pattern = re.compile(regex)
    imagem = pattern.findall(str(htmltext))
    print(imagem[0])
    # download the image
    import urllib.request
    urllib.request.urlretrieve(URL + imagem[0], "local-filename.jpg")
```

But no luck. Any help on this matter? Thank you.

RE: Login and access website - metulburr - Feb-05-2018

If you get an IndexError on the first element, then the list is empty, which means imagem = [] and your pattern is not matching. Don't use regex to parse HTML; use BeautifulSoup, as that is what it was made for.
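A minimal sketch of the BeautifulSoup approach suggested above. The HTML snippet and image path here are placeholders standing in for the real product page, whose markup I haven't inspected:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Placeholder HTML standing in for the product page's real markup
html = '<html><body><img src="/Imagens/8745B006AA.jpg" alt="produto"></body></html>'

soup = BeautifulSoup(html, 'html.parser')
base = 'https://www.cpcdi.pt'
# Collect every <img> src, resolved against the site root
imagens = [urljoin(base, img['src']) for img in soup.find_all('img', src=True)]
print(imagens)
```

Unlike the regex above, find_all('img', src=True) copes with attribute order, quoting style, and whitespace variations in the markup.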
https://python-forum.io/Thread-Web-Scraping-part-1
https://python-forum.io/Thread-Web-scraping-part-2

You can download the image via requests:

```python
import shutil
import requests

url = 'http://example.com/img.png'
response = requests.get(url, stream=True)
with open('img.png', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
```

RE: Login and access website - mariolopes - Feb-05-2018

Thank you for your help, but I think the problem is that I can't log in with Python. My code:

```python
import requests
import sys
import urllib.request
import re
import shutil

URL = 'https://www.cpcdi.pt/Account/Login'

def main():
    # Start a session so we can have persistent cookies
    #session = requests.session(config={'verbose': sys.stderr})

    # This is the form data that the page sends when logging in
    login_data = {
        'CodCliente': 'xxx',
        'UserName': 'xx',
        'Password': 'xxxx',
        'submit': 'submit',
    }

    # Authenticate
    r = session.post(URL, data=login_data)

    # Try accessing a page that requires you to be logged in
    URL = 'https://www.cpcdi.pt/Produtos/Referencia?referencia=8745B006AA'
    r = session.get(URL)
    print(r.text)
```

The last instruction returns nothing, so I can't get past the login. I have the credentials and I can access the site with a browser, but not with Python. I think the submit button has no name and I don't know how to handle that. Is this the best way to log in with Python? Regards

RE: Login and access website - metulburr - Feb-06-2018

Looking at the code, the blue "Entrar" button executes JavaScript, which means you are going to need Selenium to click the button. The tutorial links I gave above for scraping websites have Selenium examples in them too.

RE: Login and access website - snippsat - Feb-06-2018

Here is a setup you can look at. If you are new to this, always start with a webdriver that lets you see what's going on, like Chrome; you can go headless later. I use CSS selectors like #UserName to find the user-name field. The script fills out all the fields and pushes the log-in button, so if everything were correct I would be logged in.
```python
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Activate PhantomJS (headless) and deactivate Chrome to not load a browser window
#browser = webdriver.PhantomJS()
browser = webdriver.Chrome()
url = 'https://www.cpcdi.pt/Account/Login'
browser.get(url)
user_name = browser.find_element_by_css_selector('#CodCliente')
user_name.send_keys("Foo")
password = browser.find_element_by_css_selector('#UserName')
password.send_keys("Bar")
password = browser.find_element_by_css_selector('#Password')
password.send_keys("xxxxxxxxx")
time.sleep(5)
submit = browser.find_elements_by_css_selector('button.btn')
submit[0].click()
time.sleep(5)
# Give the source code to BeautifulSoup
soup = BeautifulSoup(browser.page_source, 'lxml')
title = soup.select('head > title')
print(title[0].text)
```

RE: Login and access website - mariolopes - Feb-06-2018

Great help, many thanks for that. But there is some strange behaviour with my code. Please look at it:

```python
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from urllib.request import urlopen

# Activate PhantomJS (headless) and deactivate Chrome to not load a browser window
#browser = webdriver.PhantomJS()
browser = webdriver.Firefox()
url = 'https://www.cpcdi.pt/Account/Login'
browser.get(url)
user_name = browser.find_element_by_css_selector('#CodCliente')
user_name.send_keys("111")
password = browser.find_element_by_css_selector('#UserName')
password.send_keys("1222")
password = browser.find_element_by_css_selector('#Password')
password.send_keys("11222")
time.sleep(5)
submit = browser.find_elements_by_css_selector('button.btn')
submit[0].click()
time.sleep(5)
# Give the source code to BeautifulSoup
goUrl = "https://www.cpcdi.pt/Produtos/Referencia?referencia=8745B006AA"
browser.get(goUrl)
soup = BeautifulSoup(urlopen(goUrl), 'lxml')
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])
```

This works fine, and the browser goes to the link, but I get the links from the first page, not from the page where the browser is.
What am I doing wrong?

RE: Login and access website - metulburr - Feb-06-2018

You don't need urlopen() if you're using requests or Selenium, especially when you are using Selenium precisely to get past JavaScript that urllib cannot. You are basically opening a new page from Python, separate from Selenium, and passing that to bs4 instead of the source from Selenium.

Quote:
soup = BeautifulSoup(urlopen(goUrl), 'lxml')

Do this instead:

soup = BeautifulSoup(browser.page_source, 'lxml')

PS: you might need some form of delay between browser.get() and reading the page source, but I'm unsure until you check it out.

RE: Login and access website - mariolopes - Feb-07-2018

Simply perfect. Thank you all.
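On the delay question in the PS: a fixed time.sleep is either too short or wastes time. Selenium ships WebDriverWait for polling until a condition holds; the stdlib-only sketch below illustrates the same poll-until-ready idea without needing a browser (wait_for and the toy condition are illustrative helpers, not Selenium API):

```python
import time

def wait_for(condition, timeout=10.0, poll=0.5):
    # Poll `condition` until it returns a truthy value or the timeout expires
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1f s" % timeout)

# Toy condition standing in for "the page has finished rendering"
start = time.monotonic()
page_ready = lambda: time.monotonic() - start > 0.2
print(wait_for(page_ready, timeout=2.0, poll=0.05))  # prints True once ready
```

With real Selenium code, the equivalent would be a WebDriverWait on the element you need rather than sleeping for a fixed five seconds.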