Jan-15-2021, 09:35 AM
(This post was last modified: Jan-27-2021, 12:30 PM by straannick.)
Hello, I am new to Python programming and am currently trying to write my very first Python program.
I am using Python 3.9 (beautifulsoup4 4.9.3, certifi 2020.12.5, chardet 4.0.0, idna 2.10, lxml 4.6.2, pip 20.3.3, requests 2.25.1, selenium 3.141.0, setuptools 49.2.1, soupsieve 2.1, urllib3 1.26.2), PyCharm 2020.3.2 (Community Edition) and Google Chrome on Windows 8.1.
Two questions came up:
1. I want to analyze my photo portfolio, which consists of N pages of 100 photos each.
    from urllib.request import urlopen

    url = 'https://www.shutterstock.com/ru/g/Ivanov+Oleg'
    page = urlopen(url)
    data = page.read().decode()
    print(data)

I then plan to parse data, but the problem is that the decoded HTML of any of the pages (?page=1, ?page=2, etc.) is only correct up to the 20th photo: photos 1-20 have <img class="z_h_9d80b z_h_2f2f0">, while photos 21-100 have <img class="z_h_9d80b"> and the pictures are not displayed (see portfolio.jpg), although in the original page all photos in the portfolio have class="z_h_9d80b z_h_2f2f0".
Additionally, comparing the decoded page with the one saved from the browser ("Save as") shows significant differences (see comparison.jpg and diff.zip).
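A comparison like this can be reproduced in Python with the standard difflib module. This is only a minimal sketch: the helper name summarize_diff and the two short HTML strings are made up for illustration and stand in for the fetched page and the "Save as" copy.

```python
import difflib


def summarize_diff(fetched_html, saved_html, max_lines=20):
    """Return up to max_lines lines of a unified diff between two HTML strings."""
    diff = difflib.unified_diff(
        fetched_html.splitlines(),
        saved_html.splitlines(),
        fromfile="fetched",
        tofile="saved",
        lineterm="",
    )
    return list(diff)[:max_lines]


# Made-up stand-ins for the downloaded page and the browser's "Save as" copy
fetched = '<img class="z_h_9d80b">\n<p>same</p>'
saved = '<img class="z_h_9d80b z_h_2f2f0">\n<p>same</p>'

diff_lines = summarize_diff(fetched, saved)
print("\n".join(diff_lines))
```

Lines starting with "-" appear only in the fetched text and lines starting with "+" only in the saved copy, which makes it easier to see where the two versions diverge.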
Interestingly, parsing the page with BeautifulSoup gives the same result: the markup of the 21st and all subsequent images is distorted.
    import requests
    from bs4 import BeautifulSoup
    . . .
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    print(soup)

Another variant gives the same result:
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate, sdch, br',
        'accept-language': 'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
        'cache-control': 'max-age=0',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0(Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    }
    url_get = requests.get(url, headers=headers)
    parser = url_get.content
    soup = BeautifulSoup(parser, "html.parser")
    print(soup)

The question is: what could be causing this decode() behavior, and how can I decode these pages correctly?
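To put a number on the symptom described above, the <img> tags in the downloaded HTML can be counted by their exact class attribute. The sketch below uses only the standard library. The class ImgClassCounter, the function count_img_classes and the sample HTML are made up for illustration; only the two class names come from the real page. It measures the discrepancy and does not explain it. One possible explanation, which is an assumption and not verified against this site, is that the browser completes the remaining images with JavaScript after the initial HTML arrives, which neither urlopen nor requests executes.

```python
from collections import Counter
from html.parser import HTMLParser


class ImgClassCounter(HTMLParser):
    """Count <img> tags grouped by their exact class attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; a missing class counts as ''
            class_value = dict(attrs).get("class") or ""
            self.counts[class_value] += 1


def count_img_classes(html):
    """Return a Counter mapping each class attribute value to its <img> count."""
    parser = ImgClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up sample standing in for the text returned by page.read().decode()
sample_html = (
    '<img class="z_h_9d80b z_h_2f2f0" src="a.jpg">' * 2
    + '<img class="z_h_9d80b">' * 3
)

counts = count_img_classes(sample_html)
for class_value, n in counts.most_common():
    print(f"{n:3d}  {class_value!r}")
```

Running the same function on the fetched page and on the copy saved from the browser would show whether the split is really 20 versus 80 on every page.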
2. In fact, I could extract much more information if I logged into my account, but for this I need to enter a username / password and pass a captcha. Everything is clear to me except for the captcha, whose checkbox I need to tick (something like checkbutton.select()).
    from urllib.request import urlopen
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    . . .
    driver = webdriver.Chrome()
    driver.get(url)
    elem_name = driver.find_element_by_name("username")
    elem_name.send_keys("user_х@gmail.com")
    elem_pass = driver.find_element_by_name("password")
    elem_pass.send_keys("qwerty")
    # doesn't work - elem_capt = driver.find_element_by_class_name("recaptcha-checkbox-checkmark")
    # doesn't work - elem_capt = driver.find_element_by_class_name("rc-anchor-center-item rc-anchor-checkbox-holder")
    elem_capt = driver.find_elements_by_class_name("recaptcha-checkbox goog-inline-block recaptcha-checkbox-unchecked rc-anchor-checkbox")
    # then it is not clear how to select the checkbox and check it!
    elem_name.send_keys(Keys.RETURN)
    . . .

(see captcha.jpg)
Please help me select the checkbox and check it!
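A note on the lookups in the snippet above: two of them pass several space-separated class names to a class-name locator, which expects a single class name. A CSS selector can express the same match. The helper below, classes_to_css, is a hypothetical name introduced only for this sketch; it builds the selector string and does not touch the browser. In Selenium 3 the result could then be passed to driver.find_element_by_css_selector(...). Separately, a reCAPTCHA checkbox is normally rendered inside an iframe, so a driver.switch_to.frame(...) call would probably be needed before any lookup can find it. That is an assumption here and has not been checked against this login page.

```python
def classes_to_css(class_attr, tag=""):
    """Turn a space-separated class attribute into a CSS selector.

    'a b c' becomes '.a.b.c'; with tag='div' it becomes 'div.a.b.c'.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("no class names given")
    return tag + "".join("." + name for name in names)


# Class string copied from the last lookup in the post
selector = classes_to_css(
    "recaptcha-checkbox goog-inline-block "
    "recaptcha-checkbox-unchecked rc-anchor-checkbox"
)
print(selector)
```

The selector matches only elements that carry all of the listed classes, which is what the compound class-name lookups were trying to express.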