Python Forum
Parsing html page and working with checkbox (on a captcha)
#1
Hello, I am new to Python programming and am currently trying to write my very first Python program.
I am using Python 3.9 (beautifulsoup4 4.9.3, certifi 2020.12.5, chardet 4.0.0, idna 2.10, lxml 4.6.2, pip 20.3.3, requests 2.25.1, selenium 3.141.0, setuptools 49.2.1, soupsieve 2.1, urllib3 1.26.2), PyCharm 2020.3.2 (Community Edition) and Google Chrome on Windows 8.1.
Two questions came up:
1. I want to analyze my photo portfolio, which consists of N pages of 100 photos each.
from urllib.request import urlopen

url = 'https://www.shutterstock.com/ru/g/Ivanov+Oleg'
page = urlopen(url)
data = page.read().decode()  # bytes -> str, UTF-8 by default
print(data)
I then plan to parse data, but the problem is that the downloaded HTML of every page (?page=1, ?page=2, etc.) is only complete up to the 20th photo: photos 1-20 appear as <img class="z_h_9d80b z_h_2f2f0">, whereas photos 21-100 appear as <img class="z_h_9d80b"> and the pictures are not displayed (see portfolio.jpg), although in the original page every photo in the portfolio has class="z_h_9d80b z_h_2f2f0".
Additionally, I can say that comparing the decoded page and the saved one ("Save as") shows significant differences (see comparison.jpg and diff.zip)
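
To make the difference described above concrete, here is a minimal sketch of how the class values on the <img> tags could be counted using only the standard library. The names ImgClassCounter and count_img_classes are made up for this illustration, and the sample HTML is a tiny invented stand-in for the real page, not actual Shutterstock markup.

```python
from collections import Counter
from html.parser import HTMLParser


class ImgClassCounter(HTMLParser):
    """Count how often each class attribute value appears on <img> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a missing class becomes "".
        if tag == "img":
            self.counts[dict(attrs).get("class") or ""] += 1


def count_img_classes(html_text):
    """Return a Counter mapping class attribute values to their frequency."""
    parser = ImgClassCounter()
    parser.feed(html_text)
    return parser.counts


# Invented sample that mimics the pattern described in this post.
sample = (
    '<img class="z_h_9d80b z_h_2f2f0" src="a.jpg">'
    '<img class="z_h_9d80b z_h_2f2f0" src="b.jpg">'
    '<img class="z_h_9d80b">'
)
counts = count_img_classes(sample)
for class_value, n in counts.items():
    print(repr(class_value), n)
```

For the real page, the string returned by page.read().decode() would be passed to count_img_classes instead of the sample. Running it on both the downloaded page and the file from "Save as" should show whether the two really differ in how many images carry the second class.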

Interestingly, parsing the page with BeautifulSoup gives the same result: the markup of the 21st and all subsequent images is distorted.
import requests
from bs4 import BeautifulSoup
. . .
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup)
Another variant, with browser-like request headers, gives the same result:
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0(Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
}
url_get = requests.get(url, headers=headers)
parser = url_get.content
soup = BeautifulSoup(parser, "html.parser")
print(soup)
The question is: what could be causing this behavior of decode(), and how can I retrieve these pages so that they match what the browser shows?

2. In fact, I could extract much more information if I logged into my account, but for this I need to enter a username and password and pass a captcha. Everything is clear to me except for the captcha, whose checkbox I need to tick (something like checkbutton.select()).
from urllib.request import urlopen
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
. . .
driver = webdriver.Chrome()
driver.get(url)

elem_name = driver.find_element_by_name("username")
elem_name.send_keys("user_х@gmail.com")

elem_pass = driver.find_element_by_name("password")
elem_pass.send_keys("qwerty")

# doesn't work - elem_capt = driver.find_element_by_class_name("recaptcha-checkbox-checkmark")
# doesn't work - elem_capt = driver.find_element_by_class_name("rc-anchor-center-item rc-anchor-checkbox-holder")
elem_capt = driver.find_elements_by_class_name("recaptcha-checkbox goog-inline-block recaptcha-checkbox-unchecked rc-anchor-checkbox")

# then it is not clear how to select the checkbox and check it!

elem_name.send_keys(Keys.RETURN)
. . .
(see captcha.jpg)
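
One thing that may contribute to the failed lookups: as far as I understand, a class-name locator expects a single class name, so a space-separated list of classes is not matched as written. Below is a minimal sketch of a helper that turns such a list into a CSS selector, which could then be passed to driver.find_element_by_css_selector. The helper name classes_to_css_selector is made up for this illustration and is not part of Selenium.

```python
def classes_to_css_selector(class_attr, tag=""):
    """Turn a space-separated class attribute into a CSS selector.

    "a b c" becomes ".a.b.c", which matches elements carrying all three classes.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("class_attr must contain at least one class name")
    return tag + "".join("." + name for name in names)


selector = classes_to_css_selector(
    "recaptcha-checkbox goog-inline-block recaptcha-checkbox-unchecked rc-anchor-checkbox"
)
print(selector)
```

This only addresses the selector syntax. The lookup with a single class name also fails, so something else is probably involved as well: the reCAPTCHA checkbox is rendered inside an iframe, so the driver would presumably have to switch into that frame (driver.switch_to.frame) before any locator can find it. That part is not shown here.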

Please help me select the checkbox and check it!
Messages In This Thread
Parsing html page and working with checkbox (on a captcha) - by straannick - Jan-15-2021, 09:35 AM
