Parsing html page and working with checkbox (on a captcha)

straannick · (This post was last modified: Jan-27-2021, 12:30 PM by straannick.)

Hello, I am new to Python programming and am currently trying to write my very first Python program.
I am using Python 3.9 (beautifulsoup4 4.9.3, certifi 2020.12.5, chardet 4.0.0, idna 2.10, lxml 4.6.2, pip 20.3.3, requests 2.25.1, selenium 3.141.0, setuptools 49.2.1, soupsieve 2.1, urllib3 1.26.2), PyCharm 2020.3.2 (Community Edition) and Google Chrome on Windows 8.1.
Two questions came up:
1. I want to analyze my photo portfolio, which consists of N pages of 100 photos each.

from urllib.request import urlopen

url = 'https://www.shutterstock.com/ru/g/Ivanov+Oleg'
page = urlopen(url)
data = page.read().decode()  # bytes -> str, UTF-8 by default
print(data)

I plan to parse data next, but the problem is that on every page (?page=1, ?page=2, etc.) the decoded output (decode()) is only correct up to the 20th photo: photos 1-20 come through as <img class="z_h_9d80b z_h_2f2f0">, while photos 21-100 come through as <img class="z_h_9d80b"> and their pictures are not displayed (see portfolio.jpg), although in the original page all photos in the portfolio have class="z_h_9d80b z_h_2f2f0".
Additionally, I can say that comparing the decoded page and the saved one ("Save as") shows significant differences (see comparison.jpg and diff.zip)
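
To put numbers on the difference described above, one option is to tally the class attribute of every <img> in the fetched HTML and in the browser-saved copy, then compare the two tallies. The following is a minimal sketch using only the standard library; the names ImgClassCounter and count_img_classes and the sample markup are illustrative assumptions, not part of the original post.

```python
from collections import Counter
from html.parser import HTMLParser


class ImgClassCounter(HTMLParser):
    """Tally the class attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; a missing class becomes ""
            self.counts[dict(attrs).get("class") or ""] += 1


def count_img_classes(html):
    """Return a Counter mapping each <img> class string to its frequency."""
    parser = ImgClassCounter()
    parser.feed(html)
    return parser.counts


# Made-up sample standing in for a fetched page:
# two fully classed thumbnails and one with the shorter class list.
sample = (
    '<img class="z_h_9d80b z_h_2f2f0" src="a.jpg">'
    '<img class="z_h_9d80b z_h_2f2f0" src="b.jpg">'
    '<img class="z_h_9d80b">'
)
for classes, n in count_img_classes(sample).items():
    print(f"{n:3d}  {classes!r}")
```

Running count_img_classes on data from urlopen and on the text of the "Save as" file would show how many images carry each class combination in each version. If the fetched copy consistently has fewer fully classed images, one possible but unconfirmed explanation is that the browser adds the second class with JavaScript after the page loads, which urlopen and requests do not execute.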

Interestingly, parsing the page with BeautifulSoup gives the same result: the markup of the 21st and subsequent images is distorted.

import requests
from bs4 import BeautifulSoup
. . .
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup)

Another variant gives the same result:

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0(Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
}
url_get = requests.get(url, headers=headers)
parser = url_get.content
soup = BeautifulSoup(parser, "html.parser")
print(soup)

The question is: what could be causing this decode() behavior, and how do I decode these pages correctly?

2. In fact, I could extract much more information if I logged into my account, but for this I need to enter a username / password and pass a captcha. Everything is clear to me except the captcha, whose checkbox I need to tick (something like checkbutton.select()).

from urllib.request import urlopen
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
. . .
driver = webdriver.Chrome()
driver.get(url)

elem_name = driver.find_element_by_name("username")
elem_name.send_keys("user_x@gmail.com")

elem_pass = driver.find_element_by_name("password")
elem_pass.send_keys("qwerty")

# doesn't work - elem_capt = driver.find_element_by_class_name("recaptcha-checkbox-checkmark")
# doesn't work - elem_capt = driver.find_element_by_class_name("rc-anchor-center-item rc-anchor-checkbox-holder")
elem_capt = driver.find_elements_by_class_name("recaptcha-checkbox goog-inline-block recaptcha-checkbox-unchecked rc-anchor-checkbox")

# then it is not clear how to select the checkbox and check it!

elem_name.send_keys(Keys.RETURN)
. . .

(see captcha.jpg)

Please help me to select a checkbox and check it!

straannick · (This post was last modified: Jan-15-2021, 07:17 PM by buran.)

I have another problem with BeautifulSoup. It gets the full webpage of my photo portfolio but distorts the page content: only the first 20 of 100 images keep the same markup as on the original page.

import requests
from bs4 import BeautifulSoup
. . .
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup)

Interestingly, if I fetch the page another way, the same problem appears: only the first 20 images are intact.

page = urlopen(url)
data = page.read().decode()
print(data)

See details here

buran wrote Jan-15-2021, 07:17 PM:
Please, don't hijack threads. I moved your post to your original thread.

straannick · Jan-16-2021, 04:08 AM

(Jan-15-2021, 10:04 AM)straannick Wrote: buran wrote 8 hours ago:
Please, don't hijack threads. I moved your post to your original thread.

buran, will it be hijacking if I will not write "See details here" ?

**buran** · (This post was last modified: Jan-16-2021, 07:27 AM by buran.)

(Jan-16-2021, 04:08 AM)straannick Wrote: buran, will it be hijacking if I will not write "See details here" ?

yes, it is. Your problem is not related to the problem in the other thread. Even with "see details here" people may start commenting on your problem instead of the OP's problem, and it will ruin the flow of discussion.

If you don't get a response after a certain amount of time, you can bump it.
Also, note that we get access denied when we try to follow the links to files on your google drive.

Also, I think bypassing a captcha is not as trivial as clicking a check-box from a script (otherwise it would defeat the purpose of the captcha). I edited the thread title so that it's clear from the start that you are dealing with a captcha.

straannick · Jan-26-2021, 04:43 PM

(Jan-16-2021, 07:27 AM)buran Wrote:
(Jan-16-2021, 04:08 AM)straannick Wrote: buran, will it be hijacking if I will not write "See details here" ?
Even with "see details here" people may start commenting on your problem instead of the OP's problem, and it will ruin the flow of discussion.

I suppose, you meant "Even without "see details here""
In my case the captcha has only a checkbox.
10 days have passed with no answers.
Can you explain, please, how to "bump"?

**buran** · Jan-26-2021, 06:49 PM

(Jan-26-2021, 04:43 PM)straannick Wrote: I suppose, you meant "Even without "see details here""

No, I mean what I wrote: you posted in another user's thread and included a link to your original thread. Even WITH "see details here" people may start answering in the other user's thread.
With the new post in this thread you have already bumped it.

***snippsat*** · (This post was last modified: Jan-26-2021, 08:45 PM by snippsat.)

Take a look at this post.
It can work similarly here: when a new window (frame) shows up, you must switch to it with browser.switch_to.frame(0).
Then you can find the element, click it, and throw in a time.sleep(3) to behave like a human.
I have not tested this on an "I'm not a robot" captcha yet; these are just some thoughts on what I would try first.
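
The switch, find, click, pause sequence suggested above can be sketched as one small function. This is only an illustration under assumptions: click_in_first_matching_frame is a hypothetical helper name, it relies on the driver's find_element(by, value), switch_to.frame and switch_to.default_content methods, and it has not been tried against a real captcha, which may still present an image challenge after the click.

```python
import time


def click_in_first_matching_frame(driver, frame_css, target_css, pause=3.0):
    """Switch into the iframe matched by frame_css, click target_css inside
    it, then return to the top-level document."""
    # "css selector" is the string behind selenium's By.CSS_SELECTOR.
    frame = driver.find_element("css selector", frame_css)
    driver.switch_to.frame(frame)
    try:
        driver.find_element("css selector", target_css).click()
        time.sleep(pause)  # the human-like pause suggested in the post
    finally:
        # Always leave the frame so later lookups see the main page again.
        driver.switch_to.default_content()
```

With a real driver the call might look like click_in_first_matching_frame(driver, "iframe[src*='recaptcha']", ".recaptcha-checkbox-checkmark"). The iframe selector there is an assumption about the login page; the class name is the one quoted earlier in this thread.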

straannick · Jan-27-2021, 10:14 AM

(Jan-26-2021, 06:49 PM)buran Wrote:
(Jan-26-2021, 04:43 PM)straannick Wrote: I suppose, you meant "Even without "see details here""
No, I mean what I wrote: you posted in another user's thread and included a link to your original thread. Even WITH "see details here" people may start answering in the other user's thread.

In this case, if you meant "with "see details here"", the word “even” is superfluous.
But then you didn't answer my question: "what if I don't write "see details here", will it be hijacking?"

Additionally, could you please explain why I am not receiving notifications of replies to my message? I set “Subscribe and receive email notification of new replies” for this message.

straannick · (This post was last modified: Jan-27-2021, 11:27 AM by straannick.)

(Jan-26-2021, 08:44 PM)snippsat Wrote: Take a look at this post.
It can work similarly here: when a new window (frame) shows up, you must switch to it with browser.switch_to.frame(0).
Then you can find the element, click it, and throw in a time.sleep(3) to behave like a human.
I have not tested this on an "I'm not a robot" captcha yet; these are just some thoughts on what I would try first.

Thank you.
About problem 2.
I suppose switch_to.frame(0) is not needed, because I found the other elements successfully. The main problem for me is how to find the captcha checkbox and what I should look for: "recaptcha-checkbox-checkmark" or something else (see captcha.jpg)?
Here are the different attempts with their errors:
v1. elem_capt = driver.find_element_by_id("recaptcha-checkbox-checkmark")
elem_capt.select()
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="recaptcha-checkbox-checkmark"]"}

v2. elem_capt = driver.find_element_by_class_name("recaptcha-checkbox-checkmark")
elem_capt.select()
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".recaptcha-checkbox-checkmark"}

v3. elem_capt = driver.find_elements_by_css_selector("recaptcha-checkbox-checkmark")
elem_capt.select()
AttributeError: 'list' object has no attribute 'select'

v3 looks preferable, but how do I choose the element I need from this list in order to select it?
Unfortunately "print(len(elem_capt))" prints '0'.

So if I write
elem_capt = driver.find_elements_by_css_selector("recaptcha-checkbox-checkmark")[0]
I get IndexError: list index out of range
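
Two details in the attempts above are worth separating from the captcha itself. A CSS selector written as recaptcha-checkbox-checkmark with no leading dot matches elements whose tag name is recaptcha-checkbox-checkmark, which is why v3 returns an empty list. In addition, find_element_by_class_name expects a single class name, so a space-separated list of classes has to be rewritten as a CSS selector. A small sketch of that rewrite, where classes_to_css is a hypothetical helper name:

```python
def classes_to_css(class_attr, tag=""):
    """Turn a space-separated class attribute into a CSS selector that
    requires every listed class, e.g. "a b" -> ".a.b"."""
    names = class_attr.split()
    if not names:
        raise ValueError("no class names given")
    return tag + "".join("." + name for name in names)


print(classes_to_css("recaptcha-checkbox-checkmark"))
# .recaptcha-checkbox-checkmark
print(classes_to_css("rc-anchor-center-item rc-anchor-checkbox-holder", tag="div"))
# div.rc-anchor-center-item.rc-anchor-checkbox-holder
```

The resulting string can be passed to driver.find_element_by_css_selector. Selenium's WebElement has click() rather than select(), so ticking a box is elem.click(). If the selector is correct and the element is still not found, it may be inside an iframe, which would need driver.switch_to.frame first; that is an assumption about this page, not something verified here.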

- - -
Additionally, I would like to note that problem 1 is still not resolved.

**buran** · (This post was last modified: Jan-27-2021, 11:53 AM by buran.)

(Jan-27-2021, 10:14 AM)straannick Wrote: But then you didn't answer my question: "what if I don't write "see details here", will it be hijacking?"

(Jan-16-2021, 07:27 AM)buran Wrote:
Quote:buran, will it be hijacking if I will not write "See details here" ?
yes, it is. Your problem is not related to the problem in the other thread. Even with "see details here" people may start commenting on your problem instead of the OP's problem, and it will ruin the flow of discussion.

Please, don't play stupid or your presence here will not last much longer.
