Python Forum
Parsing html page and working with checkbox (on a captcha)
#1
Hello, I am new to Python programming and am currently trying to write my very first Python program.
I am using Python 3.9 (beautifulsoup4 4.9.3, certifi 2020.12.5, chardet 4.0.0, idna 2.10, lxml 4.6.2, pip 20.3.3, requests 2.25.1, selenium 3.141.0, setuptools 49.2.1, soupsieve 2.1, urllib3 1.26.2), PyCharm 2020.3.2 (Community Edition) and Google Chrome on Windows 8.1.
Two questions came up:
1. I want to analyze my photo portfolio, which consists of N pages of 100 photos each.
from urllib.request import urlopen

url = 'https://www.shutterstock.com/ru/g/Ivanov+Oleg'
page = urlopen(url)
data = page.read().decode()  # decode() defaults to UTF-8
print(data)
I plan to parse data afterwards, but the problem is that the decoded HTML of any of the pages (?page=1, ?page=2, etc.) looks correct only up to the 20th photo: photos 1-20 have <img class="z_h_9d80b z_h_2f2f0">, while photos 21-100 have <img class="z_h_9d80b"> and are not displayed (see portfolio.jpg), although in the original page all photos in the portfolio have class="z_h_9d80b z_h_2f2f0".
In addition, comparing the decoded page with the one saved from the browser ("Save as") shows significant differences (see comparison.jpg and diff.zip).

Interestingly, parsing the page with BeautifulSoup gives the same result: the markup of the 21st and all subsequent images is altered.
import requests
from bs4 import BeautifulSoup
. . .
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup)
Another variant gives the same result:
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
    'cache-control': 'max-age=0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0(Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
}
url_get = requests.get(url, headers=headers)
parser = url_get.content
soup = BeautifulSoup(parser, "html.parser")
print(soup)
The question is: what could be causing this decode() behavior, and how can I decode these pages correctly?
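One way to quantify the difference described above is to count how many <img> tags carry each class combination, once in the fetched HTML and once in the page saved from the browser. The sketch below uses only the standard library. The sample markup and the helper name count_img_classes are made up for illustration; only the two class strings come from the post. If the counts differ between the two sources, the extra class is probably added after the initial HTML is delivered (for example by JavaScript, which neither urlopen nor requests runs). That is an assumption to check, not a confirmed diagnosis.

```python
from collections import Counter
from html.parser import HTMLParser


class ImgClassCounter(HTMLParser):
    """Count <img> tags grouped by the exact value of their class attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; class may be missing
            self.counts[dict(attrs).get("class") or ""] += 1


def count_img_classes(html):
    """Hypothetical helper: Counter of class strings over all <img> tags."""
    parser = ImgClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up sample standing in for the downloaded page
sample = (
    '<img class="z_h_9d80b z_h_2f2f0" src="1.jpg">'
    '<img class="z_h_9d80b z_h_2f2f0" src="2.jpg">'
    '<img class="z_h_9d80b" src="21.jpg">'
)
counts = count_img_classes(sample)
print(counts)
```

Running the same function on data from urlopen and on the text of the "Save as" file would show whether the two sources really differ in how many images carry the second class.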

2. In fact, I could extract much more information if I logged into my account, but for this I need to enter a username and password and pass a captcha. Everything is clear to me except the captcha, where I need to tick a checkbox (something like checkbutton.select()).
from urllib.request import urlopen
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
. . .
driver = webdriver.Chrome()
driver.get(url)

elem_name = driver.find_element_by_name("username")
elem_name.send_keys("user_х@gmail.com")

elem_pass = driver.find_element_by_name("password")
elem_pass.send_keys("qwerty")

# doesn't work: elem_capt = driver.find_element_by_class_name("recaptcha-checkbox-checkmark")
# doesn't work: elem_capt = driver.find_element_by_class_name("rc-anchor-center-item rc-anchor-checkbox-holder")
elem_capt = driver.find_elements_by_class_name("recaptcha-checkbox goog-inline-block recaptcha-checkbox-unchecked rc-anchor-checkbox")

# then it is not clear how to select the checkbox and check it!

elem_name.send_keys(Keys.RETURN)
. . .
(see captcha.jpg)

Please help me to select a checkbox and check it!
Reply
#2
I have another problem with BeautifulSoup. It gets the full web page of my photo portfolio but alters the page content: only the markup of the first 20 of 100 images stays the same as on the original page.
import requests
from bs4 import BeautifulSoup
. . .
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup)
Interestingly, if I fetch the page another way, the same problem appears after the first 20 images.
page = urlopen(url)
data = page.read().decode()
print(data)
See details here
buran wrote Jan-15-2021, 07:17 PM:
Please don't hijack threads. I moved your post to your original thread.
Reply
#3
(Jan-15-2021, 10:04 AM)straannick Wrote: buran wrote 8 hours ago:
Please don't hijack threads. I moved your post to your original thread.
buran, would it be hijacking if I did not write "See details here"?
Reply
#4
(Jan-16-2021, 04:08 AM)straannick Wrote: buran, would it be hijacking if I did not write "See details here"?
Yes, it is. Your problem is not related to the problem in the other thread. Even with "see details here", people may start commenting on your problem instead of the OP's problem, and that will ruin the flow of the discussion.

If you don't get a response after a certain amount of time, you can bump the thread.
Also, note that we get "access denied" when we try to follow the links to the files on your Google Drive.

Also, I think bypassing a captcha is not as trivial as clicking a checkbox from a script (otherwise it would defeat the purpose of a captcha). I edited the thread title so that it's clear from the start that you are dealing with a captcha.
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#5
(Jan-16-2021, 07:27 AM)buran Wrote:
(Jan-16-2021, 04:08 AM)straannick Wrote: buran, would it be hijacking if I did not write "See details here"?
Even with "see details here", people may start commenting on your problem instead of the OP's problem, and that will ruin the flow of the discussion.

I suppose you meant "Even without "see details here"".
In my case the captcha has only a checkbox.
10 days have passed with no answers.
Can you please explain how to "bump"?
Reply
#6
(Jan-26-2021, 04:43 PM)straannick Wrote: I suppose you meant "Even without "see details here"".
No, I meant what I wrote: you posted in another user's thread and included a link to your original thread. Even WITH "see details here", people may start answering in the other user's thread.
With your new post in this thread, you have already bumped it.

Reply
#7
Take a look at this post.
It can work similarly here: when a new window or frame is shown, you must switch to it with browser.switch_to.frame(0).
Then you can find the element and click it, and throw in a time.sleep(3) to behave more like a human.
I have not tested this on an "I'm not a robot" captcha yet; these are just some thoughts on what I would try first.
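The switch, find, click, pause sequence could look roughly like the sketch below. It is untested against a real captcha, and clicking the box does not guarantee the captcha accepts it (an image challenge may still appear). The function name click_inside_frame and both selectors are hypothetical and have to be read from the actual page; the string "css selector" is the value of Selenium's By.CSS_SELECTOR, spelled out so the sketch does not depend on Selenium at import time.

```python
import time

# Same value as selenium.webdriver.common.by.By.CSS_SELECTOR
CSS_SELECTOR = "css selector"


def click_inside_frame(driver, frame_css, target_css, pause=3.0):
    """Switch into the frame matched by frame_css, click target_css, switch back.

    driver is expected to be a Selenium WebDriver. frame_css and target_css
    are hypothetical selectors that must be taken from the real page.
    """
    frame = driver.find_element(CSS_SELECTOR, frame_css)
    driver.switch_to.frame(frame)  # elements inside the frame become findable
    try:
        time.sleep(pause)  # fixed pause before clicking, as suggested above
        driver.find_element(CSS_SELECTOR, target_css).click()
    finally:
        driver.switch_to.default_content()  # back to the top-level document


# Usage with a real browser (not run here; selectors are assumptions):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get(url)
# click_inside_frame(driver, "iframe", ".recaptcha-checkbox-checkmark")
```

Passing the frame element instead of the index 0 avoids depending on the order of frames in the page; switch_to.frame accepts either.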
Reply
#8
(Jan-26-2021, 06:49 PM)buran Wrote:
(Jan-26-2021, 04:43 PM)straannick Wrote: I suppose you meant "Even without "see details here"".
No, I meant what I wrote: you posted in another user's thread and included a link to your original thread. Even WITH "see details here", people may start answering in the other user's thread.

In this case, if you meant "with "see details here"", the word "even" is superfluous.
But then you didn't answer my question: what if I don't write "see details here", will it still be hijacking?

Additionally, could you please explain why I am not receiving notifications of replies to my message? I set "Subscribe and receive email notification of new replies" for this message.
Reply
#9
(Jan-26-2021, 08:44 PM)snippsat Wrote: Take a look at this post.
It can work similarly here: when a new window or frame is shown, you must switch to it with browser.switch_to.frame(0).
Then you can find the element and click it, and throw in a time.sleep(3) to behave more like a human.
I have not tested this on an "I'm not a robot" captcha yet; these are just some thoughts on what I would try first.

Thank you.
About problem 2.
I suppose switch_to.frame(0) is not needed, because I found the other elements successfully. The main problem for me is how to find the captcha checkbox and what exactly I should look for: "recaptcha-checkbox-checkmark" or something else (see captcha.jpg)?
Here are the different attempts with their errors:
v1. elem_capt = driver.find_element_by_id("recaptcha-checkbox-checkmark")
elem_capt.select()
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="recaptcha-checkbox-checkmark"]"}

v2. elem_capt = driver.find_element_by_class_name("recaptcha-checkbox-checkmark")
elem_capt.select()
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".recaptcha-checkbox-checkmark"}

v3. elem_capt = driver.find_elements_by_css_selector("recaptcha-checkbox-checkmark")
elem_capt.select()
AttributeError: 'list' object has no attribute 'select'

v3 looks preferable, but how do I choose the element I need from this list in order to select it?
Unfortunately, print(len(elem_capt)) prints 0.

So if I write
elem_capt = driver.find_elements_by_css_selector("recaptcha-checkbox-checkmark")[0]
I get IndexError: list index out of range
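Two details in these attempts are separate from the captcha itself. A CSS selector without a leading dot is a tag-name selector, so "recaptcha-checkbox-checkmark" looks for a <recaptcha-checkbox-checkmark> element rather than a class. And find_elements_* returns a list, which is simply empty when nothing matches. A minimal sketch of both points follows; the helper names classes_to_css and first_or_none are hypothetical. Even a correct selector finds nothing if the element sits in a frame the driver has not switched to.

```python
def classes_to_css(class_attr):
    """Turn an HTML class attribute such as 'a b c' into the CSS selector '.a.b.c'."""
    names = class_attr.split()
    if not names:
        raise ValueError("no class names given")
    return "".join("." + name for name in names)


def first_or_none(elements):
    """Return the first item of a find_elements_* result, or None if it is empty."""
    return elements[0] if elements else None


selector = classes_to_css("recaptcha-checkbox goog-inline-block")
print(selector)  # .recaptcha-checkbox.goog-inline-block

# With a Selenium driver (not run here):
# matches = driver.find_elements_by_css_selector(selector)
# checkbox = first_or_none(matches)
# if checkbox is not None:
#     checkbox.click()  # WebElement has click(); it has no select() method
```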

- - -
Additionally, I would like to note that problem 1 is still not resolved.
Reply
#10
(Jan-27-2021, 10:14 AM)straannick Wrote: But then you didn't answer my question: "what if I don't write "see details here", will it be hijacking?"
(Jan-16-2021, 07:27 AM)buran Wrote:
Quote: buran, would it be hijacking if I did not write "See details here"?
Yes, it is. Your problem is not related to the problem in the other thread. Even with "see details here", people may start commenting on your problem instead of the OP's problem, and that will ruin the flow of the discussion.

Please, don't play stupid or your presence here will not last much longer.

Reply

