(May-28-2022, 10:20 AM)Pavel_47 Wrote: Then I suppressed the install stuff from browser instantiation, i.e. browser = webdriver.Chrome(). This way it worked ... but the Chrome browser opens. Can it be avoided?
You should not do that; you have to set --headless (so no browser window is loaded). The code I posted does not open a browser, it runs headless.
(May-28-2022, 10:20 AM)Pavel_47 Wrote: Returning to the blocking issue ... if I understood you correctly, the selenium approach has a kind of blocking immunity?
Selenium automates real web browsers, so it acts like (and in fact is) a web browser; it is therefore not detected the way other scraping tools are.
Some sites also try to block Selenium; for those cases there are tools like undetected_chromedriver.
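As a small illustration of why plain HTTP libraries get flagged while Selenium usually does not (this is an extra example, not code from this thread): a default requests session announces itself as a script in its User-Agent header, which is one of the first things sites check.

```python
import requests

# A plain requests session ships a default User-Agent like "python-requests/2.x",
# which immediately identifies the traffic as coming from a script, not a browser.
# Selenium avoids this because it drives a real browser with real browser headers.
session = requests.Session()
print(session.headers["User-Agent"])
```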
Here is another setup that does not use Webdriver Manager.
# amazon_chrome.py
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

#--| Setup
options = Options()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
ser = Service(r"C:\cmder\bin\chromedriver.exe")
browser = webdriver.Chrome(service=ser, options=options)

#--| Parse or automation
url = "https://www.amazon.com/Advanced-Artificial-Intelligence-Robo-Justice-Georgios-ebook/dp/B0B1H2MZKX/ref=sr_1_1?keywords=9783030982058&qid=1653563461&sr=8-1"
browser.get(url)
title = browser.find_element(By.CSS_SELECTOR, '#productTitle')
print(title.text)

Running this only gets the title back; it does not open a browser window.
Output:
λ python amazon_chrome.py
Advanced Artificial Intelligence and Robo-Justice
(May-28-2022, 10:20 AM)Pavel_47 Wrote: Another question ... blocking problem aside, does using the BeautifulSoup approach allow us to find the title so easily by searching for "productTitle"?
Not as long as it gets detected and blocked by Amazon.
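To answer the selector part of the question: once you somehow have the HTML (for example browser.page_source from the Selenium setup above), BeautifulSoup finds the title just as easily. A minimal sketch, using a made-up stand-in snippet for Amazon's markup:

```python
from bs4 import BeautifulSoup

# Stand-in for Amazon's product page markup; the real page also wraps the
# title in an element with id="productTitle", which the CSS selector targets.
html = '<span id="productTitle"> Advanced Artificial Intelligence and Robo-Justice </span>'
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("#productTitle").get_text(strip=True))
```

The blocking problem is therefore about getting the HTML at all, not about parsing it.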
You should also check what rules Amazon has for web scraping.
Quote:Pretty much any e-commerce website tries blocking web scraping
services or any automated bots accessing their content.
There are two identifiers that websites use to check whether the requests being sent to their servers
originate from a genuine internet user or an automated bot.
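The quote cuts off before naming the identifiers, but the usual suspects are the request headers (especially User-Agent) and the originating IP address. A hedged sketch of the header side only, swapping in a browser-style User-Agent string (the exact value is just an illustrative example, not a guarantee against blocking):

```python
import requests

# Example browser-style User-Agent string (an assumption for illustration);
# sites compare this header against known bot signatures.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/101.0.0.0 Safari/537.36")

session = requests.Session()
session.headers.update({"User-Agent": BROWSER_UA})
# Requests sent from this session now carry a browser-like User-Agent instead
# of the default "python-requests/x.y" one. This only fools naive checks;
# the second identifier, your IP address, can still give you away.
print(session.headers["User-Agent"])
```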