Strange phenomena with Amazon_dot_com scraping

Pavel_47 · Jan-16-2021, 12:56 PM

Hello,
I observed a very strange phenomenon while scraping Amazon_dot_com: code that worked an hour earlier no longer works.
Any comments?
Thanks.

***snippsat*** · (This post was last modified: Jan-16-2021, 03:36 PM by snippsat.)

Did you read my post here?
Amazon has some scraping/bot protection, as they want most people to use their API.
There are Python packages like Amazon Simple Product API.
Some years ago I did some simple testing on Amazon using Selenium, and that worked at the time.

Pavel_47 · Jan-16-2021, 06:14 PM

I saw it just now: I didn't receive a notification about your response.
So is the Amazon Simple Product API immune to scraping/bot protection?
Is Selenium also immune to scraping/bot protection?
What is its advantage over "Amazon Simple Product API"?
I'm not familiar with Selenium.
Are there any examples on how to scrape Amazon?
Thanks.

Pavel_47 · Jan-16-2021, 06:22 PM

I had a look at the "Amazon Simple Product API" documentation.
It is quite poor.
For example, given an ISBN, how do I look up the book's properties (title, authors, etc.)?
I have not found anything that allows this to be done.

***snippsat*** · Jan-16-2021, 06:51 PM

If the API doesn't give you what you want, you can try to scrape.
Here is a setup with Selenium; you can still parse with either BS or Selenium.
Headless means the browser window is not loaded. If you are new to this, try both modes. With Selenium you can also enter text or push buttons,
which is not needed here, as we only want to parse out the book info.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

#--| Setup
options = Options()
# Uncomment the next two lines to run Chrome without opening a window
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
# executable_path is the Selenium 3 way; Selenium 4 takes the driver path via a Service object
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
url = "https://www.amazon.com/Advanced-ASP-NET-Core-Security-Vulnerabilities/dp/1484260163/ref=sr_1_1?dchild=1&keywords=9781484260166&qid=1610555878&s=books&sr=1-1"
browser.get(url)
time.sleep(2)  # give the page time to load
soup = BeautifulSoup(browser.page_source, 'lxml')
# Example of using both to parse
# BeautifulSoup version (uncomment to use)
#use_bs4 = soup.find('div', id="detailBullets_feature_div")
#print(use_bs4.text)
print('*' * 25)
# Selenium version
use_sel = browser.find_elements(By.CSS_SELECTOR, '#detailBulletsWrapper_feature_div')
print(use_sel[0].text)

Output:
*************************
Product details
Publisher : Apress; 1st ed. edition (October 30, 2020)
Language: : English
Paperback : 425 pages
ISBN-10 : 1484260163
ISBN-13 : 978-1484260166
Item Weight : 1.79 pounds
Dimensions : 7.01 x 0.97 x 10 inches
Best Sellers Rank: #317,930 in Books (See Top 100 in Books)
#79 in Microsoft .NET
#119 in Microsoft C & C++ Windows Programming
#398 in Computer Hacking
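
The text printed above is one "label : value" pair per line, so it can be turned into a dictionary with plain string handling. A minimal sketch, assuming the separator stays a colon followed by a space as in the output above; parse_product_details is a hypothetical helper name, and the sample string just copies a few of the printed lines:

```python
def parse_product_details(text):
    """Turn 'label : value' lines into a dict; lines without ': ' are skipped."""
    details = {}
    for line in text.splitlines():
        # Split on the first ": " only, so values containing colons stay whole.
        label, sep, value = line.partition(": ")
        if not sep:
            continue  # e.g. the "Product details" heading or the rank sub-lines
        # Labels arrive as "Publisher " and values as ": English", so trim both.
        details[label.strip(" :")] = value.strip(" :")
    return details


# Sample copied from the output above (shortened).
sample = """Product details
Publisher : Apress; 1st ed. edition (October 30, 2020)
Language: : English
Paperback : 425 pages
ISBN-10 : 1484260163
ISBN-13 : 978-1484260166"""

details = parse_product_details(sample)
print(details["ISBN-13"])
```

In the script above, the same helper could be fed use_sel[0].text instead of the sample string.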

Pavel_47 · (This post was last modified: Jan-17-2021, 09:56 AM by Pavel_47.)

Concerning the API: I don't know what it can return, because there is no documentation.
Moreover, it seems that to use this API you need an Amazon account, is that right?

Concerning Selenium: I'm trying to adapt your example to my Ubuntu machine.
Thanks.

Pavel_47 · Jan-17-2021, 12:39 PM

The Selenium code works.
The only problem: the browser opens.
Can the request be done without opening the browser?
Also, it seems that the part of the code that extracts the useful information doesn't use BeautifulSoup at all.
Why call it?

P.S. Today I tried to use my original code. For some time it worked properly, then yesterday's problem reappeared.

***snippsat*** · (This post was last modified: Jan-17-2021, 07:22 PM by snippsat.)

(Jan-17-2021, 12:39 PM)Pavel_47 Wrote: Can the request be done without opening browser.

(Jan-16-2021, 06:51 PM)snippsat Wrote: Headless mean that not loading browser

Just uncomment options.add_argument("--headless") and the browser will not load.
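
As a sketch of the same idea with the switch made explicit: the two helper names below (chrome_arguments, make_browser) are hypothetical, and make_browser assumes Selenium 4, where the chromedriver path is passed through a Service object rather than through executable_path as in the setup above.

```python
def chrome_arguments(headless=True, window_size=(1980, 1020)):
    """Build the Chrome command-line switches from the commented-out setup lines."""
    args = []
    if headless:
        args.append("--headless")  # run Chrome without opening a window
    width, height = window_size
    args.append(f"--window-size={width},{height}")
    return args


def make_browser(driver_path, headless=True):
    """Start Chrome through Selenium 4; driver_path points at your chromedriver."""
    # Imported here so chrome_arguments() can be tried without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)


print(chrome_arguments())
```

Calling make_browser(r'C:\cmder\bin\chromedriver.exe') would then give a headless browser, and make_browser(..., headless=False) one with a visible window.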

(Jan-17-2021, 12:39 PM)Pavel_47 Wrote: Also, it seems that part of code that extracts useful information doesn't use BeautifulSoup at all.

Same here, this is basic stuff: when you see a #, try uncommenting the line.
I show usage of both Selenium and BS; one is just commented out so you don't get output from both at the same time.

Pavel_47 Wrote:Concerning API: I don't know what it can get because there is no documentation.
Moreover, it seems that in order to use this API, one should have Amazon account, isn't it ?

Using an API is the same in most places: you need an account, and then you usually get free API keys that you can use.
Most commonly you get JSON back; you can look at this example, it's for YouTube, but the method is mostly the same in other places.
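
As a rough illustration of "getting JSON back": the response body is text that json.loads turns into Python dicts and lists. The payload below is made up for the example; the field names (items, title, authors, isbn13) and the helper find_by_isbn are hypothetical and not any real API's schema.

```python
import json

# Hypothetical response body; a real API defines its own field names.
response_text = """
{
  "items": [
    {
      "title": "Example Book",
      "authors": ["Example Author"],
      "isbn13": "9781484260166"
    }
  ]
}
"""


def find_by_isbn(payload, isbn13):
    """Return the first item whose isbn13 matches, or None if there is none."""
    for item in payload.get("items", []):
        if item.get("isbn13") == isbn13:
            return item
    return None


payload = json.loads(response_text)
book = find_by_isbn(payload, "9781484260166")
print(book["title"], book["authors"])
```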

Pavel_47 Wrote:Concerning API: I don't know what it can get because there is no documentation.

For the Amazon Simple Product API, the documentation is there, just scroll down.
There is also the official Amazon API doc.

Pavel_47 · Jan-18-2021, 09:16 AM

(Jan-17-2021, 07:22 PM)snippsat Wrote: The use API is the same in most places,so have to have a account then usually get free API-keys that can use.

It seems the account is not free: when I tried to create one, I was asked for a credit card number.

pjkaka · Jan-22-2021, 10:37 AM

(Jan-16-2021, 12:56 PM)Pavel_47 Wrote: Hello,
I observed a very strange phenomenon while scraping Amazon_dot_com: code that worked an hour earlier no longer works.
Any comments?
Thanks.

Most likely because Amazon stopped you from scraping. Amazon, IIRC, has protections against bots and will prevent abuse by shutting off access.
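
One way to confirm this from the script side is to check each response before parsing it. A rough sketch, under the assumption that a blocked request shows up as a non-200 status or as a page mentioning a CAPTCHA; the helper name looks_blocked and the marker strings are guesses for illustration, not documented Amazon behavior:

```python
# Assumed markers; adjust them after looking at what a blocked page really contains.
BLOCK_MARKERS = ("captcha", "robot check")


def looks_blocked(status_code, html):
    """Return True when a response looks like a bot-protection page."""
    if status_code != 200:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


print(looks_blocked(200, "<html><div id='detailBullets_feature_div'>...</div></html>"))
print(looks_blocked(503, ""))
print(looks_blocked(200, "<html>Please solve this CAPTCHA</html>"))
```

With requests this would be called as looks_blocked(response.status_code, response.text), which makes it easier to tell a block apart from a parsing bug when code that worked an hour ago stops working.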
