Python Forum
Strange phenomena with Amazon_dot_com scraping
#1
Hello,
I observed a very strange phenomenon while scraping Amazon_dot_com: code that worked an hour earlier no longer works.
Any comments?
Thanks.
#2
Did you read my post here?
Amazon has some scraping/bot protection, as they want most people to use their API.
There are Python packages like Amazon Simple Product API.
Some years ago I did some simple testing on Amazon using Selenium, and it worked at that time.
#3
I saw it just now: I didn't receive a notification about your response.
So is the Amazon Simple Product API immune to scraping/bot protection?
Is Selenium also immune to scraping/bot protection?
What is its advantage over "Amazon Simple Product API"?
I'm not familiar with Selenium.
Are there any examples of how to scrape Amazon?
Thanks.
#4
I had a look at the "Amazon Simple Product API" documentation.
It is quite poor.
For example, given an ISBN number, how do I look up the book's properties (title, authors, etc.)?
I have not found anything that allows this to be done.
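One possible starting point for the ISBN question, as a rough sketch only: the product URL posted later in this thread contains dp/1484260163, which is the ISBN-10 of the book whose ISBN-13 is 978-1484260166. Assuming that pattern holds for print books in general (an assumption, not something Amazon documents here), an ISBN-13 can be converted to an ISBN-10 and turned into a product URL. The function names below are hypothetical helpers, not part of any library.

```python
def isbn13_to_isbn10(isbn13: str) -> str:
    """Convert a 978-prefixed ISBN-13 to its ISBN-10 form."""
    digits = [c for c in isbn13 if c.isdigit()]
    if len(digits) != 13 or digits[:3] != list("978"):
        raise ValueError("only 978-prefixed ISBN-13 values map to an ISBN-10")
    core = digits[3:12]  # drop the 978 prefix and the ISBN-13 check digit
    # ISBN-10 check digit: weights 10..2 over the nine core digits, modulo 11
    total = sum((10 - i) * int(d) for i, d in enumerate(core))
    check = (11 - total % 11) % 11
    return "".join(core) + ("X" if check == 10 else str(check))


def amazon_book_url(isbn13: str) -> str:
    # Assumption: the product id in the /dp/ path equals the ISBN-10
    return "https://www.amazon.com/dp/" + isbn13_to_isbn10(isbn13)


print(amazon_book_url("978-1484260166"))
# prints https://www.amazon.com/dp/1484260163
```

The resulting URL still has to be fetched and parsed, and it is subject to the same bot protection discussed in this thread.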
#5
If the API doesn't give you what you want, you can try to scrape.
Here is a setup with Selenium; you can still parse with either BS or Selenium.
Headless means the browser window is not loaded. If you are new to this, try both modes. With Selenium you can also enter text or push buttons,
which is not needed here, as we only want to parse out book info.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

#--| Setup
options = Options()
#options.add_argument("--headless")
#options.add_argument("--window-size=1980,1020")
browser = webdriver.Chrome(executable_path=r'C:\cmder\bin\chromedriver.exe', options=options)
#--| Parse or automation
url = "https://www.amazon.com/Advanced-ASP-NET-Core-Security-Vulnerabilities/dp/1484260163/ref=sr_1_1?dchild=1&keywords=9781484260166&qid=1610555878&s=books&sr=1-1"
browser.get(url)
time.sleep(2)
soup = BeautifulSoup(browser.page_source, 'lxml')
# Example of using both to parse
#use_bs4 = soup.find('div', id="detailBullets_feature_div")
#print(use_bs4.text)
print('*' * 25)
# Selenium 3 API; Selenium 4 replaces this with browser.find_elements(By.CSS_SELECTOR, ...)
use_sel = browser.find_elements_by_css_selector('#detailBulletsWrapper_feature_div')
print(use_sel[0].text)
Output:
*************************
Product details
Publisher : Apress; 1st ed. edition (October 30, 2020)
Language: : English
Paperback : 425 pages
ISBN-10 : 1484260163
ISBN-13 : 978-1484260166
Item Weight : 1.79 pounds
Dimensions : 7.01 x 0.97 x 10 inches
Best Sellers Rank: #317,930 in Books (See Top 100 in Books)
#79 in Microsoft .NET
#119 in Microsoft C & C++ Windows Programming
#398 in Computer Hacking
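The scraped text is easier to work with as a dictionary. Here is a minimal sketch, assuming the element's text comes back with one "label : value" field per line. The helper name parse_details and the set of characters stripped are my own choices for illustration, and the real page may contain extra invisible characters around the colons.

```python
# Characters trimmed from labels and values: spaces, stray colons, and
# left-to-right/right-to-left marks (assumed to appear on the live page)
STRIP_CHARS = " :\u200e\u200f"


def parse_details(text: str) -> dict:
    """Turn 'label : value' lines into a dict; lines without a colon are skipped."""
    details = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # e.g. the 'Product details' heading or sub-rank lines
        details[label.strip(STRIP_CHARS)] = value.strip(STRIP_CHARS)
    return details


sample = """Product details
Publisher : Apress; 1st ed. edition (October 30, 2020)
Language: : English
Paperback : 425 pages
ISBN-10 : 1484260163
ISBN-13 : 978-1484260166
Best Sellers Rank: #317,930 in Books (See Top 100 in Books)
#79 in Microsoft .NET"""

book = parse_details(sample)
print(book["ISBN-13"], "|", book["Paperback"])
```

In the Selenium script, the same function could be called on use_sel[0].text instead of the sample string.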
#6
Concerning the API: I don't know what it can return because there is no documentation.
Moreover, it seems that to use this API one must have an Amazon account, is that right?

Concerning Selenium: I'm trying to adapt your example to my Ubuntu machine.
Thanks.
#7
The Selenium code works.
The only problem: a browser window opens.
Can the request be made without opening the browser?
Also, it seems that the part of the code that extracts the useful information doesn't use BeautifulSoup at all.
Why call it?

P.S. Today I tried my original code again. For some time it worked properly, then yesterday's problem reappeared.
#8
(Jan-17-2021, 12:39 PM)Pavel_47 Wrote: Can the request be done without opening browser.
(Jan-16-2021, 06:51 PM)snippsat Wrote: Headless mean that not loading browser
Just uncomment options.add_argument("--headless") and the browser will not load.
(Jan-17-2021, 12:39 PM)Pavel_47 Wrote: Also, it seems that part of code that extracts useful information doesn't use BeautifulSoup at all.
Same here, this is basic stuff: when you see a #, try uncommenting the line.
I show usage of both Selenium and BS; one is commented out so you don't get the output of both at the same time.
Pavel_47 Wrote:Concerning API: I don't know what it can get because there is no documentation.
Moreover, it seems that in order to use this API, one should have Amazon account, isn't it ?
Using an API works the same way in most places: you need an account, and then you usually get free API keys to use.
The most common result is JSON; you can look at this example, which is for YouTube, but the method is mostly the same in other places.
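To illustrate the JSON part, here is a small sketch using only the standard library. The response body and its field names (items, title, isbn13) are invented for illustration; a real API defines its own structure, so check its documentation.

```python
import json

# Hypothetical response body; these field names are made up for the example
raw_response = """
{
  "items": [
    {"title": "Example Book One", "isbn13": "9781484260166"},
    {"title": "Example Book Two", "isbn13": "9780306406157"}
  ]
}
"""


def titles_by_isbn(raw: str) -> dict:
    """Map each item's ISBN-13 to its title."""
    data = json.loads(raw)  # JSON text -> Python dicts and lists
    return {item["isbn13"]: item["title"] for item in data["items"]}


lookup = titles_by_isbn(raw_response)
print(lookup["9781484260166"])
```

With a real API the raw text would come from an HTTP response, usually requested with your API key attached.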
Pavel_47 Wrote:Concerning API: I don't know what it can get because there is no documentation.
The Amazon Simple Product API documentation is there, just scroll down.
The official Amazon API doc.
#9
(Jan-17-2021, 07:22 PM)snippsat Wrote: The use API is the same in most places,so have to have a account then usually get free API-keys that can use.
It seems the account is not free: when I tried to create one, I was asked for a credit card number.
#10
(Jan-16-2021, 12:56 PM)Pavel_47 Wrote: Hello,
I observed a very strange phenomenon while scraping Amazon_dot_com: code that worked an hour earlier no longer works.
Any comments?
Thanks.

Most likely because Amazon stopped you from scraping. Amazon, IIRC, has protections against bots and will prevent abuse by shutting off access.
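A rough way to notice when this happens, as a sketch only: check each response for signs of a block page before trying to parse it, and wait longer between retries. The status codes and marker strings below are guesses at what a block might look like, not documented Amazon behaviour, and both function names are hypothetical.

```python
# Assumed signs of a block page; adjust to what you actually receive
BLOCK_STATUS_CODES = (429, 503)
BLOCK_MARKERS = ("captcha", "robot check")


def looks_blocked(status_code: int, html: str) -> bool:
    """Heuristic: True if the response resembles a bot-protection page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def backoff_delays(retries: int, base: float = 2.0) -> list:
    """Exponentially growing waits in seconds: base, 2*base, 4*base, ..."""
    return [base * 2 ** attempt for attempt in range(retries)]


print(looks_blocked(200, "<html>Product details</html>"))
print(backoff_delays(3))
```

This would explain code that works for a while and then stops: the parsing is fine, but the page returned is no longer the product page.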