Python Forum
Requests_HTML not getting all data on Amazon
#1
I'm working on a review scraper and troubleshooting some code as a proof of concept that what I want can be done with requests_html. I'm running into an issue I don't understand: on seemingly random pages a lookup returns None and I get "'NoneType' object has no attribute 'text'", even though the page IS valid.

When I call each page individually, I get all the information I ask for: 10 sets of data corresponding to the 10 reviews per page, across 129 pages. But on certain pages (in this case page 30 and the last page, 129), it stops returning the information and returns None instead:

Quote: File "c:\Programs\Python\requests-html_test\test2.py", line 61, in <module>
print(amz.get_reviews(reviews))
File "c:\Programs\Python\requests-html_test\test2.py", line 27, in get_reviews
body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n','').strip() # exchange newlines with a space
AttributeError: 'NoneType' object has no attribute 'text'

Inspecting the elements for the questionable pages shows me no change in the HTML or CSS selectors for what I am pointing to.

This is the code I'm testing:

from requests_html import HTMLSession
import time


class Reviews:

    def __init__(self, asin, title) -> None:
        self.asin = asin
        self.title = title
        self.pagedata = HTMLSession()
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'}
        self.url = f'https://www.amazon.com/{self.title}/reviews/{self.asin}/ref=cm_cr_othr_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber='

    def pagination(self, page):
        r = self.pagedata.get(self.url + str(page), headers=self.headers)    # request the review page for this page number, sending the User-Agent header
        return r.html.find('div[data-hook=review]')    # get all review data

    def get_reviews(self, reviews):  # collects data from reviews, and appends them to total
        total = []
       
        for review in reviews:
            title = review.find('a[data-hook=review-title]', first=True).text
            rating = review.find('i[data-hook=review-star-rating] span', first=True).text
            body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n','').strip()  # remove newlines for more compact formatting

            data = {                                             #collecting data from for loop
                "title": title,
                'rating': rating,
                'body': body[:100]
            }

            total.append(data)
                
        return total





if __name__ == '__main__':

    with open('user_url.txt', "r") as file:  # opens a text file to pull the item's store page URL
        user_url = file.read()               # get url from txt file
        #print(user_url)
    
    _, _, _, title, _, asin, *_ = user_url.split("/")    #pulling <asin> and item <title> from given URL

   
    amz = Reviews(asin, title)  # Call with asin and title, needed to construct reviews page
    results = []                # to gather collected data

       
    for x in range(1, 29):  # pagination
        print('getting page ', x)
        time.sleep(1.0)  # a pause to test if slowing things down helps
        reviews = amz.pagination(x)
        results.append(amz.get_reviews(reviews))  # collecting reviews
       
    #reviews = amz.pagination(30)                 # to test pulling each page individually
                      
       
    #print(amz.get_reviews(reviews))
    print(results)
These are the relevant elements from the first page I can't seem to parse:

[Image: 28LK4nM.png]

Any help you can give would be appreciated.
#2
I was able to fix this with exception handling. I'm not sure why I was getting None in the first place, but I did get past it.
        for review in reviews:
            try:
                title = review.find('a[data-hook=review-title]', first=True).text
            except AttributeError:
                title = None
            try:
                rating = review.find('i[data-hook=review-star-rating] span', first=True).text
            except AttributeError:
                rating = None
            try:
                body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n','').strip()  # remove newlines
            except AttributeError:
                body = None

            data = {                                             # dictionary formatting the data with title hooks for the analyzer to link to
                "title": title,
                'rating': rating,
                'body': body
            }

            total.append(data)
               
        return total
Making this change got me past it.
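
The three try/except blocks could also be folded into one small helper that checks for None before touching .text. This is only a sketch: safe_text, parse_review and FakeElement are hypothetical names, and FakeElement is a stand-in that imitates find(selector, first=True) returning None on no match, so the example runs without hitting Amazon.

```python
def safe_text(element, selector, default=None):
    """Return the stripped text of the first match for selector, or default."""
    match = element.find(selector, first=True)
    if match is None:          # selector matched nothing in this review
        return default
    return match.text.strip()


def parse_review(review):
    """Build the same dict as the loop above, with None for missing fields."""
    body = safe_text(review, 'span[data-hook=review-body] span')
    return {
        'title': safe_text(review, 'a[data-hook=review-title]'),
        'rating': safe_text(review, 'i[data-hook=review-star-rating] span'),
        'body': body.replace('\n', '') if body is not None else None,
    }


class FakeElement:
    """Hypothetical stand-in for a requests_html element (demo only)."""

    def __init__(self, text='', children=None):
        self.text = text
        self._children = children or {}

    def find(self, selector, first=False):
        # Imitates find(..., first=True): the match, or None when absent.
        return self._children.get(selector)


# A review with a title and rating but no body span, like the failing pages.
review = FakeElement(children={
    'a[data-hook=review-title]': FakeElement('Great product'),
    'i[data-hook=review-star-rating] span': FakeElement('5.0 out of 5 stars'),
})
parsed = parse_review(review)
print(parsed)
```

In the real scraper, parse_review would be called on each item returned by pagination() instead of on a FakeElement.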

I hope this helps someone else with the same issue.
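
Since the cause of the None values is still unknown, tallying which field is missing on which page might help narrow it down. A rough sketch, assuming results is built as in the original loop (one list of review dicts per page, starting at page 1); count_missing and the sample data are made up for illustration.

```python
from collections import Counter


def count_missing(results):
    """Count None values per (page number, field name)."""
    missing = Counter()
    for page_number, page_reviews in enumerate(results, start=1):
        for review in page_reviews:
            for field, value in review.items():
                if value is None:
                    missing[(page_number, field)] += 1
    return missing


# Made-up sample: page 1 is complete, page 2 has gaps.
sample_results = [
    [{'title': 'A', 'rating': '5.0 out of 5 stars', 'body': 'Fine'}],
    [{'title': 'B', 'rating': '4.0 out of 5 stars', 'body': None},
     {'title': None, 'rating': '1.0 out of 5 stars', 'body': None}],
]
summary = count_missing(sample_results)
for (page_number, field), count in sorted(summary.items()):
    print(f'page {page_number}: {field} missing in {count} review(s)')
```

If the same field is always the one missing, the selector for that field is the first thing to check on those pages.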