Requests_HTML not getting all data on Amazon

Requests_HTML not getting all data on Amazon - aaander - Nov-15-2022

I'm working on a review scraper and testing some code as a proof of concept that what I want can be done with requests_html. I'm running into an issue I don't understand: on seemingly random pages, the script raises AttributeError: 'NoneType' object has no attribute 'text', even though the page itself is valid.

When I request each page individually, I get all the information I ask for: 10 sets of data, one per review, with 10 reviews per page, for 129 pages. On certain pages, though (in this case page 30 and the last page, 129), the lookup stops returning the elements I asked for and returns None instead:

Quote:
  File "c:\Programs\Python\requests-html_test\test2.py", line 61, in <module>
    print(amz.get_reviews(reviews))
  File "c:\Programs\Python\requests-html_test\test2.py", line 27, in get_reviews
    body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n','').strip() # exchange newlines with a space
AttributeError: 'NoneType' object has no attribute 'text'

Inspecting the elements for the questionable pages shows me no change in the HTML or CSS selectors for what I am pointing to.

This is the code I'm testing:

from requests_html import HTMLSession
import time


class Reviews:

    def __init__(self, asin, title) -> None:
        self.asin = asin
        self.title = title
        self.pagedata = HTMLSession()
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'}
        self.url = f'https://www.amazon.com/{self.title}/reviews/{self.asin}/ref=cm_cr_othr_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber='

    def pagination(self, page):
        r = self.pagedata.get(self.url + str(page), headers=self.headers)    # request the reviews URL for the current page, sending the User-Agent header
        return r.html.find('div[data-hook=review]')    # get all review data

    def get_reviews(self, reviews):  # collects data from reviews, and appends them to total
        total = []
       
        for review in reviews:
            title = review.find('a[data-hook=review-title]', first=True).text
            rating = review.find('i[data-hook=review-star-rating] span', first=True).text
            body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n','').strip()  # remove newlines for more compact formatting

            data = {                                             # collect the fields for this review
                "title": title,
                'rating': rating,
                'body': body[:100]
            }

            total.append(data)
                
        return total


if __name__ == '__main__':

    with open('user_url.txt', "r") as file:  # opens a text file to pull the item's store page URL
        user_url = file.read().strip()       # get the URL from the text file, dropping any trailing newline
        #print(user_url)

    _, _, _, title, _, asin, *_ = user_url.split("/")    # pull the item <title> and <asin> from the given URL

   
    amz = Reviews(asin, title)  # Call with asin and title, needed to construct reviews page
    results = []                # to gather collected data

       
    for x in range(1, 29):  # pagination
        print('getting page ', x)
        time.sleep(1.0)  # a pause to test if slowing things down helps
        reviews = amz.pagination(x)
        results.append(amz.get_reviews(reviews))  # collecting reviews
       
    #reviews = amz.pagination(30)                 # to test pulling each page individually
                      
       
    #print(amz.get_reviews(reviews))
    print(results)

These are the relevant elements from the first page I can't seem to parse:

[Image: 28LK4nM.png]

Any help you can give would be appreciated.

RE: Requests_HTML not getting all data on Amazon - aaander - Nov-19-2022

I was able to fix this with exception handling. I'm still not sure why I was getting None in the first place.

        for review in reviews:
            try:
                title = review.find('a[data-hook=review-title]', first=True).text
            except AttributeError:
                title = None
            try:
                rating = review.find('i[data-hook=review-star-rating] span', first=True).text
            except AttributeError:
                rating = None
            try:
                body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n','').strip()  # remove newlines for more compact formatting
            except AttributeError:
                body = None

            data = {                                             # dictionary formatting the data with title hooks for the analyzer to link to
                "title": title,
                'rating': rating,
                'body': body
            }

            total.append(data)
               
        return total

Making this change got me past the error.
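
The three try/except blocks repeat the same pattern, so they could probably be folded into one small helper. This is only a sketch: text_or_none is a hypothetical name, not part of requests_html, and it assumes the element exposes find(selector, first=True) returning either an element with a .text attribute or None. The FakeElement class is a stand-in so the example runs without fetching a page.

```python
def text_or_none(element, selector):
    """Return the stripped text of the first match for selector, or None if nothing matches."""
    match = element.find(selector, first=True)
    return match.text.strip() if match is not None else None


class FakeElement:
    """Minimal stand-in for a requests_html element (hypothetical, for demonstration only)."""

    def __init__(self, text="", children=None):
        self.text = text
        self.children = children or {}  # maps selector -> FakeElement

    def find(self, selector, first=False):
        return self.children.get(selector)


# A review that has a title but no body matching the selector.
review = FakeElement(children={
    "a[data-hook=review-title]": FakeElement("Great product "),
})

data = {
    "title": text_or_none(review, "a[data-hook=review-title]"),
    "body": text_or_none(review, "span[data-hook=review-body] span"),
}
print(data)  # {'title': 'Great product', 'body': None}
```

Checking for None explicitly, rather than catching AttributeError, also avoids hiding unrelated attribute errors raised elsewhere on the same line.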

I hope this helps someone else with the same issue.
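
For anyone who wants to find out where the None values come from, one approach is to record which selector comes back empty for each review and then look at the markup that was actually received for it. Below is a minimal sketch under the same assumption as before (each review exposes find(selector, first=True)); missing_fields and the SELECTORS mapping are hypothetical names for this example, and FakeReview only exists so it runs without a network request.

```python
# Selectors copied from get_reviews(); the dict keys are labels chosen for this sketch.
SELECTORS = {
    "title": "a[data-hook=review-title]",
    "rating": "i[data-hook=review-star-rating] span",
    "body": "span[data-hook=review-body] span",
}


def missing_fields(review, selectors=SELECTORS):
    """Return the labels whose selector matched nothing inside this review."""
    return [
        label
        for label, css in selectors.items()
        if review.find(css, first=True) is None
    ]


class FakeReview:
    """Stand-in for a requests_html element (hypothetical, for demonstration only)."""

    def __init__(self, present):
        self.present = set(present)  # selectors that should "match"

    def find(self, css, first=False):
        return object() if css in self.present else None


# A review whose title does not match the selector the scraper uses.
odd_review = FakeReview([SELECTORS["rating"], SELECTORS["body"]])
print(missing_fields(odd_review))  # ['title']
```

In the real script, calling missing_fields(review) inside the loop and printing review.html whenever the result is non-empty should show what was served for that particular review, which may differ from what the browser inspector displays.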