Thread: Requests_HTML not getting all data on Amazon (/thread-38701.html)
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
Requests_HTML not getting all data on Amazon - aaander - Nov-15-2022

I'm working on a review scraper, and I'm testing some code as a proof of concept that what I want can be done in requests_html. I'm running into an issue I don't understand: on seemingly random pages I get a "'NoneType' object has no attribute" error, even though the page IS valid.

When I called each page individually, I got all the information I asked for: 10 sets of data corresponding to the 10 reviews per page, for 129 pages. When I loop through the pages and reach a certain page (in this case page 30, and the last page, 129), it stops returning the information I asked for and returns NoneType instead:

Quote: File "c:\Programs\Python\requests-html_test\test2.py", line 61, in <module>

Inspecting the elements on the problem pages shows no change in the HTML or in the CSS selectors I am pointing to.

This is the code I'm testing:

    from requests_html import HTMLSession
    import time


    class Reviews:
        # Takes asin and title explicitly; the original used *args and then
        # read asin/title from module-level globals by accident.
        def __init__(self, asin, title) -> None:
            self.asin = asin
            self.title = title
            self.pagedata = HTMLSession()
            # Note: these headers are defined but never passed to get() below.
            self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'}
            self.url = f'https://www.amazon.com/{self.title}/reviews/{self.asin}/ref=cm_cr_othr_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber='

        def pagination(self, page):
            r = self.pagedata.get(self.url + str(page))  # construct review url with current page
            return r.html.find('div[data-hook=review]')  # get all review data

        def get_reviews(self, reviews):
            # collects data from reviews, and appends them to total
            total = []
            for review in reviews:
                title = review.find('a[data-hook=review-title]', first=True).text
                rating = review.find('i[data-hook=review-star-rating] span', first=True).text
                # remove newlines for more compact formatting
                body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n', '').strip()
                data = {  # collecting data from the for loop
                    'title': title,
                    'rating': rating,
                    'body': body[:100]
                }
                total.append(data)
            return total


    if __name__ == '__main__':
        with open('user_url.txt', "r") as file:  # opens a text file to pull the item's store page URL
            user_url = file.read()  # get url from txt file
        # print(user_url)
        _, _, _, title, _, asin, *_ = user_url.split("/")  # pulling <asin> and item <title> from given URL
        amz = Reviews(asin, title)  # call with asin and title, needed to construct reviews page
        results = []  # to gather collected data
        for x in range(1, 29):  # pagination (pages 1 through 28)
            print('getting page ', x)
            time.sleep(1.0)  # a pause to test if slowing things down helps
            reviews = amz.pagination(x)
            results.append(amz.get_reviews(reviews))  # collecting reviews
        # reviews = amz.pagination(30)  # to test pulling each page individually
        # print(amz.get_reviews(reviews))
        print(results)

These are the relevant elements from the first page I can't seem to parse:

Any help you can give would be appreciated.

RE: Requests_HTML not getting all data on Amazon - aaander - Nov-19-2022

I was able to fix this with exception handling. I'm not sure why I was getting the NoneTypes in the first place, but I did get past it.

    # inside get_reviews(), replacing the original loop body
    for review in reviews:
        try:
            title = review.find('a[data-hook=review-title]', first=True).text
        except AttributeError:
            title = None
        try:
            rating = review.find('i[data-hook=review-star-rating] span', first=True).text
        except AttributeError:
            rating = None
        try:
            # remove newlines from the review body
            body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n', '').strip()
        except AttributeError:
            body = None
        data = {  # dictionary formatting the data with title hooks for the analyzer to link to
            'title': title,
            'rating': rating,
            'body': body
        }
        total.append(data)
    return total

Making this change got me past it. I hope this helps someone else with the same issue.
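For anyone landing here later, the same idea can be written without repeating the try/except for every field. This is only a sketch, not the poster's code: text_or_none and parse_review are hypothetical helper names, and it relies on requests_html returning None from find(..., first=True) when nothing matches, which is what triggers the AttributeError above. The selectors are the ones from the posts.

```python
from typing import Optional


def text_or_none(element, selector: str) -> Optional[str]:
    """Return the text of the first match for `selector`, or None if nothing matches.

    Hypothetical helper: `element` is assumed to behave like a requests_html
    element, i.e. element.find(selector, first=True) returns None on no match.
    """
    match = element.find(selector, first=True)
    if match is None:
        return None
    return match.text


def parse_review(review) -> dict:
    """Build one review record, leaving missing fields as None."""
    body = text_or_none(review, 'span[data-hook=review-body] span')
    if body is not None:
        body = body.replace('\n', '').strip()  # remove newlines from the body
    return {
        'title': text_or_none(review, 'a[data-hook=review-title]'),
        'rating': text_or_none(review, 'i[data-hook=review-star-rating] span'),
        'body': body,
    }
```

Inside get_reviews() the loop would then reduce to total.append(parse_review(review)). Checking for None explicitly also makes it easy to print or log which page and which field came back empty, which may help in tracking down why some reviews lack the expected elements.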