Nov-15-2022, 11:07 PM
I'm working on a review scraper, and I'm troubleshooting some code to get a proof of concept that what I want can be done with requests_html. I'm running into an issue I don't understand: on seemingly random pages I get an AttributeError ('NoneType' object has no attribute 'text'), even though the page IS valid.
When I call each page individually, I get all the information I asked for: 10 sets of data corresponding to the 10 reviews per page, across 129 pages. But on certain pages, in this case page 30 and the last page, 129, it stops returning the information I asked for and returns None instead:
Quote:

      File "c:\Programs\Python\requests-html_test\test2.py", line 61, in <module>
        print(amz.get_reviews(reviews))
      File "c:\Programs\Python\requests-html_test\test2.py", line 27, in get_reviews
        body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n','').strip() # exchange newlines with a space
    AttributeError: 'NoneType' object has no attribute 'text'
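
As far as I can tell, `find(..., first=True)` gives back `None` when the selector matches nothing, so chaining `.text` onto it is what raises the error. Below is a minimal sketch of a None-safe lookup I could use to keep the loop running and see which field is missing. The names `safe_text`, `FakeNode`, and `FakeReview` are hypothetical (they are not part of requests_html); the fake classes only stand in for a review element so the sketch runs without hitting the site.

```python
from typing import Optional


def safe_text(element, selector: str, default: Optional[str] = None) -> Optional[str]:
    """Return the stripped text of the first match, or `default` if nothing matches."""
    match = element.find(selector, first=True)  # None when the selector matches nothing
    if match is None:
        return default
    return match.text.strip()


class FakeNode:
    """Hypothetical stand-in for a matched element; it only carries `.text`."""

    def __init__(self, text: str) -> None:
        self.text = text


class FakeReview:
    """Hypothetical stand-in for a review element (assumption: same find() shape)."""

    def __init__(self, fields: dict) -> None:
        self._fields = fields  # maps selector -> text content

    def find(self, selector: str, first: bool = False):
        text = self._fields.get(selector)
        if text is None:
            return None if first else []
        node = FakeNode(text)
        return node if first else [node]


# A review that has a title but no body span, like the pages that fail for me.
review = FakeReview({"a[data-hook=review-title]": "Great product\n"})
title = safe_text(review, "a[data-hook=review-title]")
body = safe_text(review, "span[data-hook=review-body] span", default="")
print(repr(title), repr(body))
```

With a helper like this, a missing element shows up as an empty value in the results instead of stopping the whole run, which should make it easier to spot which reviews differ.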
Inspecting the elements on the problem pages shows no change in the HTML or in the CSS selectors I'm targeting.
This is the code I'm testing:

    from requests_html import HTMLSession
    import time


    class Reviews:
        def __init__(self, asin, title) -> None:
            self.asin = asin
            self.title = title
            self.pagedata = HTMLSession()
            self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'}
            self.url = f'https://www.amazon.com/{self.title}/reviews/{self.asin}/ref=cm_cr_othr_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber='

        def pagination(self, page):
            r = self.pagedata.get(self.url + str(page))  # construct review url with current page
            return r.html.find('div[data-hook=review]')  # get all review data

        def get_reviews(self, reviews):
            # collects data from reviews, and appends them to total
            total = []
            for review in reviews:
                title = review.find('a[data-hook=review-title]', first=True).text
                rating = review.find('i[data-hook=review-star-rating] span', first=True).text
                body = review.find('span[data-hook=review-body] span', first=True).text.replace('\n','').strip()  # exchange newlines with a space for smaller formatting
                data = {  # collecting data from for loop
                    'title': title,
                    'rating': rating,
                    'body': body[:100]
                }
                total.append(data)
            return total


    if __name__ == '__main__':
        with open('user_url.txt', "r") as file:  # opens a text file to pull the item's store page URL
            user_url = file.read()  # get url from txt file
        #print(user_url)
        _, _, _, title, _, asin, *_ = user_url.split("/")  # pulling <asin> and item <title> from given URL
        amz = Reviews(asin, title)  # Call with asin and title, needed to construct reviews page
        results = []  # to gather collected data
        for x in range(1, 29):  # pagination
            print('getting page ', x)
            time.sleep(1.0)  # a pause to test if slowing things down helps
            reviews = amz.pagination(x)
            results.append(amz.get_reviews(reviews))  # collecting reviews
        #reviews = amz.pagination(30)  # to test pulling each page individually
        #print(amz.get_reviews(reviews))
        print(results)

These are the relevant elements from the first page I can't seem to parse:
![[Image: 28LK4nM.png]](https://i.imgur.com/28LK4nM.png)
Any help you can give would be appreciated.