Python Forum

Full Version: Code scrape more than one time information
I'm a beginner in Python and web scraping. My objective was to scrape 30 reviews from a TripAdvisor restaurant page. But when I open the file I have 301 reviews: the 30 reviews are repeated more than five times. Could you tell me what is wrong? What am I missing? This is my code:
with requests.Session() as s:
    for offset in range(10, 40):
        url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d947475-Reviews-or{offset}-Le_Bouclard-Paris_Ile_de_France.html'
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        reviews = soup.select('.reviewSelector')
        ids = [review.get('data-reviewid') for review in reviews]
        r = s.post(
            'https://www.tripadvisor.fr/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=',
            data={'reviews': ','.join(ids), 'contextChoice': 'DETAIL'},
            headers={'referer': r.url}
        )

        soup = bs(r.content, 'lxml')
        if not offset:
            inf_rest_name = soup.select_one('.heading').text.replace("\n", "").strip()
            rest_eclf = soup.select_one('.header_links a').text.strip()

        for review in soup.select('.reviewSelector'):
            name_client = review.select_one('.info_text > div:first-child').text.strip()
            date_rev_cl = review.select_one('.ratingDate')['title'].strip()
            titre_rev_cl = review.select_one('.noQuotes').text.strip()
            opinion_cl = review.select_one('.partial_entry').text.replace("\n", "").strip()
            row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}", f"{titre_rev_cl}", f"{opinion_cl}"]
            w.writerow(row)
I tried renaming the variable review to opinion_cl, because I thought that was the error, but I still get the same 301 reviews. I would appreciate your help.
Your loop runs 30 times, once for each number from 10 to 39.

Every number 10-19 gets redirected to 10, 20-29 get redirected to 20, and 30-39 get redirected to 30.
This means you scrape each of those pages 10 times, getting 10 duplicates of each review.

Maybe you meant for your loop to be for offset in range(10, 40, 10): instead?
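
To see where the duplicates come from, here is a minimal sketch. It assumes, as described above, that the site rounds every offset down to the nearest multiple of 10. The helper page_offset is hypothetical: it only models that redirect and does not contact TripAdvisor.

```python
from collections import Counter


def page_offset(offset, page_size=10):
    """Hypothetical model of the redirect: round down to a multiple of page_size."""
    return (offset // page_size) * page_size


# Original loop: every integer from 10 to 39
hits_original = Counter(page_offset(o) for o in range(10, 40))

# Suggested loop: only 10, 20 and 30
hits_fixed = Counter(page_offset(o) for o in range(10, 40, 10))

print(dict(hits_original))  # {10: 10, 20: 10, 30: 10}
print(dict(hits_fixed))     # {10: 1, 20: 1, 30: 1}
```

With the original range each page is requested 10 times; with a step of 10 each page is requested once.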
Thank you so much! It works perfectly. So does that mean that if I want to scrape reviews 220 to 890, I have to write "for offset in range(220, 890, 220)"? Is that right?
No, the third argument to range() is the step, which you want to be 10 (every tenth number), so the loop would be for offset in range(220, 890, 10):.
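
A quick way to check this is to print what range() produces for each step. This is plain Python and nothing here is specific to the site; the variable names are only for illustration.

```python
# Step of 220: jumps over almost every page
wrong = list(range(220, 890, 220))

# Step of 10: one offset per page of reviews
right = list(range(220, 890, 10))

print(wrong)       # [220, 440, 660, 880]
print(right[:3])   # [220, 230, 240]
print(right[-1])   # 880
print(len(right))  # 67
```

Note that the stop value is exclusive, so 890 itself is not visited. If the page at offset 890 is needed too, range(220, 900, 10) would include it.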
Great! Thank you again!
I have another question. I need another page that has at least 1000 reviewers. I ran the code at 10:40. Now it doesn't show any scraped information, and when I tried to run the code again it seems to be blocked. It doesn't respond. Is that normal? What can I do to unblock it, and how can I get the information faster?