Code scrapes the same information more than once
#1
I'm a beginner in Python and web scraping. My objective was to scrape 30 reviews from a TripAdvisor restaurant page. But when I open the file I have 301 reviews: the 30 reviews are repeated more than five times. Could you tell me what is wrong? What am I missing? This is my code:
with requests.Session() as s:
    for offset in range(10,40):
        url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d947475-Reviews-or{offset}-Le_Bouclard-Paris_Ile_de_France.html'
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        reviews = soup.select('.reviewSelector')
        ids = [review.get('data-reviewid') for review in reviews]
        r = s.post(
            'https://www.tripadvisor.fr/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=',
            data={'reviews': ','.join(ids), 'contextChoice': 'DETAIL'},
            headers={'referer': r.url}
        )

        soup = bs(r.content, 'lxml')
        if not offset:
            inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
            rest_eclf = soup.select_one('.header_links a').text.strip()

        for review in soup.select('.reviewSelector'):
            name_client = review.select_one('.info_text > div:first-child').text.strip()
            date_rev_cl = review.select_one('.ratingDate')['title'].strip()
            titre_rev_cl = review.select_one('.noQuotes').text.strip()
            opinion_cl = review.select_one('.partial_entry').text.replace("\n","").strip()
            row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}", f"{titre_rev_cl}", f"{opinion_cl}"]
            w.writerow(row)
I tried replacing the variable review with opinion_cl, because I thought that was the error, but I still get the same 301 reviews. I would appreciate your help.
#2
Your loop runs 30 times, once for each number from 10 through 39.

Every number 10-19 gets redirected to 10, 20-29 get redirected to 20, and 30-39 get redirected to 30.
This means you scrape each of those three pages 10 times, getting 10 copies of each review.

Maybe you meant for your loop to be for offset in range(10, 40, 10): instead?
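
Here is a small sketch of the difference between the two loops, with no network access involved. The page_start helper is hypothetical: it only models the redirect behaviour described above (each offset rounded down to a multiple of 10), which is an assumption about how the site treats those URLs.

```python
from collections import Counter


def page_start(offset: int, page_size: int = 10) -> int:
    """Hypothetical model of the redirect: round an offset down to its page start."""
    return (offset // page_size) * page_size


buggy_offsets = range(10, 40)      # 10, 11, 12, ..., 39
fixed_offsets = range(10, 40, 10)  # 10, 20, 30

# Count how many times each page would be fetched by each loop.
buggy_pages = Counter(page_start(o) for o in buggy_offsets)
fixed_pages = Counter(page_start(o) for o in fixed_offsets)

print(len(buggy_offsets), dict(buggy_pages))  # 30 {10: 10, 20: 10, 30: 10}
print(len(fixed_offsets), dict(fixed_pages))  # 3 {10: 1, 20: 1, 30: 1}
```

With the step of 10, each page is requested once, so each review should be written once.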
#3
Thank you so much! It works perfectly. So does that mean that if I want to scrape reviews 220 to 890, I have to write "for offset in range(220,890,220)"? Is that right?
#4
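No, the third argument to range() is the step, which you want to be 10 (every tenth number). The first two arguments are the start and the stop, so for your example it would be range(220, 890, 10). Note that the stop value itself is not included.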
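
To check what a given range() call produces before using it in the scraper, you can print it as a list. This is only an illustration using the numbers from the question; the variable names are made up for the example.

```python
# start=220, stop=890 (exclusive), step=10
offsets = list(range(220, 890, 10))
print(offsets[:3])   # [220, 230, 240]
print(offsets[-1])   # 880, because the stop value is excluded
print(len(offsets))  # 67

# To include offset 890 as well, move the stop past it.
inclusive = list(range(220, 890 + 1, 10))
print(inclusive[-1])  # 890

# A step of 220 would skip most pages.
too_sparse = list(range(220, 890, 220))
print(too_sparse)     # [220, 440, 660, 880]
```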
#5
Great! Thank you again!
#6
I have another question. I need another page that has at least 1000 reviewers. I ran the code at 10:40. Now it doesn't show any scraped information, and when I tried to run the code again it seems to be blocked. It doesn't respond. Is that normal? What can I do to unblock the code and collect the information faster?