Jun-01-2018, 08:52 PM
Is there any way to speed up a web-scraper by having multiple computers contribute to processing a list of urls? Like computer A takes urls 1 - 500 and computer B takes urls 501 - 1000, etc. I am looking for a way to build the fastest possible web scraper with resources available to everyday people.
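To make the split concrete, here is a minimal sketch of the kind of thing I have in mind. It assumes each computer is told its own `machine_index` and the total `machine_count`; both names, the `shard` helper and the example.com urls are hypothetical and made up for illustration. It takes every Nth url rather than contiguous blocks like 1 - 500 and 501 - 1000, but the idea is the same: each machine works on its own share without talking to the others.

```python
def shard(urls, machine_index, machine_count):
    """Return the part of urls that this machine is responsible for."""
    if not 0 <= machine_index < machine_count:
        raise ValueError('machine_index must be in range(machine_count)')
    # Striding keeps the shards balanced even when len(urls) is not
    # a multiple of machine_count.
    return urls[machine_index::machine_count]


if __name__ == '__main__':
    # Hypothetical placeholder urls; the real list would be built
    # from the product numbers.
    urls = ['https://example.com/item/{}'.format(i) for i in range(10)]
    for index in range(3):
        print(index, shard(urls, index, 3))
```

Each computer would run the same script with a different `machine_index`, and the results would still have to be collected in one place afterwards, which this sketch does not cover.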
I am already making concurrent requests with the grequests module, which is gevent + requests combined. If anyone knows any way to help at all, I will greatly appreciate it.
This scraping does not need to run constantly, only at a specific time each morning (6 A.M.), and it should finish as soon after it starts as possible (if that info helps). I am looking for something quick and punctual.
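For the timing part, one possible approach (only a sketch, and an OS scheduler such as cron or Task Scheduler may well be the better tool) is to work out how long to wait until the next 6 A.M. and sleep for that long before starting. The `seconds_until` helper below is a hypothetical name of my own, and it uses naive local time, so it ignores daylight-saving changes.

```python
import datetime


def seconds_until(hour, minute=0, now=None):
    """Seconds from `now` until the next hour:minute in local time."""
    if now is None:
        now = datetime.datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # That time has already passed today, so aim for tomorrow.
        target += datetime.timedelta(days=1)
    return (target - now).total_seconds()


if __name__ == '__main__':
    print('Seconds until 6 A.M.:', seconds_until(6))
```

The script could then call `time.sleep(seconds_until(6))` right before the scraping loop starts.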
Also, I am looking through urls for retail stores (e.g. target, bestbuy, newegg), and using them to check what items are in stock for the day.
This is a code segment for grabbing those urls in the script I'm trying to put together:
```python
import datetime
import time

import grequests

thread_number = 20  # number of urls requested concurrently per batch

# Product number list is a list of product numbers, too big for me to
# include the full list. Here are like three:
product_number_list = ['N82E16820232476', 'N82E16820233852', 'N82E16820313777']
base_url = 'https://www.newegg.com/Product/Product.aspx?Item={}'

# Build the list of urls, one per product number.
url_list = [base_url.format(number) for number in product_number_list]

# Report progress roughly every 1% of the list (never less than every batch).
nnn = max(1, int(len(product_number_list) / 100))

results = []
appended_number = 0  # number of batches fetched successfully
for x in range(0, len(url_list), thread_number):
    attempts = 0
    while attempts < 10:
        try:
            rs = (grequests.get(url, stream=False) for url in url_list[x:x + thread_number])
            reqs = grequests.map(rs, stream=False, size=thread_number)
            # grequests.map returns None for a request that failed outright.
            if all(r is not None and r.status_code == 200 for r in reqs):
                appended_number += 1
                results.extend(reqs)
                break
            print('Bad Status Code. Nothing Appended.')
            attempts += 1
        except Exception:
            print('Something went Wrong. Try Section Failed.')
            attempts += 1
            time.sleep(5)
    if appended_number % nnn == 0:
        now = datetime.datetime.today()
        percent = int(100 * len(results) / len(url_list))
        print(str(percent) + '% of the way there at: ' + now.strftime("%I:%M:%S %p"))
    if attempts == 10:
        print('Failed ten times to get urls.')
        time.sleep(3600)

if len(results) != len(url_list):
    print('Results count is off. len(results) == "' + str(len(results))
          + '". len(url_list) == "' + str(len(url_list)) + '".')
```

Any way to speed this up (and yes, I am removing the print statements later) is appreciated.
P.S. This is not my code, I grabbed part of it from here: https://stackoverflow.com/questions/4620...-grequests
and here: https://stackoverflow.com/questions/2197...-get-max-r