Python Forum
Web Scraping efficiency improvement
#1
Is there any way to speed up a web scraper by having multiple computers contribute to processing a list of urls? For example, computer A takes urls 1-500 and computer B takes urls 501-1000, and so on. I am looking for a way to build the fastest possible web scraper with resources available to everyday people.
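To make the question concrete, the kind of split I have in mind is each computer taking a deterministic slice of the same url list, running the same script, and merging results afterwards. A rough sketch (MACHINE_INDEX and MACHINE_COUNT are hypothetical settings I would configure per box, not anything from my current script):

import os

# Hypothetical per-machine settings: index 0 on computer A, index 1 on computer B, etc.
MACHINE_INDEX = int(os.environ.get('MACHINE_INDEX', '0'))
MACHINE_COUNT = int(os.environ.get('MACHINE_COUNT', '1'))

product_number_list = ['N82E16820232476', 'N82E16820233852', 'N82E16820313777']
base_url = 'https://www.newegg.com/Product/Product.aspx?Item={}'
url_list = [base_url.format(number) for number in product_number_list]

# Every machine builds the same full list, then keeps only its own share.
my_urls = url_list[MACHINE_INDEX::MACHINE_COUNT]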

I am already using the grequests module for concurrent requests, which is gevent + requests combined. If anyone knows any way to help at all I will greatly appreciate it.
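For reference, the basic grequests pattern boils down to building unsent requests and mapping them concurrently, roughly like this minimal sketch:

import grequests

urls = ['https://www.newegg.com/', 'https://www.bestbuy.com/']
# grequests.get builds unsent requests; grequests.map sends them concurrently via gevent.
pending = (grequests.get(url) for url in urls)
responses = grequests.map(pending, size=20)  # size caps how many run at once
for response in responses:
    if response is not None:  # map returns None for requests that failed outright
        print(response.status_code, response.url)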

This scraping does not need to run constantly, but at a specific time each morning (6 A.M.), and it should finish as soon as possible after it starts (if that info helps). I am looking for something quick and punctual.
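For the 6 A.M. start I'm assuming either a cron / Task Scheduler entry or just having the script sleep until the right time, something like this rough sketch:

import datetime
import time

def wait_until(hour=6, minute=0):
    # Sleep until the next occurrence of the given local time, then return.
    now = datetime.datetime.now()
    run_at = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if run_at <= now:
        run_at += datetime.timedelta(days=1)
    time.sleep((run_at - now).total_seconds())

wait_until(6, 0)
# ...start the scrape here...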

Also, I am scraping urls for retail stores (e.g. Target, Best Buy, Newegg, etc.) and using the results to check what items are in stock for the day.
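The stock check itself would just be scanning each response for an out-of-stock marker, something like the sketch below (the 'OUT OF STOCK' string is a placeholder; the real marker differs per retailer and would need checking against the actual page markup):

import requests

def looks_in_stock(html):
    # Placeholder check: assume the item is in stock unless an out-of-stock marker appears.
    # 'OUT OF STOCK' is an assumed marker, not taken from the real Newegg markup.
    return 'OUT OF STOCK' not in html.upper()

response = requests.get('https://www.newegg.com/Product/Product.aspx?Item=N82E16820232476')
print(looks_in_stock(response.text))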

This is a code segment for grabbing those urls in the script I'm trying to put together:

import datetime
import time
import grequests

thread_number = 20
# Product number list is a list of product numbers, too big for me to include the full list. Here are like three:
product_number_list = ['N82E16820232476', 'N82E16820233852', 'N82E16820313777']
# nnn controls how often progress is printed; max(1, ...) avoids a modulo-by-zero on short lists.
nnn = max(1, int(len(product_number_list) / 100))
float_nnn = len(product_number_list) / 100
base_url = 'https://www.newegg.com/Product/Product.aspx?Item={}'
url_list = []
for number in product_number_list:
    url_list.append(base_url.format(number))
# The above three lines create the list of urls.
results = []
appended_number = 0
for x in range(0, len(url_list), thread_number):
    attempts = 0
    while attempts < 10:
        try:
            rs = (grequests.get(url, stream=False) for url in url_list[x:x + thread_number])
            reqs = grequests.map(rs, stream=False, size=thread_number)
            append = 'yes'
            for i in reqs:
                # grequests.map returns None for requests that failed outright.
                if i is None or i.status_code != 200:
                    append = 'no'
                    print('Bad Status Code. Nothing Appended.')
                    attempts += 1
                    break
            if append == 'yes':
                appended_number += 1
                results.extend(reqs)
                break
        except Exception:
            print('Something went Wrong. Try Section Failed.')
            attempts += 1
            time.sleep(5)
    if appended_number % nnn == 0:
        now = datetime.datetime.today()
        print(str(int(thread_number * appended_number / float_nnn)) + '% of the way there at: ' + now.strftime("%I:%M:%S %p"))
    if attempts == 10:
        print('Failed ten times to get urls.')
        time.sleep(3600)
if len(results) != len(url_list):
    print('Results count is off. len(results) == "' + str(len(results)) + '". len(url_list) == "' + str(len(url_list)) + '".')
Any way to speed this up (and yes, I am removing the print statements later) is appreciated.

P.S. This is not my code, I grabbed part of it from here: https://stackoverflow.com/questions/4620...-grequests
and here: https://stackoverflow.com/questions/2197...-get-max-r