Apr-28-2021, 05:09 AM
Using Windows 10 Pro, Python 3.8.8.
I have a long list of URLs to loop through to scrape. Each URL has up to 100 pages to navigate through.
This is the process:
Outer loop - loop thru the URLs (each one on a new thread)
Inner loop - loop thru the pages
For each page I use a new proxy and user agent so I don't have to wait 20 seconds before going to the next page.
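The outer/inner loop structure can be sketched like this (a minimal stdlib-only sketch; the URL list, page count, and the fetch_page stub are placeholders for the real requests.get call made with a fresh proxy and agent per page):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in -- the real code calls requests.get here
# with a new proxy_dict and headers for every page.
def fetch_page(url, page):
    return f"<html>{url} page {page}</html>"  # stub response body

def scrape_url(url, max_pages=3):
    """Inner loop: walk the pages of one URL."""
    results = []
    for page in range(1, max_pages + 1):
        results.append(fetch_page(url, page))
    return results

# Outer loop: one thread per URL.
urls = ["http://example.com/a", "http://example.com/b"]
with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    all_results = list(pool.map(scrape_url, urls))
```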
When I start running more than 3 threads at the same time, I get problems accessing the data with xpath. Here's the code from the inner loop, where I'm having problems:
# (assumes earlier in the script: import time, requests; from lxml import html)
response = requests.get(page_url, proxies=proxy_dict, headers=headers, timeout=5, verify=False)
time.sleep(.2) # I play around with this to see if longer sleeps give better success. Not really...
data = response.text
# Test to see if we pulled a page with data
if len(data) < 1000:
    raise Exception("requests.get succeeded but did not return a valid page")

# Scrape page here
tree = html.fromstring(data)
line_numbers = tree.xpath("//h2[@class='n']")

# This test fails when running more than 3 threads.
# Note: the 'data' variable above is fully loaded with good html.
# If we were running too many threads, I would expect to get an exception
# in getting the response, or 'data' would be empty or corrupt.
# But 'data' is good and the tree.xpath fails as it's not fetching any data.
if len(line_numbers) == 0:
    raise Exception("len(line_numbers) == 0. There should be at least one")

This problem is somewhat random: it can fail when running 4 or more threads, but one time it worked great running 10 threads at the same time with no errors. It scraped 850 pages in 160 seconds (5.3 pages per second).
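For reference, the failing check itself can be reproduced in isolation. Here is a minimal sketch of the same test using the stdlib's xml.etree in place of lxml (the sample page body and the 'n' class value are just stand-ins for a real response):

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.html in this sketch

# Hypothetical page body -- in the real code, 'data' is response.text.
data = "<html><body><h2 class='n'>Item 1</h2><h2 class='x'>Other</h2></body></html>"

tree = ET.fromstring(data)
# ElementTree supports this limited XPath subset, including [@class='n'].
line_numbers = tree.findall(".//h2[@class='n']")

if len(line_numbers) == 0:
    raise Exception("len(line_numbers) == 0. There should be at least one")
```

If this check passes on the same 'data' outside the threads, the parse itself is fine and the problem is more likely in how the threads share state.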
If we can fix the xpath problem, I expect we can do better than that.
Thank you.