Python Forum

Full Version: Need help multi-threading scraping
Using Windows 10 Pro, Python 3.8.8.

I have a long list of URLs to loop through to scrape. Each URL has up to 100 pages to navigate through.

This is the process:

Outer loop - loop through the URLs (each one on a new thread)
Inner loop - loop through the pages
For each page I use a new proxy and user agent so I don't have to wait 20 seconds before going to the next page
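
For readers following along, the two-level loop described above could be sketched roughly as follows. This is only an illustration under assumptions: `scrape_page`, `scrape_url`, `scrape_all` and `max_workers` are hypothetical names that do not appear in the original post, and `scrape_page` is a stand-in that does no network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_PAGES = 100  # from the description: each URL has up to 100 pages


def scrape_page(base_url, page_number):
    """Hypothetical stand-in for the real request-and-parse step."""
    return f"{base_url}?page={page_number}"


def scrape_url(base_url, max_pages=MAX_PAGES):
    """Inner loop: walk the pages of one URL sequentially."""
    results = []
    for page_number in range(1, max_pages + 1):
        results.append(scrape_page(base_url, page_number))
    return results


def scrape_all(urls, max_workers=4, max_pages=MAX_PAGES):
    """Outer loop: one task per URL, run on at most max_workers threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be keyed by URL.
        futures = {pool.submit(scrape_url, url, max_pages): url for url in urls}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Capping the pool with `max_workers` makes it easy to compare runs with 3, 4 or 10 threads without changing the rest of the code.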


When I start running more than 3 threads at the same time, I get problems accessing the data with xpath. Here's the code in the inner loop, where I'm having problems:


# Imports assumed from the calls used below (not shown in the original post)
import time
import requests
from lxml import html

response = requests.get(page_url, proxies=proxy_dict, headers=headers, timeout=5, verify=False)
time.sleep(0.2)  # I play around with this to see if longer sleeps give better success. Not really...
data = response.text

# Test to see if we pulled a page with data
if len(data) < 1000:
    raise Exception("requests.get succeeded but did not return a valid page")

# Scrape page here
tree = html.fromstring(data)
line_numbers = tree.xpath("//h2[@class='n']")

# This test fails when running more than 3 threads.
# Note: The 'data' variable above is fully loaded with good HTML.
#       If we were running too many threads, I would expect to get an
#       exception when getting the response, or 'data' would be empty or corrupt.
#       But 'data' is good and tree.xpath fails, as it's not fetching any data.
if len(line_numbers) == 0:
    raise Exception("len(line_numbers) == 0. There should be at least one")
This problem is somewhat random: it can fail when running 4 or more threads, but one time it worked great running 10 threads at the same time with no errors. It scraped 850 pages in 160 seconds, about 5.3 pages per second.

If we can fix the xpath problem I expect we can do better than that.

Thank you.
(Apr-28-2021, 05:09 AM) spacedog wrote: I get problem accessing the data with xpath
What kind of problem? Is there an error message? Do you get the wrong data?
The data is OK and is pulled as expected by "requests.get". The problem is that xpath returns nothing.

data is OK here:
data = response.text
tree = html.fromstring(data)
line_numbers = tree.xpath("//h2[@class='n']")
line_numbers is empty here, where normally it holds a list of elements.

More info:

Using VS Code and its Watch window, I see that the variable "tree" also looks OK:
len(tree.body): 12
If there was a problem with "tree" I would expect the value to be 0 or None. When I put the xpath expression in the watch window it shows an empty list:

tree.xpath("//h2[@class='n']//a//@href"): []
I hope this helps to clarify.
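
Since `data` passes the length check and parses into a tree but the xpath still matches nothing, one way to see what the server actually sent is to write the failing page to disk and open it in a browser or editor. The helper below is only a sketch: `save_failed_page` and the `failed_pages` folder are hypothetical names, not part of the original code.

```python
import hashlib
from pathlib import Path


def save_failed_page(data, page_url, out_dir="failed_pages"):
    """Write the raw HTML of a page that yielded no xpath matches to disk.

    The file name is derived from a hash of the URL, so threads working on
    different pages do not overwrite each other's files.
    """
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha1(page_url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = folder / name
    path.write_text(data, encoding="utf-8")
    return path
```

In the inner loop it could be called as `save_failed_page(data, page_url)` just before raising the "len(line_numbers) == 0" exception, so that each failure leaves a file behind to compare against a page that scraped correctly.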