Apr-28-2021, 05:09 AM
Using Windows 10 Pro, Python 3.8.8.
I have a long list of URLs to loop through to scrape. Each URL has up to 100 pages to navigate through.
This is the process:
Outer loop - loop thru the URLs (each one on a new thread)
Inner loop - loop thru the pages
For each page I use a new proxy and user agent so I don't have to wait 20 seconds before going to the next page.
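The outer/inner loop structure can be sketched like this (a minimal stdlib-only sketch; the URL list, page count, and the fetch_page stub are placeholders for the real requests.get call made with a fresh proxy and agent per page):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in -- the real code calls requests.get here
# with a new proxy_dict and headers for every page.
def fetch_page(url, page):
    return f"<html>{url} page {page}</html>"  # stub response body

def scrape_url(url, max_pages=3):
    """Inner loop: walk the pages of one URL."""
    results = []
    for page in range(1, max_pages + 1):
        results.append(fetch_page(url, page))
    return results

# Outer loop: one thread per URL.
urls = ["http://example.com/a", "http://example.com/b"]
with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    all_results = list(pool.map(scrape_url, urls))
```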
When I start running more than 3 threads at the same time, I get problems accessing the data with xpath. Here's the code from the inner loop, where I'm having problems:
# (assumes earlier in the script: import time, requests; from lxml import html)
response = requests.get(page_url, proxies=proxy_dict, headers=headers, timeout=5, verify=False)
time.sleep(.2) # I play around with this to see if longer sleeps give better success. Not really...
data = response.text
# Test to see if we pulled a page with data
if len(data) < 1000:
    raise Exception("requests.get succeeded but did not return a valid page")

# Scrape page here
tree = html.fromstring(data)
line_numbers = tree.xpath("//h2[@class='n']")

# This test fails when running more than 3 threads.
# Note: the 'data' variable above is fully loaded with good html.
# If we were running too many threads, I would expect to get an exception
# in getting the response, or 'data' would be empty or corrupt.
# But 'data' is good and the tree.xpath fails as it's not fetching any data.
if len(line_numbers) == 0:
    raise Exception("len(line_numbers) == 0. There should be at least one")

This problem is somewhat random: it can fail when running 4 or more threads, but one time it worked great running 10 threads at the same time with no errors. It scraped 850 pages in 160 seconds (5.3 pages per second).
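For reference, the failing check itself can be reproduced in isolation. Here is a minimal sketch of the same test using the stdlib's xml.etree in place of lxml (the sample page body and the 'n' class value are just stand-ins for a real response):

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.html in this sketch

# Hypothetical page body -- in the real code, 'data' is response.text.
data = "<html><body><h2 class='n'>Item 1</h2><h2 class='x'>Other</h2></body></html>"

tree = ET.fromstring(data)
# ElementTree supports this limited XPath subset, including [@class='n'].
line_numbers = tree.findall(".//h2[@class='n']")

if len(line_numbers) == 0:
    raise Exception("len(line_numbers) == 0. There should be at least one")
```

If this check passes on the same 'data' outside the threads, the parse itself is fine and the problem is more likely in how the threads share state.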
If we can fix the xpath problem, I expect we can do better than that.
Thank you.