Python Forum
Need help multi-threading scraping
#1
Using Windows 10 Pro, Python 3.8.8.

I have a long list of URLs to loop through to scrape. Each URL has up to 100 pages to navigate through.

This is the process:

Outer loop - loop thru the URLs (each one on a new thread)
Inner loop - loop thru the pages
For each page I use a new proxy and agent so I don't have to wait 20 seconds before going to the next page
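
The process above can be sketched roughly as follows. This is only a minimal outline, not the poster's actual code: `scrape_url`, `scrape_all`, and the `fetch_page` callable are hypothetical names, and stopping at the first page that yields no items is an assumption. `fetch_page(base_url, page_no)` stands in for whatever picks a fresh proxy and agent, requests the page, and returns the scraped items.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_PAGES = 100  # from the post: each URL has up to 100 pages


def scrape_url(base_url, fetch_page, max_pages=MAX_PAGES):
    """Inner loop: walk the pages of one URL.

    Assumption: a page that yields no items marks the end of that URL.
    """
    items = []
    for page_no in range(1, max_pages + 1):
        # fetch_page is a hypothetical callable supplied by the caller
        page_items = fetch_page(base_url, page_no)
        if not page_items:
            break
        items.extend(page_items)
    return items


def scrape_all(urls, fetch_page, max_workers=4):
    """Outer loop: one task per URL, at most max_workers threads at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape_url, url, fetch_page): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure and keep the other URLs running
                results[url] = exc
    return results
```

Keeping the thread count in a single `max_workers` argument makes it easy to compare runs with 3, 4, or 10 threads, and storing the exception per URL shows which URLs failed instead of stopping the whole run.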


When I start running more than 3 threads at the same time, I get problems accessing the data with xpath. Here's the code in the inner loop where I'm having problems:


# Imports this excerpt relies on (html is assumed to be lxml's html module)
import time
import requests
from lxml import html

response = requests.get(page_url, proxies=proxy_dict, headers=headers, timeout=5, verify=False)
time.sleep(0.2)  # I play around with this to see if longer sleeps give better success. Not really...
data = response.text

# Test to see if we pulled a page with data
if len(data) < 1000:
    raise Exception("requests.get Succeeded but did not return a valid page")

# Scrape page here
tree = html.fromstring(data)
line_numbers = tree.xpath("//h2[@class='n']")

# This test fails when running more than 3 threads.
# Note: the 'data' variable above is fully loaded with good HTML.
#       If we were running too many threads, I would expect an exception
#       when getting the response, or 'data' to be empty or corrupt.
#       But 'data' is good and tree.xpath fails, as it's not fetching any data.
if len(line_numbers) == 0:
    raise Exception("len(line_numbers) == 0.  There should be at least one")
This problem is somewhat random: it can fail when running 4 or more threads, but one time it worked great running 10 threads at the same time with no errors. It scraped 850 pages in 160 seconds, or about 5.3 pages per second.
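
Since the length test only shows that some page came back, one way to investigate is to keep the exact HTML from a failing request and open it afterwards. The sketch below is an assumption-laden aid, not a fix: `has_expected_markup` and `dump_failed_page` are hypothetical helpers, the `failed_pages` folder name is made up, and the regular expression is only a rough textual stand-in for the XPath `//h2[@class='n']`.

```python
import hashlib
import re
from pathlib import Path

# Rough textual approximation of //h2[@class='n']; it is not an HTML parser
# and will miss variants such as unquoted attribute values.
H2_N = re.compile(r"<h2\b[^>]*\bclass\s*=\s*[\"']n[\"']", re.IGNORECASE)


def has_expected_markup(data):
    """Return True if the raw HTML appears to contain an <h2 class="n"> tag."""
    return H2_N.search(data) is not None


def dump_failed_page(data, page_url, out_dir="failed_pages"):
    """Write HTML that produced no XPath matches to disk for later inspection."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # A short hash of the URL gives a stable, filesystem-safe file name
    name = hashlib.sha1(page_url.encode("utf-8")).hexdigest()[:12] + ".html"
    path = folder / name
    path.write_text(data, encoding="utf-8")
    return path
```

Calling `dump_failed_page(data, page_url)` just before raising the `len(line_numbers) == 0` exception would leave a file to open in a browser. If `has_expected_markup(data)` is also False for that page, the markup was never in the response, which would point at the response content and away from the xpath call.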

If we can fix the xpath problem I expect we can do better than that.

Thank you.
#2
(Apr-28-2021, 05:09 AM)spacedog Wrote: I get problem accessing the data with xpath
What kind of problem? Is there an error message? Do you get the wrong data?
#3
The data is OK and pulls as expected from "requests.get". The problem is that xpath returns nothing.

data is OK here
data = response.text
tree = html.fromstring(data)
line_numbers = tree.xpath("//h2[@class='n']")
line_numbers is empty here, where normally it has a list of elements.

More info:

Using VS Code and its Watch window, I see that the variable "tree" also looks OK:
len(tree.body):12
If there was a problem with "tree" I would expect the value to be 0 or None. When I put the xpath expression in the watch window it shows an empty list:

tree.xpath("//h2[@class='n']//a//@href"): []
I hope this helps to clarify.