 Extracting Headers from Many Pages Quickly
#1
I am trying to scrape one particular header from several thousand pages across multiple domains, from behind a rather slow HTTPS proxy, and I want to optimize it as best I can. So far I'm using requests.head() for the actual connection, and multi-threading it to mitigate the proxy randomly not responding for a few seconds. My next plan is to try requests.Session to see if that makes the proxy happier. The issue is that I'm not sure how to thread that safely. I doubt that Session is thread-safe, but maybe I could assign one Session object per thread? How would I do that?

Am I massively overcomplicating this whole thing, and is there a better way? Opinions, please.
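
For reference, one common way to give each thread its own Session is to keep it in threading.local(). The following is only a minimal sketch of that idea, not a tested solution for this proxy: the proxy URL, the header name, and the helper names (get_session, fetch_header, fetch_all) are placeholders invented for illustration, and the timeout values are arbitrary.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical values: replace with the real proxy and the header you want.
PROXIES = {"https": "http://proxy.example:3128"}
HEADER_NAME = "Server"

# Each thread sees its own independent attributes on this object.
_thread_local = threading.local()


def get_session():
    """Return the Session owned by the calling thread, creating it on first use."""
    session = getattr(_thread_local, "session", None)
    if session is None:
        session = requests.Session()
        session.proxies.update(PROXIES)
        _thread_local.session = session
    return session


def fetch_header(url):
    """Send a HEAD request; return (url, header value), or (url, None) on failure."""
    try:
        # timeout is (connect seconds, read seconds); values chosen arbitrarily.
        response = get_session().head(url, timeout=(5, 15))
    except requests.RequestException:
        return url, None
    return url, response.headers.get(HEADER_NAME)


def fetch_all(urls, workers=16):
    """Fetch the header for every URL using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch_header, urls))
```

Calling fetch_all(list_of_urls) would return a dict mapping each URL to the header value, with None for requests that failed. Because the pool reuses its worker threads, each worker creates one Session and keeps using it, so connections to the proxy can be reused within that thread.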
#2
(Aug-30-2019, 07:47 PM)OstermanA Wrote: and multi-threading it to mitigate the proxy randomly not responding for a few seconds.
It sounds like you should be passing a timeout to requests.head() instead.
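
Something along these lines, as a rough sketch. The timeout values, the retry count, the default header name, and the function name head_with_timeout are all assumptions for illustration; tune them to how long the proxy actually stalls.

```python
import requests

# Assumed limits in seconds; adjust to the proxy's real behaviour.
CONNECT_TIMEOUT = 5
READ_TIMEOUT = 15


def head_with_timeout(url, header_name="Server", attempts=3):
    """Return the requested header, retrying when the request times out."""
    for _ in range(attempts):
        try:
            response = requests.head(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
        except requests.Timeout:
            continue  # the proxy stalled; try again
        except requests.RequestException:
            return None  # any other failure: give up on this URL
        return response.headers.get(header_name)
    return None  # every attempt timed out
```

Note that the timeout in requests is not a limit on the total time of the request. It bounds how long to wait for the connection and how long to wait between bytes from the server, which is usually what you want for a proxy that goes silent for a few seconds.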