Python Forum
Extracting Headers from Many Pages Quickly
#1
I am trying to scrape a particular header from several thousand pages across multiple domains, from behind a rather slow HTTPS proxy, and I want to optimize this as best I can. So far I'm using requests.head() for the actual connection, and multi-threading it to mitigate the proxy randomly not responding for a few seconds. My next plan is to try requests.Session to see if that makes the proxy happier. The issue is that I'm not sure how to thread that safely. I doubt that Session is thread-safe, but maybe I could assign one Session object per thread? How would I do that?

Am I massively overcomplicating this whole thing, and is there a better way? Opinions, please.
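
For reference, one common way to get a Session per thread is to keep it in threading.local and create it lazily the first time each worker needs one. The following is only a minimal sketch under assumptions, not a tested solution for this proxy: the names make_session, get_session, fetch_header and fetch_all, the proxy_url parameter, the 10-second timeout and the worker count are all hypothetical and chosen for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Each thread sees its own attributes on this object.
_thread_local = threading.local()


def make_session(proxy_url=None):
    """Build a requests.Session, optionally routed through a proxy (hypothetical helper)."""
    import requests  # third-party; imported here so only this function depends on it

    session = requests.Session()
    if proxy_url:
        session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


def get_session(factory=make_session):
    """Return the calling thread's session, creating it on first use."""
    session = getattr(_thread_local, "session", None)
    if session is None:
        session = factory()
        _thread_local.session = session
    return session


def fetch_header(url, header_name, factory=make_session, timeout=10):
    """Send a HEAD request and return (url, header value or None)."""
    try:
        response = get_session(factory).head(url, timeout=timeout)
        return url, response.headers.get(header_name)
    except Exception:
        # Broad catch keeps the sketch short; narrowing it to
        # requests.RequestException would be better in real code.
        return url, None


def fetch_all(urls, header_name, factory=make_session, max_workers=8):
    """Fetch one header from many URLs using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pairs = pool.map(lambda u: fetch_header(u, header_name, factory), urls)
        return dict(pairs)
```

Something like fetch_all(list_of_urls, "Server") would then return a dict mapping each URL to the header value, or None where the request failed. Because the pool reuses its threads, at most max_workers sessions are created, and each one can keep its connection to the proxy open between requests.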

Messages In This Thread
Extracting Headers from Many Pages Quickly - by OstermanA - Aug-30-2019, 07:47 PM
