Extracting Headers from Many Pages Quickly

OstermanA · (This post was last modified: Aug-30-2019, 07:47 PM by OstermanA.)

I am trying to scrape one particular header from several thousand pages across multiple domains, all from behind a rather slow HTTPS proxy, and I want to optimize this as best I can. So far I'm using requests.head() for the actual connection, and multi-threading it to mitigate the proxy randomly not responding for a few seconds. My next plan is to try requests.Session to see if connection reuse makes the proxy happier. The issue is that I'm not sure how to thread that safely. I can't believe that Session is thread-safe, but maybe I could assign one Session object per thread? How would I do that?

Am I massively overcomplicating this whole thing? Is there a better way? Opinions, please.
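
For the "one Session per thread" idea, a minimal sketch using threading.local might look like the following. The names get_session, head_header, and scrape, the "Server" header default, and the worker count are all hypothetical choices for illustration, not anything taken from the original poster's code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

# Each thread sees its own attributes on this object.
_thread_state = threading.local()


def get_session():
    """Return the calling thread's own Session, creating it on first use."""
    session = getattr(_thread_state, "session", None)
    if session is None:
        session = requests.Session()
        _thread_state.session = session
    return session


def head_header(url, header_name="Server"):
    """HEAD one URL and return (url, header value or None)."""
    try:
        response = get_session().head(url, timeout=10)
    except requests.exceptions.RequestException:
        # Timeouts, proxy errors, and connection errors all land here.
        return url, None
    return url, response.headers.get(header_name)


def scrape(urls, max_workers=8):
    """Run head_header over all URLs on a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(head_header, urls))
```

Because the executor reuses its worker threads, each worker builds its Session once and then keeps reusing its connections, so no Session is ever shared between threads.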

metulburr · Aug-31-2019, 11:43 AM

(Aug-30-2019, 07:47 PM) OstermanA wrote: and multi-threading it to mitigate the proxy randomly not responding for a few seconds.

This sounds like you should be using the timeout parameter of requests.head() instead.
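
As a rough sketch of that suggestion: requests accepts a timeout argument, either a single number or a (connect, read) tuple. The function name fetch_header and the specific timeout values below are assumptions for illustration and would need tuning for the proxy in question.

```python
import requests

# Hypothetical values: seconds to wait for the connection, then for the response.
CONNECT_TIMEOUT = 3.05
READ_TIMEOUT = 10


def fetch_header(url, header_name, proxies=None):
    """Return one response header's value, or None if the request fails."""
    try:
        response = requests.head(
            url,
            proxies=proxies,
            timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
        )
    except requests.exceptions.RequestException:
        # Covers timeouts as well as proxy and connection errors.
        return None
    return response.headers.get(header_name)
```

Without a timeout, requests will wait indefinitely on a stalled proxy, so a bounded wait plus a retry of the failed URLs may be simpler than relying on threads alone to hide the stalls.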

OstermanA · Oct-01-2019, 08:01 AM

So, a bit of necromancy, but it turns out that requests.Session() can maintain multiple connection pools, one for each domain you connect to, recycling pools if the number of domains exceeds the limit. In my case, threads failed to return results when I tried to open more connections than the pool's maximum size, but I was able to set the number of pools equal to the number of domains, spawn a thread for each domain, then have each domain thread spawn pool_maxsize child threads. The final results were much faster than anything I've ever seen go through this network, so I was quite pleased. Unfortunately, I don't think I'll be allowed to share code as this was for work, but I hope this helps anyone who faces a similar issue in the future.
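
Since the original code can't be shared, here is a hedged reconstruction of the approach described above, not the poster's actual implementation. It assumes the pools are configured through requests.adapters.HTTPAdapter (pool_connections for the number of per-host pools, pool_maxsize for connections per pool). The helper names, the POOL_MAXSIZE value, and the timeout values are hypothetical. Note that it shares one Session across threads, which is what the post describes but which I can't vouch for as officially guaranteed thread-safe.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

import requests
from requests.adapters import HTTPAdapter

POOL_MAXSIZE = 10  # hypothetical connections (and child threads) per domain


def group_by_domain(urls):
    """Map each hostname to the list of URLs that live on it."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).netloc].append(url)
    return dict(groups)


def build_session(domain_count, proxies=None):
    """Create a Session whose adapter can hold one pool per domain."""
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=domain_count, pool_maxsize=POOL_MAXSIZE)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    if proxies:
        session.proxies.update(proxies)
    return session


def head_one(session, url, header_name):
    """HEAD one URL and return (url, header value or None)."""
    try:
        response = session.head(url, timeout=(3.05, 10))
    except requests.exceptions.RequestException:
        return url, None
    return url, response.headers.get(header_name)


def scrape_domain(session, urls, header_name):
    """Child threads for one domain, capped at the pool's maximum size."""
    with ThreadPoolExecutor(max_workers=POOL_MAXSIZE) as workers:
        return list(workers.map(lambda u: head_one(session, u, header_name), urls))


def scrape_all(urls, header_name, proxies=None):
    """One parent thread per domain; returns {url: header value or None}."""
    groups = group_by_domain(urls)
    if not groups:
        return {}
    session = build_session(len(groups), proxies)
    results = {}
    with ThreadPoolExecutor(max_workers=len(groups)) as domain_workers:
        per_domain = domain_workers.map(
            lambda group: scrape_domain(session, group, header_name),
            groups.values(),
        )
        for pairs in per_domain:
            results.update(pairs)
    session.close()
    return results
```

Capping each domain's child threads at POOL_MAXSIZE keeps the number of simultaneous connections per host within what its pool can hold, which matches the constraint described in the post.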
