Python Forum

HTTP Header request, how to improve efficiency
Just trying to figure out how to improve the efficiency of a routine that uses Scrapy for fetching URLs.
Although caching is in place, the more URLs there are, the longer the operation takes, and for large lists of URLs the processing time is becoming unacceptable.
Multithreading has brought some benefit, but I'm still far from optimal performance.

What would you do to improve things? I initially thought of using memoization, but since the technique is new to me, I'm not sure whether it would bring any benefit. The idea is that, since memoization stores both the call and its computed result, I could avoid hitting the disk to check whether a cached URL has already been processed.
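A minimal sketch of that idea, assuming Python 3's functools.lru_cache and the requests library purely for illustration (the actual fetching here goes through Scrapy, and fetch_headers is a hypothetical helper, not code from this thread):

import functools
import requests  # illustration only; the routine discussed here uses Scrapy

@functools.lru_cache(maxsize=None)
def fetch_headers(url):
    # The first call for a URL hits the network; any repeated call with the
    # same URL returns the stored result from memory, with no disk lookup.
    return requests.head(url, allow_redirects=True).headers

headers = fetch_headers("http://example.com/page")
headers_again = fetch_headers("http://example.com/page")  # served from memory

Note that such an in-memory cache only lives for the duration of the process, so it complements rather than replaces an on-disk cache.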



What are your thoughts? Any other advice is really appreciated.
Caching and memoization won't help if your URLs are mostly unique. Are the URLs you're accessing all on the same server / domain? For a particular server, you can keep an open HTTP connection - https://docs.python.org/2/library/httpli...Connection
You can also try something using http2 - https://hyper.readthedocs.io/en/latest/
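For reference, a minimal sketch of reusing one open connection for several requests, using Python 3's http.client (httplib in Python 2, as linked above); the host and paths are made up:

from http.client import HTTPConnection  # httplib.HTTPConnection in Python 2

conn = HTTPConnection("example.com")   # one TCP connection for the whole loop
for path in ("/a", "/b", "/c"):
    conn.request("HEAD", path)
    resp = conn.getresponse()
    resp.read()                        # drain the response so the connection can be reused
    print(path, resp.status, resp.getheader("Content-Type"))
conn.close()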
@micseydel thanks for jumping into the conversation. Yes, the URLs are on the same server and are meant to be unique.
However, the list is regularly updated every time a new link is encountered, and previously crawled URLs could end up included again.
A check is in place to avoid that scenario, but yes, the big bottleneck is the HTTP protocol.

As for HTTP/2, it may not necessarily help, as not all the servers out there support it.
Persistent HTTP connections are already in place; as far as I understand, Scrapy already uses them in its retrieval framework.
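For what it's worth, the relevant knobs in Scrapy look roughly like this (the setting names are Scrapy's own; the values are only illustrative and not tuned for any particular site):

# settings.py - illustrative values only

HTTPCACHE_ENABLED = True                  # the on-disk cache already in place
HTTPCACHE_DIR = "httpcache"

# Scrapy's scheduler drops requests whose fingerprint has already been seen,
# so re-encountered links are filtered before any network call is made.
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"  # this is the default

# With everything on one server, the per-domain limit is the one that matters.
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0                        # raise this if the server pushes back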
Are you maxing out the bandwidth on the host or server machine? I was assuming that you're I/O bound and not CPU bound, is that correct? And I'm assuming you're not using so much memory as to be using disk swap or anything.
Correct, I'm I/O bound. Server resources (once this script is moved there) can be stretched across a pool. However, even using proxies, I may still be stuck processing the data.
So I was attempting to reduce the number of superfluous calls.
(Apr-24-2017, 06:56 PM)andreamoro Wrote: caching is in place
(Apr-28-2017, 08:38 AM)andreamoro Wrote: I was attempting to reduce the number of superfluous calls
I'm struggling to understand: what kind of superfluous calls are you trying to reduce, on top of the caching you already have in place? Could you give a concrete example (code) of the kind of caching you want help implementing?