Python Forum

Full Version: asyncio not being faster than synchronous calls
Hello,

I'm trying to download data from the Scopus API via Python (https://dev.elsevier.com/documentation/A...alAPI.wadl).
Since we're talking about hundreds of thousands of requests, this takes ages synchronously.
My goal is now to translate this to an asynchronous script while still respecting my key's limitation of 3 requests per second.

I measured that a single request takes 0.66 seconds on average (with outliers down to 0.2 seconds and up to 2 minutes).
That means 3 sequential requests take about 3 x 0.66 ≈ 2 seconds, whereas if I used my key optimally (3 concurrent requests per second) those same 3 requests would take only 1 second.
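To put numbers on my 500-request test (mentioned below), here is my back-of-envelope estimate, assuming the 0.66-second average holds:

AVG_REQUEST_S = 0.66  # measured average time per request
RATE_LIMIT = 3        # requests per second allowed by my key
N_REQUESTS = 500      # size of my test run

sync_time = N_REQUESTS * AVG_REQUEST_S  # one request at a time: ~330 s
async_time = N_REQUESTS / RATE_LIMIT    # 3 in flight per second: ~167 s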

My script is based on this: https://github.com/rednafi/think-async/b...request.py
However, my asynchronous script is slightly slower than the synchronous one (testing with 500 requests).

I can only guess that my issue has to do with my semaphore and how long I make the script sleep.

My logic is that I am allowed three requests (semaphore = 3) per second (hence asyncio.sleep(1)).
I don't really understand why that GitHub script sleeps twice: 1.5 seconds in safe_make_request and 1 second in make_request.
I also don't understand the author's logic in this comment:

Quote:This script makes 30 GET requests to a URL. However, it sends them in a batch of
10 requests and sleeps for 2 seconds between subsequent requests. The effective
concurrency is roughly 3 requests per second.

Can someone explain how the author arrives at 3 requests per second if he sends batches of 10 requests and sleeps for 2 seconds?
Am I making a silly mistake in my code, or is asyncio/semaphore not the way forward?
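To make my intended logic concrete, here is the minimal pattern I thought would cap throughput at 3 requests per second (Scopus-specific details stripped out; fetch and main are just illustrative names):

import asyncio
import httpx

async def fetch(client, url, limit):
    # At most 3 coroutines hold the semaphore at once, and each one
    # keeps its slot for at least 1 second, so throughput should stay
    # at or below 3 requests per second.
    async with limit:
        response = await client.get(url)
        await asyncio.sleep(1)  # hold the slot for the rest of the second
        return response

async def main(urls):
    limit = asyncio.Semaphore(3)
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(fetch(client, url, limit) for url in urls))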

My full code:

import asyncio
import time
import httpx
import os

async def request_author_async(query, headers, params):
    # a fresh client is opened for every single request
    async with httpx.AsyncClient(verify=False, timeout=None) as client:
        response = await client.get(query, headers=headers, params=params)
        await asyncio.sleep(1)  # sleep while still holding the semaphore slot
        return response


async def safe_request_author_async(author_id, limit, headers, params):
    query = "https://api.elsevier.com/content/author/author_id/{}".format(author_id)
    async with limit:

        result = await request_author_async(query, headers, params)

        if limit.locked():
            # all three slots are in use; wait a little over a second
            # print("\nlimit reached, sleeping for 1 second...\n")
            await asyncio.sleep(1.1)

        return result


async def download_authors_async(headers, params):
    limit = asyncio.Semaphore(3)
    tasks = [safe_request_author_async(author_id, limit, headers, params) for author_id in author_ids]

    responses = await asyncio.gather(*tasks)
    for response in responses:
        authors_async.append(response.json())


if __name__ == '__main__':

    author_ids = ["10038760800", "10039696900", "10040274300", "10040489100"]

    Headers = {"X-ELS-Insttoken": "ENTER_TOKEN_HERE",
               "X-ELS-APIKey": "ENTER_APIKEY_HERE",
               "Accept": "application/json"}
    Params = {"view": "METRICS"}

    authors_async = []
    start = time.time()
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    asyncio.run(download_authors_async(Headers, Params))
    print("time spent asynchronously: {:.1f} seconds".format(time.time() - start))
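For reference, the synchronous version I'm comparing against boils down to this (a simplified sketch, not my exact script):

import httpx

def download_authors_sync(author_ids, headers, params):
    authors = []
    # one shared client, one request at a time
    with httpx.Client(verify=False, timeout=None) as client:
        for author_id in author_ids:
            query = "https://api.elsevier.com/content/author/author_id/{}".format(author_id)
            response = client.get(query, headers=headers, params=params)
            authors.append(response.json())
    return authors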
Thanks for any help!

Regards,
Mikis