Python Forum

Frequency and timing of psycopg2 commits
I'm building a tool to do the initial import of data for another app. The data will be read or inferred from the contents of a JSON file.

The total number of rows inserted will be in the tens of thousands in our use case, but I would like it to be capable of handling at least ten times that for larger deployments.

I'm looking for best practices or general advice on how often to commit.

This is roughly what the code will look like:

import json
from os import path

import psycopg2

# json_dir and db_connection_params are defined elsewhere in the tool
with open(path.join(json_dir, 'base_data.json'), mode='r', encoding='utf-8-sig') as base_data_file:
    base_data = json.load(base_data_file)

    with psycopg2.connect(**db_connection_params) as db_conn:
        db_cursor = db_conn.cursor()

        # Import tags (TODO: See if this can be passed in as a single list of
        # all tags -- see the execute_values sketch below)
        for tag in base_data['tags']:
            db_cursor.execute("""
                INSERT INTO api_tag (name, description) VALUES (%s, %s);
                """,
                (tag['name'], tag['description'])
            )
        db_conn.commit() # <- commit or wait?

        # Import brands
        # for brand in base_data['brands']...

        # Import DHCPServers

        # Import Subnets/Addresses
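
Regarding the TODO in the tags loop: it looks like psycopg2.extras.execute_values can take the whole list in one call instead of one execute() per row. A minimal sketch, reusing db_cursor and base_data from above:

# Batch all tag rows into a single INSERT via execute_values (reuses db_cursor
# and base_data from the code above).
from psycopg2.extras import execute_values

tag_rows = [(tag['name'], tag['description']) for tag in base_data['tags']]
execute_values(
    db_cursor,
    "INSERT INTO api_tag (name, description) VALUES %s;",
    tag_rows,
    page_size=1000,  # rows folded into each generated statement
)
db_conn.commit()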


I don't expect a typical deployment to generate a great number of rows in the tags, brands, or DHCPServers sections, but Subnets/Addresses will generate a row for every subnet and a row for every IP address that exists on-site. For my deployment, which is likely the only place this will ever be used, that comes to ~10,000 rows.

My thought is to commit after each block (tags, brands, DHCPServers) and then after each subnet in base_data['subnets']; each subnet should generate no more than a couple thousand rows in most realistic environments (~250 for a typical 24-bit subnet).
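Roughly, that per-subnet commit would look like the following (the api_subnet/api_address table names and the 'cidr'/'addresses' keys are just stand-ins for the example, not the real schema):

# Rough sketch of committing once per subnet. The api_subnet/api_address table
# names and the 'cidr'/'addresses' keys are placeholders, not the real schema.
for subnet in base_data['subnets']:
    db_cursor.execute(
        "INSERT INTO api_subnet (cidr) VALUES (%s) RETURNING id;",
        (subnet['cidr'],)
    )
    subnet_id = db_cursor.fetchone()[0]

    for address in subnet['addresses']:
        db_cursor.execute(
            "INSERT INTO api_address (subnet_id, ip) VALUES (%s, %s);",
            (subnet_id, address)
        )

    db_conn.commit()  # one commit per subnet (~250 rows for a typical /24)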

Ideally, I would like to be able to wait to commit until all queries have succeeded. That would be the safest way to ensure I don't end up with a half-baked database after a query fails. However, I don't have a good enough low-level understanding of PostgreSQL or the library to know where all this data sits before a commit, how much can sit there, and how much can be committed at one time.
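If I went that all-or-nothing route, the shape would be something like this, with the import_* helpers (and the 'dhcp_servers' key) standing in as placeholders for the loops above:

# Sketch of committing only once, at the very end. The import_* helpers are
# placeholders for the insert loops above, not real functions.
try:
    import_tags(db_cursor, base_data['tags'])
    import_brands(db_cursor, base_data['brands'])
    import_dhcp_servers(db_cursor, base_data['dhcp_servers'])
    import_subnets(db_cursor, base_data['subnets'])
    db_conn.commit()    # all rows become visible together
except Exception:
    db_conn.rollback()  # nothing is persisted, so no half-baked database
    raise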

I would appreciate any input from anyone who has a better grip on PostgreSQL or psycopg2.

Just after posting this, I found the best practices section of the psycopg2 FAQ, which reads...
Quote:When should I save and re-use a connection as opposed to creating a new one as needed?
Creating a connection can be slow (think of SSL over TCP) so the best practice is to create a single connection and keep it open as long as required. It is also good practice to rollback or commit frequently (even after a single SELECT statement) to make sure the backend is never left “idle in transaction”. See also psycopg2.pool for lightweight connection pooling.

I would still appreciate input, but I think my solution is to commit frequently and add enough checks that the tool can continue seamlessly on successive runs against a half-built database.
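One way to handle those checks would be to make the inserts themselves idempotent, so a re-run simply skips rows that an earlier, partially committed run already wrote. A sketch for the tags, assuming api_tag.name has (or gets) a UNIQUE constraint:

# Idempotent insert: a re-run skips tags that already exist. Assumes a UNIQUE
# constraint on api_tag.name, which is an assumption on my part.
for tag in base_data['tags']:
    db_cursor.execute(
        """
        INSERT INTO api_tag (name, description) VALUES (%s, %s)
        ON CONFLICT (name) DO NOTHING;
        """,
        (tag['name'], tag['description'])
    )
db_conn.commit()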