Python Forum
Insert 10gb csv files into sql table via python
#1
Hi Team,

I want to insert CSV records into a SQL table via Python.
My CSV file is 10 GB in size.
I found this code on Google. Can you suggest a better way if you have done this before?



import pyodbc
import pandas as pd
import time

# Define the SQL Server login and password
sql_login = 'XXXX\\XXX-XXX-XXX'
sql_password = 'xxxxxxx'

# Connect to the SQL Server database
connection = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};SERVER=DESKTOP-GQK64O6;'
    f'DATABASE=Customer;UID={sql_login};PWD={sql_password};'
)
cursor = connection.cursor()

# Define the file path and chunk size
file_path = 'abc.csv'
chunk_size = 1000

# Start the timer
start_time = time.time()

# Create a pandas dataframe iterator to read the CSV file in chunks
csv_iterator = pd.read_csv(file_path, chunksize=chunk_size, sep="|")

# Loop through the dataframe iterator
for chunk in csv_iterator:
    # Get the field names for the chunk
    field_names = list(chunk.columns)

    # Concatenate the field names into a string
    field_names_concatenated = ",".join(field_names)

    # Create a list of values for the chunk
    values = chunk.values.tolist()

    # Create a string of "?" placeholders, one per column
    placeholders = ",".join(len(field_names) * ["?"])

    # Create the SQL statement
    sql = f"INSERT INTO employee ({field_names_concatenated}) VALUES ({placeholders})"

    # Execute the SQL statement for the chunk
    cursor.executemany(sql, values)
    connection.commit()

    # cursor.rowcount is unreliable after executemany(), so report the batch size
    print(len(values), "rows inserted")

# Calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")
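The same chunked executemany() pattern can be shown in a self-contained, runnable form. This sketch uses the stdlib sqlite3 module purely for illustration (the pyodbc flow against SQL Server is analogous, and pyodbc additionally offers `cursor.fast_executemany = True` to speed up large batches); the file contents here are a tiny stand-in, not the real 10 GB data:

```python
import csv
import sqlite3

# Tiny stand-in file so the sketch runs end to end;
# in practice this would be the real 10 GB CSV.
with open("abc.csv", "w", newline="") as f:
    f.write("id|name\n1|alice\n2|bob\n3|carol\n")

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (id TEXT, name TEXT)")

chunk_size = 2  # small on purpose; use thousands for real data
with open("abc.csv", newline="") as f:
    reader = csv.reader(f, delimiter="|")
    header = next(reader)
    sql = (f"INSERT INTO employee ({','.join(header)}) "
           f"VALUES ({','.join('?' * len(header))})")
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= chunk_size:
            cur.executemany(sql, batch)  # one round trip per chunk
            conn.commit()
            batch = []
    if batch:  # flush the final partial chunk
        cur.executemany(sql, batch)
        conn.commit()

print(cur.execute("SELECT COUNT(*) FROM employee").fetchone()[0])  # -> 3
```

Reading with the csv module instead of pandas keeps memory flat regardless of file size, since only one chunk of rows is ever held at a time.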
Reply
#2
You can load a CSV file from pandas directly into the database.
The CSV data may need to be 'normalized' first, see https://python-forum.io/thread-38263-pos...#pid161978
This uses an SQLite model, which is constructed like the example here: https://python-forum.io/thread-24127.html
search for 'Database Model'.

Once you have a model you can load a table something like this:

df = pd.read_csv(fn)
df.to_sql(tablename, con=self.Model.engine, if_exists='replace')
The code above was taken from the tutorial mentioned below.
A DB model can easily be constructed by hand, or, if you want to get fancy, the tutorial below shows how to automate everything from model generation to database load.

See the tutorial here if interested.
Reply
#3
(Apr-28-2023, 09:33 AM)mg24 Wrote: my csv file size is 10 gb,
For data this big, Dask or Polars is a better fit.
Dask DataFrame copies the pandas DataFrame API, so it will work much the same as pandas.

Example with timing.
# pip install "dask[complete]"

import time
from dask import dataframe as dd

start = time.time()
df = dd.read_csv('large.csv')
end = time.time()
print(f"Total Time: {(end-start)} sec")
Just by doing this, Dask already buys a lot: e.g. for a medium-size 230 MB .csv, Dask "reads" it in 0.01 sec while pandas reads it in 6.2 sec.
Dask utilizes multiple CPU cores by internally chunking the dataframe and processing the chunks in parallel.
Say you want to import 10 GB of data on a machine with e.g. 6 GB of RAM.
This can't be achieved with pandas, since the whole dataset doesn't fit into memory in a single shot (without chunking it up), but Dask can do it.
Instead of computing right away, Dask creates a graph of tasks that describes how to perform the work.
It's lazy computation, which means Dask's task scheduler builds the graph first and only computes it when requested.
Reply

