please help

natalie321 · (This post was last modified: Apr-10-2024, 03:52 AM by natalie321.)

I'm working on a Python script to parse a large CSV file and extract specific data, but I'm encountering performance issues. Here's a simplified version of my code:

        
              import csv
 
def extract_data(csv_file):
    with open(csv_file, 'r') as file:
        reader = csv.reader(file)
        next(reader)  # Skip header row
        for row in reader:
            # Extracting data from specific columns
            data = row[1], row[3], row[5]
            process_data(data)
 
def process_data(data):
    # Some processing on the extracted data
    print(data)
 
csv_file = 'large_file.csv'
extract_data(csv_file)

The problem is that large_file.csv contains millions of rows, and my script is taking too long to process. I've tried optimizing the code, but it's still not efficient enough. Can someone suggest more efficient ways to parse and extract data from such a large CSV file in Python? Any help would be appreciated!
build now gg

**Larz60+** · Apr-10-2024, 07:52 AM

you can read the data in as chunks, which will save a lot of time.
Also, since you want to diaplay the data, using Pandas may save significant time.
Perhaps reconsider displaying the entire file as no one will be able to read millions of lines anyway.
what is the ultimate goal?

**Gribouillis** · (This post was last modified: Apr-10-2024, 08:58 AM by Gribouillis.)

I'm not a pandas user but pandas.read_csv() seems to have a chunksize argument which allows you to read portions of the entire file.

paul18fr · Apr-10-2024, 09:22 AM

Pandas is the tool best that can deal with a lot of configurations; nonetheless in some cases, other "light" tools can be used (Numpy for instance).

Do you have the same number of columns in all rows?
Is always the same type (float / integer)?

=> if so, you can have a look to np.loadtxt to directly retrieve an array (basic example here after)

        
              M = np.loadtxt('data/sample.csv', delimiter=',', dtype='int64')

**deanhystad** · Apr-10-2024, 09:34 AM

I don't know if chunks are important or not. I think you would see significant speed gains if you used pandas to load your csv file and do your processing.

***snippsat*** · (This post was last modified: Apr-10-2024, 10:54 AM by snippsat.)

Can also use Polars to speed things up.
So for 1-GB file .csv Pandas use ca 13.5 seconds versus 350 milliseconds in Polars.

In pandas your could be like this.
Then if this is slow use Polars or and other opinion is Dask

        
              import pandas as pd
 
def extract_data(csv_file):
    # Use chunksize to read the file in chunks
    chunksize = 10000  
    for chunk in pd.read_csv(csv_file, chunksize=chunksize):  
        data = chunk.iloc[:, [1, 3, 5]].values.tolist()
        process_data(data)
 
def process_data(data):
    for row in data:
        print(row)
 
csv_file = 'large_file.csv'
extract_data(csv_file)

Pedroski55 · Apr-11-2024, 06:05 AM

Quote:I'm working on a Python script to parse a large CSV file and extract specific data,

Got a sample of your csv? What do you want to extract?

Is the first line column headers?

Look up Data Pipelines:

Quote:Creating Data Pipelines With Generators

Data pipelines allow you to string together code to process large datasets or streams of data without maxing out your machine’s memory. Imagine that you have a large CSV file:

kakocer · (This post was last modified: Jan-20-2025, 08:21 PM by Gribouillis.)

The code reads the file line by line, which is memory-efficient for large files. However, the data is printed directly, which could be slow for a very large file. If further processing is needed, you may want to accumulate the results in a list or database.
Link Removed

Gribouillis write Jan-20-2025, 08:21 PM:
Clickbait link removed. Please read What to NOT include in a post

please help

User Panel Messages

Announcements