Pandas read csv file in 'date/time' chunks
I have written code to read a large time series CSV file (X million rows) using pandas read_csv() with chunking. That part of the code works as expected, but the chunking itself is what causes the problem.

The problem is that I want to resample the data once it has been read, and because the data is read in chunks, the resampling works on a per-chunk basis rather than on the whole file, so the boundaries of the chunked data can land in different places each time.

For instance, I have a file that contains minute data, and as there are 1440 minutes in a day, if I set the chunk size to 1440, in a perfect world each chunk would contain data from 00:00 to 23:59. However, if any minutes are missing, reading 1440 rows ends up reading all of the data for one day plus some data from the following day, and this is causing issues with the resampled data.

Is there a way to get pandas (or perhaps another library?), to read data one day/week/month at a time?

The only option I can currently think of is to split the large file into smaller day/week/month files, and then process those files without chunking.

I'm hoping there is a better solution than the one I have thought of.
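One approach I have been sketching (but have not settled on) is to buffer each chunk and only resample the days that are definitely complete, carrying the trailing, possibly incomplete day over into the next chunk. Something roughly like this, where the "timestamp" and "value" column names and the file name are just placeholders for my real data:

import pandas as pd

def resample_daily_in_chunks(path, chunksize=1440 * 7):
    """Resample a large minute-level CSV to daily means, chunk by chunk,
    only resampling a day once all of its rows have been read."""
    carry = None            # trailing, possibly incomplete day from the previous chunk
    results = []

    for chunk in pd.read_csv(path, parse_dates=["timestamp"], chunksize=chunksize):
        if carry is not None:
            chunk = pd.concat([carry, chunk], ignore_index=True)

        days = chunk["timestamp"].dt.normalize()
        last_day = days.max()

        # Hold back the last day seen so far; it may continue in the next chunk.
        complete = chunk[days < last_day]
        carry = chunk[days == last_day]

        if not complete.empty:
            results.append(
                complete.set_index("timestamp").resample("1D")["value"].mean()
            )

    # Whatever is still held back after the final chunk is the last day of the file.
    if carry is not None and not carry.empty:
        results.append(carry.set_index("timestamp").resample("1D")["value"].mean())

    return pd.concat(results)

daily_means = resample_daily_in_chunks("minute_data.csv")
print(daily_means.head())

It works in my quick tests, but it feels like I am reinventing something that a library probably already does properly.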
I'd give a thought to a Pandas alternative. There are several, and when you run into a pandas limitation (or speed issue) take a look.

Polars - listen to the recent Talk Python To Me podcast for some details (episode 402); a rough sketch is below this list

Vaex - supports up to a billion rows

Dask

PySpark - Python wrapper for Spark, which is written in Scala; supports large datasets and distributed computing.
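Picking up the Polars suggestion above, a daily grouping could look roughly like this. It is only a sketch with made-up column and file names, and depending on your Polars version the method may be spelled groupby_dynamic rather than group_by_dynamic:

import polars as pl

daily = (
    pl.scan_csv("minute_data.csv", try_parse_dates=True)  # lazy scan: streamed, not loaded whole
    .sort("timestamp")                                     # the time column must be sorted
    .group_by_dynamic("timestamp", every="1d")             # one group per calendar day
    .agg(pl.col("value").mean().alias("daily_mean"))
    .collect()
)
print(daily)

Because the scan is lazy, Polars can work through the file without holding it all in memory, so the chunk-boundary problem goes away.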
Put your data in a "real" database and work from there.
(Feb-12-2023, 06:42 PM)jefsummers Wrote: I'd give a thought to a Pandas alternative. There are several, and when you run into a pandas limitation (or speed issue) take a look. [...]

Thanks for the feedback.

I'd already started coding up a version that puts the data into a database and reads it back from there, but I will definitely look into your suggestions in the future to see if they can do what I want.
(Feb-12-2023, 08:39 PM)buran Wrote: Put your data in a "real" database and work from there.

I'd already started coding this up when I got your reply (great minds think alike). The reason I didn't go down this route originally is that I have a time constraint, i.e. I can't run processes that take too long. However, I have now factored this in, so hopefully all will be well once I have completed the coding.
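For anyone who finds this thread later, the database route I am coding up looks roughly like the sketch below. It assumes SQLite and placeholder column, table, and file names: the CSV is loaded once in chunks, and each day is then pulled back with a date-range query and resampled on its own.

import sqlite3
import pandas as pd

DB_PATH = "timeseries.db"

# One-off load: stream the big CSV into SQLite in chunks.
with sqlite3.connect(DB_PATH) as conn:
    for chunk in pd.read_csv("minute_data.csv", parse_dates=["timestamp"],
                             chunksize=100_000):
        chunk.to_sql("minute_data", conn, if_exists="append", index=False)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_ts ON minute_data (timestamp)")

# Later, per run: pull back exactly one day and resample just that slice.
with sqlite3.connect(DB_PATH) as conn:
    day = pd.read_sql(
        "SELECT * FROM minute_data WHERE timestamp >= ? AND timestamp < ?",
        conn,
        params=("2023-02-01", "2023-02-02"),
        parse_dates=["timestamp"],
    )

# Resample that single day at whatever frequency is needed (hourly here).
hourly = day.set_index("timestamp").resample("1h")["value"].mean()
print(hourly.head())

The index on the timestamp column is what keeps the per-day queries fast enough to fit my time constraint.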