Feb-12-2023, 01:38 PM
I have written code to read a large time series CSV file (X million rows) using pandas read_csv() with chunking. That part of the code works as expected, but unfortunately the chunking itself turns out to be the source of my problem.
The problem is that I want to resample the data after it has been read, but because the data is read in chunks, the resampling works on a per-chunk basis rather than on the whole file, and the chunk boundaries can fall in different places each time.
For instance, I have a file that contains minute data, and as there are 1440 minutes in a day, if I set the chunk size to 1440 then in a perfect world each chunk would contain data from 00:00 to 23:59. However, if any minutes are missing, reading 1440 rows ends up pulling in all of one day's data plus some data from the following day, and this causes issues with the resampled output.
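Here is a stripped-down sketch of what I'm currently doing (the file name and the price/volume columns are just placeholders for my real data):

```python
import pandas as pd

# Read the big file in chunks of 1440 rows (roughly one day of minute data,
# but only if no minutes are missing).
chunks = pd.read_csv(
    "minute_data.csv",
    parse_dates=["timestamp"],
    index_col="timestamp",
    chunksize=1440,
)

for chunk in chunks:
    # The resample only sees the rows in this chunk, so when a chunk spills
    # into the next day the daily figures at the boundary come out wrong.
    daily = chunk.resample("D").agg({"price": "mean", "volume": "sum"})
    print(daily)
```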
Is there a way to get pandas (or perhaps another library?) to read data one day/week/month at a time?
The only option I can currently think of is to split the large file into smaller day/week/month files, and then process those files without chunking.
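For that splitting idea, I was imagining something along these lines (the 100,000-row chunk size and the file names are just placeholders):

```python
import os
import pandas as pd

# Split the big file into one CSV per calendar day; each small file can then
# be read and resampled whole, without chunking.
chunks = pd.read_csv(
    "minute_data.csv",
    parse_dates=["timestamp"],
    index_col="timestamp",
    chunksize=100_000,
)

for chunk in chunks:
    for day, frame in chunk.groupby(chunk.index.date):
        out = f"day_{day}.csv"
        # Append so that a day split across two chunks still ends up in one file.
        frame.to_csv(out, mode="a", header=not os.path.exists(out))
```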
I'm hoping there is a better solution than the one I have thought of?