Python Forum
Chunking and Sorting a large file - Printable Version




Chunking and Sorting a large file - Robotguy - Jul-28-2020

Hi Everyone,

I am trying to read and sort a large text file (~10 GB) in chunks. The aim is to sort the data by column 2. The code below reads the (huge) file, but I am struggling to sort it. Can anyone help, please? I can sort the individual chunks (via argsort), but I don't know how to merge everything into the final Nx4 sorted array (which I plan to store in an HDF5 file).


import numpy as np
import pandas as pd

filename = "file.txt"

# count the rows first so the full array can be preallocated
with open(filename) as f:
    nrows = sum(1 for line in f)
ncols = 4  # number of columns

idata = np.empty((nrows, ncols), dtype=np.float32)  # preallocated array for the data
i = 0
chunks = pd.read_csv(filename, chunksize=10000,
                     names=['ch', 'tmstp', 'lt', 'rt'])
# chunks is an iterator over the whole file; each chunk is up to 10,000 x 4
for chunk in chunks:
    m, _ = chunk.shape  # m <= 10,000 (the last chunk may be smaller)
    idata[i:i+m, :] = chunk  # copy the chunk's values into the numpy array
    i += m
print(idata)  # idata now holds every row read from file.txt
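
(For reference, a minimal sketch of the final sort and HDF5 store, assuming idata actually fits in memory as the preallocation above already presumes; h5py is assumed as the HDF5 library, and "sorted.h5" and "data" are placeholder names:)

import h5py

order = np.argsort(idata[:, 1], kind='stable')  # row order by column 2 (tmstp)
sorted_data = idata[order]                      # final N x 4 sorted array

# write the sorted array to an HDF5 dataset
with h5py.File("sorted.h5", "w") as f:
    f.create_dataset("data", data=sorted_data)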



RE: Chunking and Sorting a large file - Larz60+ - Jul-29-2020

You need a sort routine that works within available memory, i.e. an external merge sort: sort each chunk in memory, write the sorted runs to disk, then merge the runs.
Take a look at: https://en.wikipedia.org/wiki/Merge_sort
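
A minimal sketch of that approach, assuming the same four-column layout as above and that "column 2" means the tmstp field ("sorted.txt" and the temporary run files are placeholder names). Pass 1 sorts each chunk in memory and writes it out as a sorted run; pass 2 lazily merges the runs with heapq.merge, which holds only one row per run in memory at a time. Note the larger chunksize: with 10,000-row chunks a 10 GB file would produce far too many run files, and operating systems limit the number of simultaneously open files.

import heapq
import os
import tempfile

import pandas as pd

filename = "file.txt"
names = ['ch', 'tmstp', 'lt', 'rt']

# Pass 1: sort each chunk in memory and write it out as a sorted "run"
run_files = []
for chunk in pd.read_csv(filename, chunksize=1_000_000, names=names):
    chunk = chunk.sort_values('tmstp')            # sort this run by column 2
    fd, path = tempfile.mkstemp(suffix='.csv')
    os.close(fd)
    chunk.to_csv(path, index=False, header=False)
    run_files.append(path)

# Pass 2: k-way merge of the sorted runs on the tmstp column
def rows(path):
    with open(path) as f:
        for line in f:
            yield [float(x) for x in line.split(',')]

with open("sorted.txt", "w") as out:
    merged = heapq.merge(*(rows(p) for p in run_files),
                         key=lambda r: r[1])      # merge key = column 2
    for r in merged:
        out.write(','.join(map(str, r)) + '\n')

for p in run_files:
    os.remove(p)                                  # remove the temporary runs

Writing the merged stream straight into a resizable HDF5 dataset instead of sorted.txt would avoid an extra text-to-HDF5 pass later.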