Chunking and Sorting a large file (Python Forum, Data Science: https://python-forum.io/thread-28665.html)
Chunking and Sorting a large file - Robotguy - Jul-28-2020

Hi Everyone,

I am trying to read and sort a large text file (10 GB) in chunks. The aim is to sort the data by column 2. The following code reads the (huge) data, but I am struggling to sort it. Can anyone help, please? I can sort the individual chunks (via argsort), but I don't know how to merge everything and output the final Nx4 sorted array (which I plan to store in an HDF5 file).

    import numpy as np
    import pandas as pd

    filename = "file.txt"
    with open(filename) as f:
        nrows = sum(1 for line in f)  # number of rows in the file
    ncols = 4  # number of columns
    idata = np.empty((nrows, ncols), dtype=np.float32)  # array that receives the data
    i = 0
    # chunks iterates over the whole file; each chunk is a DataFrame of up to 10,000 x 4
    chunks = pd.read_csv(filename, chunksize=10000, names=['ch', 'tmstp', 'lt', 'rt'])
    for chunk in chunks:
        m, _ = chunk.shape  # m = 10,000, except possibly for the last chunk
        idata[i:i+m, :] = chunk.to_numpy()  # copy the chunk DataFrame into idata
        i += m
    print(idata)  # contains all data read from file.txt

RE: Chunking and Sorting a large file - Larz60+ - Jul-29-2020

You need a sort routine that works within the available memory. Take a look at merge sort: https://en.wikipedia.org/wiki/Merge_sort
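
To make the merge sort suggestion concrete, here is a minimal sketch of an external merge sort using only the standard library. It is one possible approach, not a tested solution for the original 10 GB file. The function name sort_large_csv and its parameters are made up for illustration, and it assumes the input is a headerless, comma-separated file with a numeric sort column. Each chunk is sorted in memory and written to a temporary "run" file, then the runs are merged lazily with heapq.merge.

```python
import csv
import heapq
import itertools
import os
import tempfile


def sort_large_csv(src, dst, key_col=1, chunk_rows=10_000):
    """External merge sort of a headerless numeric CSV by one column.

    Hypothetical helper for illustration; key_col=1 is "column 2".
    """
    sort_key = lambda row: float(row[key_col])
    run_paths = []

    # Pass 1: read chunk_rows lines at a time, sort each chunk in memory,
    # and write it out as a sorted temporary run file.
    with open(src, newline="") as f:
        reader = csv.reader(f)
        while True:
            rows = list(itertools.islice(reader, chunk_rows))
            if not rows:
                break
            rows.sort(key=sort_key)
            fd, path = tempfile.mkstemp(suffix=".csv")
            with os.fdopen(fd, "w", newline="") as run:
                csv.writer(run).writerows(rows)
            run_paths.append(path)

    # Pass 2: k-way merge of the sorted runs. heapq.merge is lazy, so only
    # about one row per run is held in memory at any time.
    files = [open(p, newline="") for p in run_paths]
    try:
        readers = [csv.reader(fh) for fh in files]
        merged = heapq.merge(*readers, key=sort_key)
        with open(dst, "w", newline="") as out:
            csv.writer(out).writerows(merged)
    finally:
        for fh in files:
            fh.close()
        for p in run_paths:
            os.remove(p)
```

A call such as sort_large_csv("file.txt", "sorted.txt", key_col=1, chunk_rows=1_000_000) would sort by the second column. Every run file stays open during the merge, so chunk_rows should be large enough to keep the number of runs below the operating system's open-file limit (or the runs could be merged in several passes). Instead of writing CSV, the merged rows could presumably be appended in batches to a resizable HDF5 dataset.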