 Chunking and Sorting a large file
#1
Hi Everyone,

I am trying to read and sort a large text file (10 GB) in chunks. The aim is to sort the data on column 2. The following reads the (huge) data, but I am struggling to sort it. Can anyone help, please? I can sort the individual chunks (via argsort), but I don't know how to merge everything into the final Nx4 sorted array (which I plan to store in an HDF5 file).


import numpy as np
import pandas as pd

filename = "file.txt"
with open(filename) as f:
    nrows = sum(1 for line in f)  # number of rows in the file
ncols = 4  # number of columns

idata = np.empty((nrows, ncols), dtype=np.float32)  # array to receive the data
i = 0
chunks = pd.read_csv(filename, chunksize=10000,
                     names=['ch', 'tmstp', 'lt', 'rt'])
# chunks iterates over the whole file; each chunk is a (10,000 x 4) DataFrame
for chunk in chunks:
    m, _ = chunk.shape  # m = 10,000 (smaller for the last chunk)
    idata[i:i+m, :] = chunk.to_numpy()  # copy the chunk into idata
    i += m
print(idata)  # now contains all data read from file.txt
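
For reference, a minimal sketch of the argsort step mentioned above. It sorts the whole array at once, which only works if idata actually fits in memory:

order = idata[:, 1].argsort(kind="stable")  # indices that sort column 2 ('tmstp')
idata_sorted = idata[order]                 # the Nx4 array, ordered by column 2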
#2
You need an external sort routine: one that sorts chunk-sized runs within the available memory and then merges them.
Take a look at: https://en.wikipedia.org/wiki/Merge_sort
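
A minimal sketch of that idea (an external merge sort, untested, assuming the four-column layout from post #1): sort each chunk in memory, spill it to its own temporary file, then k-way merge the spill files with heapq.merge, which holds only one row per file in memory. In practice use a chunksize much larger than 10,000 (e.g. a million rows), so the number of spill files stays small.

import heapq
import os
import tempfile

import pandas as pd

filename = "file.txt"
names = ['ch', 'tmstp', 'lt', 'rt']

# 1) sort each chunk in memory (equivalent to argsort on column 2)
#    and spill it to its own temporary CSV file
run_paths = []
for chunk in pd.read_csv(filename, chunksize=1_000_000, names=names):
    chunk = chunk.sort_values('tmstp')
    fd, path = tempfile.mkstemp(suffix='.csv')
    os.close(fd)
    chunk.to_csv(path, index=False, header=False)
    run_paths.append(path)

# 2) k-way merge the sorted runs; heapq.merge consumes them lazily,
#    keeping only one row per run in memory at a time
def rows(path):
    with open(path) as f:
        for line in f:
            yield [float(x) for x in line.split(',')]

with open('sorted.csv', 'w') as out:
    merged = heapq.merge(*(rows(p) for p in run_paths),
                         key=lambda r: r[1])  # r[1] = 'tmstp', column 2
    for row in merged:
        out.write(','.join(map(str, row)) + '\n')

for p in run_paths:
    os.remove(p)

The sorted output can then be read back, again in chunks, and appended block by block into the HDF5 file.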