Python Forum
Chunking and Sorting a large file
#1
Hi Everyone,

I am trying to read and sort a large text file (10 GB) in chunks. The aim is to sort the data by column 2. The code below reads the (huge) file, but I am struggling to sort it. Can anyone help, please? I can sort the individual chunks (via argsort), but I don't know how to merge them into the final sorted Nx4 array (which I plan to store in an HDF5 file).


import numpy as np
import pandas as pd

filename = "file.txt"
with open(filename) as f:
    nrows = sum(1 for _ in f)  # number of rows in the file
ncols = 4  # number of columns

idata = np.empty((nrows, ncols), dtype=np.float32)  # preallocated array for all rows
i = 0
# read_csv with chunksize returns an iterator of DataFrames, each up to 10,000 x 4
chunks = pd.read_csv(filename, chunksize=10000,
                     names=['ch', 'tmstp', 'lt', 'rt'])
for chunk in chunks:
    m = len(chunk)  # 10,000, except possibly for the last chunk
    idata[i:i+m, :] = chunk.to_numpy()  # copy the chunk into idata
    i += m
print(idata)  # contains all data read from file.txt
#2
You need a sort routine that works within the available memory.
Take a look at merge sort: https://en.wikipedia.org/wiki/Merge_sort
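
To make the merge step concrete, here is a rough sketch of an external merge sort, not a definitive implementation. It assumes the file is a headerless, comma-separated, all-numeric CSV and that "column 2" means index 1 (the 'tmstp' column in the question). The function name external_sort_csv and its parameters are made up for illustration. Each chunk is sorted in memory and written to a temporary file, then heapq.merge combines the sorted files while holding only one row per file in memory.

```python
import csv
import heapq
import os
import tempfile

import pandas as pd


def external_sort_csv(src, dst, sort_col=1, chunksize=10000):
    """Sort a headerless numeric CSV by one column without loading it whole.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    run_paths = []
    # Pass 1: sort each chunk in memory and write it out as a sorted "run".
    for chunk in pd.read_csv(src, header=None, chunksize=chunksize):
        chunk = chunk.sort_values(by=sort_col, kind="mergesort")
        fd, path = tempfile.mkstemp(suffix=".csv")
        os.close(fd)
        chunk.to_csv(path, header=False, index=False)
        run_paths.append(path)

    # Pass 2: k-way merge of the runs, reading them row by row.
    files = [open(p, newline="") for p in run_paths]
    try:
        readers = [csv.reader(f) for f in files]
        merged = heapq.merge(*readers, key=lambda row: float(row[sort_col]))
        with open(dst, "w", newline="") as out:
            csv.writer(out).writerows(merged)
    finally:
        # Close and delete the temporary run files.
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

Call it as external_sort_csv("file.txt", "sorted.txt"). The merge keeps every run file open at once, so for a 10 GB input you would probably want a much larger chunksize than 10,000 to stay under the operating system's open-file limit. The sorted output can then be read back in chunks and appended to the HDF5 file.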