Aug-12-2020, 06:01 PM
I have a large (Nx4, >10GB) array that I need to sort based on col.2.
I am reading the data in chunks and sorting each chunk with Pandas, but I can't work out how to combine the sorted chunks into a final Nx4 array that is sorted on col.2. I would also like this to be as fast as possible. Here is what I have tried so far:
import pandas as pd

chunks = pd.read_csv(ifile[0], chunksize=50000, skiprows=0,
                     names=['col-1', 'col-2', 'col-3', 'col-4'])
for df in chunks:
    df = df.sort_values(by='col-2', kind='mergesort')  # sorted chunks
    print(df)
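One possible way to combine the sorted chunks is an external merge sort: write each sorted chunk to its own temporary file, then stream a k-way merge over those files with `heapq.merge`, so only one row per chunk is in memory at a time. The sketch below is a rough illustration, not a tuned solution. It assumes the input is a headerless CSV, that col-2 is numeric with no missing values, and that the output should be another CSV. The function name `external_sort` and the `COLS` constant are made up for the example.

```python
import csv
import heapq
import os
import tempfile

import pandas as pd

COLS = ["col-1", "col-2", "col-3", "col-4"]  # same names as in the post


def external_sort(in_path, out_path, key="col-2", chunksize=50000):
    """Sort a headerless 4-column CSV on `key` without loading it all at once.

    Hypothetical helper: assumes `key` is numeric and has no missing values.
    """
    tmp_dir = tempfile.mkdtemp()
    run_paths = []

    # Phase 1: sort each chunk in memory and write it out as a sorted "run".
    reader = pd.read_csv(in_path, chunksize=chunksize, names=COLS)
    for i, df in enumerate(reader):
        run_path = os.path.join(tmp_dir, f"run_{i}.csv")
        df.sort_values(by=key, kind="mergesort").to_csv(
            run_path, index=False, header=False
        )
        run_paths.append(run_path)

    # Phase 2: k-way merge of the runs, streaming row by row.
    key_idx = COLS.index(key)
    files = [open(p, newline="") for p in run_paths]
    try:
        runs = [csv.reader(f) for f in files]
        merged = heapq.merge(*runs, key=lambda row: float(row[key_idx]))
        with open(out_path, "w", newline="") as out:
            csv.writer(out).writerows(merged)
    finally:
        # Close and delete the temporary runs even if the merge fails.
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
        os.rmdir(tmp_dir)
```

It would be called as, for example, `external_sort(ifile[0], "sorted.csv")`. Note that every run file is open at the same time during the merge, so with a >10GB input a `chunksize` of 50000 could produce thousands of files and hit the operating system's open-file limit. A much larger `chunksize` (as large as memory comfortably allows) keeps the number of runs small and is likely faster as well.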