Python Forum

Full Version: Merging sorted dataframes using Pandas
I have a large (Nx4, >10 GB) array that I need to sort on column 2.

I am reading the data in chunks and sorting each chunk with Pandas, but I cannot work out how to combine the sorted chunks into one final Nx4 array that is sorted on column 2. I also want the process to be as fast as possible. Here is what I have tried so far:

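import pandas as pd

# ifile[0] is the path to the input CSV, which has no header row
chunks = pd.read_csv(ifile[0], chunksize=50000, skiprows=0,
                     names=['col-1', 'col-2', 'col-3', 'col-4'])

for df in chunks:
    # stable sort within this chunk only; the chunks are not merged yet
    df = df.sort_values(by='col-2', kind='mergesort')
    print(df)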
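
For reference, one common way to combine individually sorted chunks is an external merge sort: write each sorted chunk to its own temporary file, then do a k-way merge over those files with `heapq.merge`, which only keeps one row per file in memory. The sketch below is a minimal illustration, not a tuned solution. The function name `external_sort`, the `src`/`dst` paths and the temporary file names are hypothetical, and it assumes the sort column is numeric with no missing values.

```python
import csv
import heapq
import os
import tempfile

import pandas as pd

COLUMNS = ['col-1', 'col-2', 'col-3', 'col-4']


def external_sort(src, dst, chunksize=50000, key_col=1):
    """Sort the headerless CSV `src` on column index `key_col` into `dst`."""
    run_paths = []
    with tempfile.TemporaryDirectory() as tmpdir:
        # Pass 1: sort each chunk in memory and write it out as a sorted run.
        reader = pd.read_csv(src, chunksize=chunksize, names=COLUMNS)
        for i, df in enumerate(reader):
            path = os.path.join(tmpdir, f'run_{i}.csv')
            df = df.sort_values(by=COLUMNS[key_col], kind='mergesort')
            df.to_csv(path, header=False, index=False)
            run_paths.append(path)

        # Pass 2: k-way merge of the runs, streaming rows straight to `dst`.
        files = [open(p, newline='') for p in run_paths]
        try:
            runs = [csv.reader(f) for f in files]
            with open(dst, 'w', newline='') as out:
                # Assumption: the key is numeric; drop float() for string keys.
                merged = heapq.merge(*runs, key=lambda row: float(row[key_col]))
                csv.writer(out).writerows(merged)
        finally:
            for f in files:
                f.close()
```

Each run keeps a file handle open during the merge, so with a very large input it is probably better to raise `chunksize` as far as memory allows, which keeps the number of runs below the operating system's open-file limit.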

Pandas may not be the right tool for this. Personally, I would use SQL: load the data into a table, run a SELECT query with ORDER BY on the second column, and write out the result set.

Just an idea.
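
A rough sketch of that idea using SQLite through Python's standard `sqlite3` module might look like the following. The function name `sort_with_sqlite`, the table name `data`, the column names and the scratch database path are all made up for illustration. Declaring `col2` as NUMERIC is an assumption so that the ordering is numeric rather than textual.

```python
import csv
import sqlite3


def sort_with_sqlite(src, dst, db_path='sort_scratch.db'):
    """Load the headerless 4-column CSV `src` and write it to `dst` ordered by column 2."""
    con = sqlite3.connect(db_path)
    try:
        con.execute('DROP TABLE IF EXISTS data')
        # Assumption: col2 holds numbers, so NUMERIC affinity gives numeric ordering.
        con.execute('CREATE TABLE data (col1, col2 NUMERIC, col3, col4)')

        # Stream rows from the CSV into the table without loading the whole file.
        with open(src, newline='') as f:
            con.executemany('INSERT INTO data VALUES (?, ?, ?, ?)', csv.reader(f))
        con.commit()

        # Let the database do the sort and stream the result set back out.
        query = 'SELECT col1, col2, col3, col4 FROM data ORDER BY col2'
        with open(dst, 'w', newline='') as out:
            csv.writer(out).writerows(con.execute(query))
    finally:
        con.close()
```

The database sorts on disk when the data does not fit in memory, so the Python side only ever handles one row at a time. Whether this is faster than a hand-written merge depends on the data and the disk, so it is worth timing both on a sample.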