Huge CSV Reading and Sorting for comparison
Hello,
I have a requirement where I have two sets of CSV files (unsorted).
The CSV files are huge (more than 10 million records and more than 50 columns each).

I need to compare these files and highlight differences such as:
1. Rows that do not match
2. Values that do not match within those rows

I have tried filecmp. However, it works on sorted data only.
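
For context, filecmp.cmp with shallow=False compares the two files byte for byte, so identical rows in a different order make the files compare as unequal (the file names here are placeholders):
Code:
import filecmp

# Byte-for-byte comparison: the same rows in a different order still
# yield False, which is why unsorted files defeat filecmp.
print(filecmp.cmp("a.csv", "b.csv", shallow=False))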

So could you please suggest a way in Python to achieve the above requirement efficiently?

Thanks in Advance
Quote: I have tried filecmp. However, it works on sorted data only.
How do you expect to compare the files if they are not sorted?
The files are so big that reading and sorting them takes a lot of time.
Also, there are around 300 such files, each with a different structure (columns and rows).
So could you please suggest how to sort such files efficiently, so that filecmp can be used?

Thanks,
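
One standard answer to the sorting question above is an external merge sort: sort chunks that fit in memory, spill each sorted chunk to a temporary file, then stream a k-way merge over the chunks with heapq.merge. A minimal sketch, not from the thread itself, assuming plain CSVs with a header row, no embedded newlines in fields, and a hypothetical key_cols parameter naming the columns to sort on:
Code:
import csv
import heapq
import itertools
import tempfile

def external_sort_csv(path, out_path, key_cols=(0,), chunk_rows=1_000_000):
    # Sort a CSV too big for memory: sort it chunk by chunk, spill the
    # sorted chunks to temp files, then merge the chunks in key order.
    keyfunc = lambda row: tuple(row[i] for i in key_cols)
    chunk_readers = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        while True:
            rows = list(itertools.islice(reader, chunk_rows))
            if not rows:
                break
            rows.sort(key=keyfunc)
            tmp = tempfile.TemporaryFile("w+", newline="")
            csv.writer(tmp).writerows(rows)
            tmp.seek(0)
            chunk_readers.append(csv.reader(tmp))
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            # heapq.merge streams rows in sorted order without loading
            # all chunks into memory at once.
            writer.writerows(heapq.merge(*chunk_readers, key=keyfunc))

After sorting both files of a pair the same way, filecmp (or a plain line-by-line diff) can be used on the results.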
Another way that I can see is to make a pass through each file and collect all of the unique keys into a table, along with each record's disk address (using 'tell'). Then sort the table on the keys (you can create a static hash on the key set if the index is large). With the indexes sorted, use the file addresses stored in the table to 'seek' to the records with the same key in each file and compare them, as sketched below.
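
A minimal sketch of that idea, using a plain dict as the key-to-offset table (the sorted table or static hash described above serves the same role). The key_cols parameter is hypothetical, the key is assumed to uniquely identify a row, and the readline-based pass assumes no embedded newlines inside quoted fields:
Code:
import csv

def build_offset_index(path, key_cols=(0,)):
    # One pass per file: map each row's key to the byte offset where
    # the row starts (via tell), so the row can be re-read later.
    index = {}
    with open(path, newline="") as f:
        f.readline()                      # skip the header row
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            row = next(csv.reader([line]))
            index[tuple(row[i] for i in key_cols)] = pos
    return index

def row_at(f, pos):
    f.seek(pos)                           # jump straight to the record
    return next(csv.reader([f.readline()]))

def compare_csvs(path_a, path_b, key_cols=(0,)):
    idx_a = build_offset_index(path_a, key_cols)
    idx_b = build_offset_index(path_b, key_cols)
    for key in idx_a.keys() - idx_b.keys():
        print("only in A:", key)
    for key in idx_b.keys() - idx_a.keys():
        print("only in B:", key)
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        for key in idx_a.keys() & idx_b.keys():
            row_a = row_at(fa, idx_a[key])
            row_b = row_at(fb, idx_b[key])
            for col, (va, vb) in enumerate(zip(row_a, row_b)):
                if va != vb:
                    print(f"{key} column {col}: {va!r} != {vb!r}")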