Python Forum
Huge CSV Reading and Sorting for comparison - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Huge CSV Reading and Sorting for comparison (/thread-28755.html)



Huge CSV Reading and Sorting for comparison - akshaynimkar - Aug-02-2020

Hello,
I have a requirement where I have two sets of CSV files (unsorted).
The CSVs are huge (>10 million records and >50 columns).

I need to compare these files and highlight differences such as:
1. Rows that do not match
2. Values that do not match within those rows

I have tried filecmp. However, it works on sorted data only.

So could you please suggest an efficient way to achieve the above requirement in Python.

Thanks in Advance


RE: Huge CSV Reading and Sorting for comparison - Larz60+ - Aug-02-2020

Quote:I have tried filecmp. However, it works on sorted data only.
How do you expect to compare the files if they are not sorted?


RE: Huge CSV Reading and Sorting for comparison - akshaynimkar - Aug-04-2020

The files are so big that reading and sorting them takes a lot of time.
Also, there are around 300 such files with different structures (columns and rows).
So could you please suggest how to sort such files efficiently, so that filecmp can then be used?

Thanks,


RE: Huge CSV Reading and Sorting for comparison - Larz60+ - Aug-04-2020

Another way that I can see: make a pass through each file and collect all of the unique keys into a table, along with their disk addresses (using 'tell'). Then sort the table on the keys (you can create a static hash of the key set if the index is large). With the indexes sorted, compare the file records: use the file address stored in the table to 'seek' to the records with the same key in each file and compare them.
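A minimal sketch of that approach, with some assumptions on my part: the key is unique and sits in the first column, the files are UTF-8 with a header row, and the key→offset tables fit in memory (for files where they don't, you would need the static-hash / external-sort idea mentioned above). The function names (`build_index`, `compare_files`) are just illustrative:

```python
import csv


def build_index(path, key_col=0):
    """Map each row's key to its byte offset in the file (header skipped)."""
    index = {}
    with open(path, "rb") as f:
        f.readline()  # skip the header row
        offset = f.tell()
        line = f.readline()
        while line:
            row = next(csv.reader([line.decode("utf-8")]))
            index[row[key_col]] = offset
            offset = f.tell()
            line = f.readline()
    return index


def read_row(f, offset):
    """Seek to a recorded offset and parse that single CSV row."""
    f.seek(offset)
    return next(csv.reader([f.readline().decode("utf-8")]))


def compare_files(path_a, path_b, key_col=0):
    """Report keys unique to each file, and column-level diffs for shared keys."""
    idx_a = build_index(path_a, key_col)
    idx_b = build_index(path_b, key_col)
    diffs = {
        "only_a": sorted(idx_a.keys() - idx_b.keys()),
        "only_b": sorted(idx_b.keys() - idx_a.keys()),
        "changed": {},
    }
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        for key in sorted(idx_a.keys() & idx_b.keys()):
            row_a = read_row(fa, idx_a[key])
            row_b = read_row(fb, idx_b[key])
            if row_a != row_b:
                # record (column index, value in A, value in B) for each mismatch
                diffs["changed"][key] = [
                    (i, a, b) for i, (a, b) in enumerate(zip(row_a, row_b)) if a != b
                ]
    return diffs
```

Because only the keys and offsets are held in memory, the full 50-plus-column rows are re-read on demand with `seek`, which is the point of the scheme: neither file ever has to be loaded or sorted in full.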