Python Forum
Huge CSV Reading and Sorting for comparison
#1
Hello,
I have a requirement where I have two sets of unsorted CSV files.
The CSVs are huge (more than 10 million records and more than 50 columns).

I need to compare these files and highlight differences such as:
1. Rows that do not match.
2. Values that do not match within those rows.

I have tried filecmp. However, it works on sorted data only.

Could you please suggest an efficient way to achieve this in Python?

Thanks in advance
#2
Quote:I have tried filecmp. However, it works on sorted data only.
How do you expect to compare the files if they are not sorted?
#3
The files are so big that reading and sorting them takes a lot of time.
Also, there are around 300 such files with different structures (columns and rows).
Could you please suggest how to sort such files efficiently, so that filecmp can be used?

Thanks,
#4
Another way that I can see is to make a pass through each file and collect all of the unique keys into a table, along with their disk addresses (using 'tell'). Then sort the table on the keys (you can create a static hash on the key set if the index is large). With the indexes sorted, use the file addresses in the table to 'seek' to the records with the same key in each file and compare them.
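
A rough sketch of the index-and-seek idea, in case it helps. It is only an illustration under several assumptions that are not stated in the thread: each file has a header line, the key is a single unique column (the first one by default), no quoted field contains a line break, and the files are UTF-8. The function names build_index, read_row and compare_csv are made up for this example. Files are opened in binary mode and the offsets are accumulated from the line lengths, because calling tell() while iterating over a text-mode file raises an error.

```python
import csv
from itertools import zip_longest


def build_index(path, key_col=0, encoding="utf-8"):
    """Map each record's key to the byte offset where its line starts.

    Assumes one record per physical line and a header on the first line.
    """
    index = {}
    with open(path, "rb") as f:
        offset = len(f.readline())  # skip the header line
        for raw in f:
            if raw.strip():  # ignore blank lines
                row = next(csv.reader([raw.decode(encoding)]))
                index[row[key_col]] = offset
            offset += len(raw)
    return index


def read_row(f, offset, encoding="utf-8"):
    """Seek to a stored offset and parse the single record found there."""
    f.seek(offset)
    return next(csv.reader([f.readline().decode(encoding)]))


def compare_csv(path_a, path_b, key_col=0):
    """Return (keys only in A, keys only in B, per-key field differences)."""
    idx_a = build_index(path_a, key_col)
    idx_b = build_index(path_b, key_col)

    only_a = sorted(idx_a.keys() - idx_b.keys())
    only_b = sorted(idx_b.keys() - idx_a.keys())

    diffs = {}
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        # Visit the shared keys in sorted order and compare the records.
        for key in sorted(idx_a.keys() & idx_b.keys()):
            row_a = read_row(fa, idx_a[key])
            row_b = read_row(fb, idx_b[key])
            if row_a != row_b:
                # Record (column position, value in A, value in B).
                diffs[key] = [
                    (i, a, b)
                    for i, (a, b) in enumerate(zip_longest(row_a, row_b))
                    if a != b
                ]
    return only_a, only_b, diffs
```

Only the keys and offsets are held in memory, not the rows themselves, so the 50 columns do not count against RAM. With more than 10 million keys the dictionaries may still get large; in that case the index could be written to disk or the keys hashed, as suggested above.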