Sep-12-2024, 02:34 PM
So we have a scenario where we need to compare 2 files and create a 3rd file of the duplicates.
I can't provide actual files due to the sensitivity of the data, but here is the situation.
Need to compare these 2 files and generate 2 new files for reprocessing.
#1 File: Original (cannot be edited or manipulated in any way)
#2 File: Original, but it was accidentally updated and now contains duplicate records mixed in with all the records
#3 File: Needs to contain only the records that were duplicated in both files (for verification purposes)
#4 File: Needs to be the clean version with NO duplicates. (This file should be identical to file #1, with the possibility of extra records.) This would then allow us to say that the original records exist and there are X number of new records that can be safely reprocessed.
What I'm looking for is direction on the best way to accomplish this. I have used pandas before; is that the best or easiest way to accomplish this task?
Is there a better package or tool to do this efficiently?
Not looking for examples at this time, just guidance on the proper tools to use and consider.
The one thing to note is that there could potentially be double-digit or triple-digit numbers of files to process quickly. Meaning on a good day it could be just a few files to compare, on a really bad day it could be over 50, and in really bad cases 100+.
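For reference, pandas can express this split in a few lines. Here is a minimal sketch with made-up column names and data (the real files can't be shared); it assumes whole-row equality defines a duplicate, which may need adjusting to the real record layout:

```python
import pandas as pd

def split_duplicates(updated: pd.DataFrame):
    """Split the accidentally-updated file into the duplicate rows
    (file #3) and a de-duplicated clean version (file #4)."""
    # Rows that are repeats of an earlier row -- the accidental
    # duplicates. This assumes a duplicate means the entire row
    # matches; pass subset=[...] to duplicated()/drop_duplicates()
    # if only certain key columns define identity.
    duplicates = updated[updated.duplicated(keep="first")]
    # File #4: the updated file with duplicates dropped -- this should
    # equal the original file plus any genuinely new records.
    clean = updated.drop_duplicates()
    return duplicates, clean

# Tiny illustration: original had ids 1-3; the updated file gained a
# duplicate of id 1 plus one genuinely new record (id 4).
updated = pd.DataFrame({"id": [1, 1, 2, 3, 4],
                        "value": ["a", "a", "b", "c", "d"]})
dupes, clean = split_duplicates(updated)
```

Since each file pair is processed independently, scaling to 50-100+ files is mostly a matter of running this per pair, e.g. with `concurrent.futures.ProcessPoolExecutor`; polars is a faster drop-in alternative if file sizes become a bottleneck.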