![]() |
Compare fields from two csv files - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: Compare fields from two csv files (/thread-37014.html) |
Compare fields from two csv files - georgebijum - Apr-23-2022 I am quite new to the Python Programming and would appreciate your help on the below. Use case - I have Customer files from two sources .Each of them are huge and is having around 1 million records.Need to compare only selected attributes and if varying need to print data from both files to a log file.Looking for something similar to database joins with where clause.What is the best Python approach in achieving this as row by row processing doesn't look to be an ideal approach ? Example - In the below CustomerID is the joining key File-CustomerDetails1 CustomerID Active Country Industry Colunn5 Colunn6 Colunn7 1 Y SGP BNK 2 Y SGP MFG 3 N SGP BNK File-CustomerDetails2 CustomerID Active Country Industry Colunn5 Colunn6 Colunn7 1 Y SGP BNK 2 N SGP MFG 3 N SGP BNK 4 Y JPY BNK Expected Output in an Excel Tab#1 Get all the Customer records that have Matching values for Active, Country ,Industry -- Output should have all the attribs from Customer file Tab#2 Get all the Customer records that have varying values in of Active, Country ,Industry fields -- Output should capture the values from both files side by side ie.How to simulate an sql join like -- select t1.* from t1 join t2 on t1. CustomerID=t2.CustomerID and ( t1. Active <> t2.Active OR t1. Country <> t2.Country ) Looking mainly for the best approach only as the row by row processing seems expensive for 1 Million records Thanks in Advance RE: Compare fields from two csv files - DPaul - Apr-23-2022 Homework ? I would make lists -> sets -> "intersections". From there, easy to find what is common, what is not. Paul RE: Compare fields from two csv files - Pedroski55 - Apr-25-2022 I never have anything to do with large amounts of data. My longest csv is only about 200 lines long. Don't know how you could compare each line without calling each line! Maybe I have not understood what you really want. def myApp(): # read in 2 big old .csv files (in my case tiny) path2csv1 = '/home/pedro/myPython/csv/csv1.csv' with open(path2csv1) as CSV1: list1 = CSV1.readlines() path2csv2 = '/home/pedro/myPython/csv/csv2.csv' with open(path2csv2) as CSV2: list2 = CSV2.readlines() # first problem: the lists are not equal in length? if not len(list1) == len(list2): print('Problem, the lists are not the same length ... ') difference = 0 if len(list1) > len(list2): difference = len(list1) - len(list2) if len(list2) > len(list1): difference = len(list2) - len(list1) dlist = [] for d in range(0, difference): dlist.append('X') if len(list1) > len(list2): list2 = list2 + dlist if len(list2) > len(list1): list1 = list1 + dlist # check lengths if not len(list1) == len(list2): print('Problem, the lists are not the same length ... ') elif len(list1) == len(list2): print('The lists now have the same length, we can proceed ... ' # generate a list of tuples of dissimilar lines def getDisimilar_lines(): for i in range(len(list1)): if list1[i] != list2[i]: yield (list1[i], list2[i], f'line_{i}') not_the_same = getDisimilar_lines() for t in not_the_same: print(t) # easy to write to Excel from here RE: Compare fields from two csv files - Pedroski55 - Apr-25-2022 Not Python, but have a look here. This guy starts with a 50GB file! Once you chop up your big file into multiple small files, it will be much easier to Python! |