Python Forum
Compare fields from two csv files
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Compare fields from two csv files
#1
I am quite new to the Python Programming and would appreciate your help on the below.

Use case - I have Customer files from two sources .Each of them are huge and is having around 1 million records.Need to compare only selected attributes and if varying need to print data from both files to a log file.Looking for something similar to database joins with where clause.What is the best Python approach in achieving this as row by row processing doesn't look to be an ideal approach ? Example - In the below CustomerID is the joining key

File-CustomerDetails1
CustomerID Active Country Industry Colunn5 Colunn6 Colunn7
1 Y SGP BNK
2 Y SGP MFG
3 N SGP BNK


File-CustomerDetails2
CustomerID Active Country Industry Colunn5 Colunn6 Colunn7
1 Y SGP BNK
2 N SGP MFG
3 N SGP BNK
4 Y JPY BNK

Expected Output in an Excel

Tab#1 Get all the Customer records that have Matching values for Active, Country ,Industry -- Output should have all the attribs from Customer file
Tab#2 Get all the Customer records that have varying values in of Active, Country ,Industry fields -- Output should capture the values from both files side by side
ie.How to simulate an sql join like --
select t1.* from t1 join t2 on t1. CustomerID=t2.CustomerID and ( t1. Active <> t2.Active OR t1. Country <> t2.Country )

Looking mainly for the best approach only as the row by row processing seems expensive for 1 Million records

Thanks in Advance
Reply
#2
Homework ?
I would make lists -> sets -> "intersections".
From there, easy to find what is common, what is not.
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply
#3
I never have anything to do with large amounts of data. My longest csv is only about 200 lines long.

Don't know how you could compare each line without calling each line!

Maybe I have not understood what you really want.

def myApp():
    # read in 2 big old .csv files (in my case tiny)
    path2csv1 = '/home/pedro/myPython/csv/csv1.csv'
    with open(path2csv1) as CSV1:
        list1 = CSV1.readlines()

    path2csv2 = '/home/pedro/myPython/csv/csv2.csv'
    with open(path2csv2) as CSV2:
        list2 = CSV2.readlines()        

    # first problem: the lists are not equal in length?
    if not len(list1) == len(list2):
        print('Problem, the lists are not the same length ... ')
    
    difference = 0   
    if len(list1) > len(list2):
        difference = len(list1) - len(list2)
    if len(list2) > len(list1):
        difference = len(list2) - len(list1)

    dlist = []
    for d in range(0, difference):
        dlist.append('X')

    if len(list1) > len(list2):
        list2 = list2 + dlist
    if len(list2) > len(list1):
        list1 = list1 + dlist   

    # check lengths
    if not len(list1) == len(list2):
        print('Problem, the lists are not the same length ... ')
    elif len(list1) == len(list2):
        print('The lists now have the same length, we can proceed ... '

    # generate a list of tuples of dissimilar lines    
    def getDisimilar_lines():
        for i in range(len(list1)):
            if list1[i] != list2[i]:
                yield (list1[i], list2[i], f'line_{i}')

    not_the_same = getDisimilar_lines()

    for t in not_the_same:
            print(t)    
    # easy to write to Excel from here
Reply
#4
Not Python, but have a look here.

This guy starts with a 50GB file!

Once you chop up your big file into multiple small files, it will be much easier to Python!
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Compare folder A and subfolder B and display files that are in folder A but not in su Melcu54 3 562 Jan-05-2024, 05:16 PM
Last Post: Pedroski55
  Compare 2 files tslavov 2 956 Feb-12-2023, 10:53 AM
Last Post: ibreeden
  Compare filename with folder name and copy matching files into a particular folder shantanu97 2 4,509 Dec-18-2021, 09:32 PM
Last Post: Larz60+
  Json fields not being retrieved mrcurious2020 4 2,065 Sep-14-2020, 06:24 AM
Last Post: bowlofred
  How can I compare Python XML-Files and add missing values from one to another kirat 2 2,698 Aug-30-2019, 12:17 PM
Last Post: perfringo
  Compare two large CSV files for a match Python_Newbie9 3 5,820 Apr-22-2019, 08:49 PM
Last Post: ichabod801
  Keyword compare in two files and output the results kotigasp 2 2,848 Jan-23-2018, 08:13 AM
Last Post: buran
  Mapping Fields zak308 0 2,501 Jan-09-2018, 10:02 PM
Last Post: zak308

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020