Python Forum

Match CSV files for difference
Hi guys!

I have a real-life problem and would like to know whether there is a more efficient way to solve it. I have two CSV files that I need to compare to see if there are any differences. Let's say we have a table:


The expected outcome is [B,D,F,G,R,H], because those values are in either file1 or file2, but not in both. I tackled this by iterating through each row of file1 and file2, building a list from each, and getting the differences with:

diff = set(list1) - set(list2)
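A side note on the line above: `set(list1) - set(list2)` only keeps values that are in list1 and missing from list2. To get values that are in exactly one of the two files, the symmetric difference (`^`) is probably what is wanted. A minimal sketch, assuming made-up sample values (the original table is not shown, so these lists are hypothetical):

```python
# Hypothetical sample values; the original table is not shown in the post.
list1 = ["A", "B", "C", "D", "E", "F"]
list2 = ["A", "C", "E", "G", "R", "H"]

set1, set2 = set(list1), set(list2)

only_in_1 = set1 - set2    # in list1 but not in list2
only_in_2 = set2 - set1    # in list2 but not in list1
in_one_only = set1 ^ set2  # symmetric difference: in exactly one of the two

print(sorted(only_in_1))
print(sorted(only_in_2))
print(sorted(in_one_only))
```

Sets are unordered, so the results are sorted here only to make the printout stable.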

The problem is that both files contain almost 100k records each, and it takes an awful lot of time to iterate through them. Is there a better way to work with big sets of data like this? I'm using the csv library and Python 3.5.
I ran a quick test with 100k records and got a result in under one second.
My test may not match your case, so can you show your code and a sample of your data?
OK, so maybe the reason is elsewhere. I'm using code that looks like this:

import csv, os
os.chdir(r"C:\Users\me\Desktop\compare files")
file1_list = []
file1 = open(r"file 1.csv")
file1_reader_obj = csv.reader(file1)
file1_data = list(file1_reader_obj)
for row in file1_data:
    x = file1_data.index(row)
And I just figured out I'm a moron, since I had already read the file into a list. So I can use something like this to compare both files:
for row in file1_list:
    x = file1_list.index(row)
    if file1_list[x][1] in file2_list:
        print(file1_list[x][1])
Now the issue is that both files are structured as a list of lists:
[['1', 'a', 'a', 'a'], ['2', 'b', 'b', 'b'], ['3', 'c', 'c', 'c'], ['4', 'd', 'd', 'd'], ['5', 'e', 'e', 'e'], ['6', 'f', 'f', 'f']]
And I just have to check whether the first item of each inner list in file1 (e.g. 1, 2, 3, etc.) also appears as the first item of some inner list in file2.
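One possible way to do that check without searching file2 over and over is to collect the first items of file2 into a set once, then test each row of file1 against it. This is only a sketch: the contents of `file2_list` are invented for illustration, and `file2_keys`, `matches` and `missing` are hypothetical names.

```python
# Sample rows in the same list-of-lists shape as above.
# The file2 rows are invented for illustration.
file1_list = [["1", "a", "a", "a"], ["2", "b", "b", "b"], ["3", "c", "c", "c"]]
file2_list = [["2", "x", "x", "x"], ["3", "y", "y", "y"], ["4", "z", "z", "z"]]

# Build the lookup set once; membership tests on a set do not scan a list.
file2_keys = {row[0] for row in file2_list}

# Rows of file1 whose first item also appears as a first item in file2.
matches = [row for row in file1_list if row[0] in file2_keys]

# Rows of file1 whose first item is absent from file2.
missing = [row for row in file1_list if row[0] not in file2_keys]

print(matches)
print(missing)
```

Note that `row[0] in file2_list` would not work here, because `file2_list` holds whole rows, not single values.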

Let me know if I'm not making any sense. I'm still learning the art of expressing my thoughts when it comes to programming issues :)
This is a problem:

for row in file1_data:
    x = file1_data.index(row)
The second line searches the list from the start on every iteration, so each row costs a full linear scan, and with 100k rows that adds up quickly. My first thought was that it's much better to use enumerate:

for x, row in enumerate(file1_data):
But then I read the third line. file1_data[x] is row. They're the same thing. Why go to all that trouble?

for row in file1_data:
Which is so simple it might as well be a list comprehension:

file1_list = [row[1] for row in file1_data]
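Putting the pieces of this thread together, an end-to-end version might look roughly like the sketch below. It is not the original poster's final script (which was not posted). The helper name `read_keys`, the choice of column 0, and the two throwaway demo files are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def read_keys(path, column=0):
    """Return the set of values found in one column of a CSV file."""
    with open(path, newline="") as handle:
        # Skip completely empty rows so row[column] is safe for column 0.
        return {row[column] for row in csv.reader(handle) if row}


# Demo with two small throwaway files standing in for the real ones.
with tempfile.TemporaryDirectory() as folder:
    path1 = os.path.join(folder, "file 1.csv")
    path2 = os.path.join(folder, "file 2.csv")
    with open(path1, "w", newline="") as handle:
        csv.writer(handle).writerows([["1", "a"], ["2", "b"], ["3", "c"]])
    with open(path2, "w", newline="") as handle:
        csv.writer(handle).writerows([["2", "b"], ["3", "c"], ["4", "d"]])

    keys1 = read_keys(path1)
    keys2 = read_keys(path2)

# Keys that appear in exactly one of the two files.
differences = sorted(keys1 ^ keys2)
print(differences)
```

Opening the files in a `with` block closes them automatically, and building each set is a single pass over the file, so this should scale to 100k rows without trouble.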
Thanks! This really helps and I have fully working script now :)

I think that I completely misunderstood the index() method.
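For anyone with the same confusion: `list.index(value)` searches the list from the beginning and returns the position of the first item equal to `value`. Inside a loop over the same list that is both slow and, when rows repeat, wrong. A small sketch with made-up rows (the variable names are hypothetical):

```python
# Made-up rows; the first and last rows are deliberately identical.
rows = [["1", "a"], ["2", "b"], ["1", "a"]]

# index() returns the position of the FIRST equal item it finds.
first_match = rows.index(["1", "a"])

# Looking up each row's position with index() repeats that search per row
# and reports the duplicate row at the wrong position.
positions_via_index = [rows.index(row) for row in rows]

# enumerate() hands out the real position as it goes, with no searching.
positions_via_enumerate = [i for i, row in enumerate(rows)]

print(first_match)
print(positions_via_index)
print(positions_via_enumerate)
```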