Python Forum
Match CSV files for difference
#1
Hi guys!

I have a real-life problem and want to know whether there is a more efficient way to solve it. I have two CSV files that I need to compare to see if there are any differences. Let's say we have this table:

file1.csv    file2.csv
A            A
B            C
C            E
D            F
E            G
R            Z
Z            H

The expected outcome is [B, D, F, G, R, H], because those values are in either file1 or file2, but not in both. I tackled this by iterating through each row of file1 and file2, building a list from each, and taking the difference with:

diff = set(list1) - set(list2)

The problem is that both files contain almost 100k records each, and it takes an awfully long time to iterate through them. Is there a better way to work with big data sets like this? I'm using the csv library and Python 3.5.
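A side note on the set expression above: list1 - list2 as sets only returns values that are in the first file and missing from the second. The expected outcome described here (values in one file but not both) corresponds to the symmetric difference instead. A minimal sketch, using the sample table from this post as hard-coded lists (the variable names are made up for illustration):

```python
# Sample values taken from the table in the post.
file1_values = ["A", "B", "C", "D", "E", "R", "Z"]
file2_values = ["A", "C", "E", "F", "G", "Z", "H"]

# Plain difference: only values that file1 has and file2 lacks.
only_in_file1 = set(file1_values) - set(file2_values)

# Symmetric difference: values that appear in exactly one of the two files.
in_exactly_one = set(file1_values) ^ set(file2_values)

print(sorted(only_in_file1))   # ['B', 'D', 'R']
print(sorted(in_exactly_one))  # ['B', 'D', 'F', 'G', 'H', 'R']
```

Sets do not keep order, so the results are sorted here only to make the output predictable.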
Reply
#2
Hello,
I ran a quick test with 100k records and got a result in under one second.
My test may not match your case, so can you show your code and a sample of the data?
Reply
#3
OK, so maybe the cause is elsewhere. I'm using code that looks like this:

import csv, os
os.chdir(r"C:\Users\me\Desktop\compare files")
file1_list = []
file1 = open(r"file 1.csv")
file1_reader_obj = csv.reader(file1)
file1_data = list(file1_reader_obj)
for row in file1_data:
    x = file1_data.index(row)
    file1_list.append(file1_data[x][1])

And I just figured out I'm a moron, since I had already read the file into a list. So I can use something like this to compare both files:
for row in file1_list:
    x = file1_list.index(row)
    if file1_list[x][1] in file2_list:
        continue
    else:
        print(file1_list[x][1])
Now the issue is that both files are structured as a list of lists:
Output:
[['1', 'a', 'a', 'a'], ['2', 'b', 'b', 'b'], ['3', 'c', 'c', 'c'], ['4', 'd', 'd', 'd'], ['5', 'e', 'e', 'e'], ['6', 'f', 'f', 'f']]
And I just have to check whether the first item of each inner list in file1 (e.g. 1, 2, 3, etc.) appears as the first item of any inner list in file2.
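One way this check could be sketched: collect the first items of file2's rows into a set once, then test each file1 row against it. The sample rows below are invented in the same shape as the output above, and file2_keys and missing_rows are hypothetical names.

```python
# Invented sample data shaped like the list of lists shown above.
file1_data = [['1', 'a', 'a', 'a'], ['2', 'b', 'b', 'b'], ['3', 'c', 'c', 'c']]
file2_data = [['2', 'b', 'b', 'b'], ['3', 'x', 'x', 'x'], ['4', 'd', 'd', 'd']]

# Build the lookup set once; membership tests on a set do not scan the data.
file2_keys = {row[0] for row in file2_data}

# Keep the file1 rows whose first item never appears as a first item in file2.
missing_rows = [row for row in file1_data if row[0] not in file2_keys]

print(missing_rows)  # [['1', 'a', 'a', 'a']]
```

Testing "in" against a list of 100k items walks the list each time, while the set lookup stays fast regardless of size, which is why the set is built up front.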

Let me know if I'm not making sense. I'm still learning the art of expressing my thoughts when it comes to programming issues :)
Reply
#4
This is a problem:

for row in file1_data:
    x = file1_data.index(row)
    file1_list.append(file1_data[x][1])
The second line searches the list from the start on every iteration, which gets very slow with 100k rows. My first thought was that it's much better to use enumerate:

for x, row in enumerate(file1_data):
    file1_list.append(file1_data[x][1])
But then I read the third line. file1_data[x] is row. They're the same thing. Why go to all that trouble?

for row in file1_data:
    file1_list.append(row[1])
Which is so simple it might as well be a list comprehension:

file1_list = [row[1] for row in file1_data]
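Putting the pieces of this thread together, the whole comparison might look roughly like the sketch below. It assumes the value of interest is in column index 1, as in the code above. read_column is a hypothetical helper, and io.StringIO stands in for the real files so the example runs on its own.

```python
import csv
import io


def read_column(csv_file, column=1):
    """Collect one column of an open CSV file into a set, skipping short rows."""
    return {row[column] for row in csv.reader(csv_file) if len(row) > column}


# In-memory stand-ins for the two files (contents invented for illustration).
file1 = io.StringIO("1,A\n2,B\n3,C\n")
file2 = io.StringIO("1,A\n2,C\n3,E\n")

col1 = read_column(file1)
col2 = read_column(file2)

# Values that appear in exactly one of the two columns.
difference = sorted(col1 ^ col2)
print(difference)  # ['B', 'E']
```

With real files, the StringIO objects would be replaced by something like open(path, newline="") inside a with block, which is how the csv module documentation recommends opening files.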
Craig "Ichabod" O'Brien - xenomind.com
Reply
#5
Thanks! This really helps and I have a fully working script now :)

I think that I completely misunderstood the index() method.
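For anyone reading later, here is a small illustration of what list.index() actually does: it scans from the start and returns the position of the first equal item, so it is both slow inside a loop and wrong when rows repeat. The rows below are made up for the example.

```python
# Made-up rows; note that the first and last rows are equal.
rows = [["1", "a"], ["2", "b"], ["1", "a"]]

# index() returns the position of the FIRST equal item.
first_match = rows.index(["1", "a"])
print(first_match)  # 0

# Calling index() per row rescans the list and collapses duplicates.
positions_via_index = [rows.index(row) for row in rows]
print(positions_via_index)  # [0, 1, 0]

# enumerate() yields the real position of each row without any searching.
positions_via_enumerate = [i for i, row in enumerate(rows)]
print(positions_via_enumerate)  # [0, 1, 2]
```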
Reply

