File Content comparison-csv

JP_ROMANO · (This post was last modified: Jun-14-2017, 04:50 PM by JP_ROMANO.)

Good day - I thought I was going to be able to put my continuous python failures behind me for another month or so, but I've received a new task at work that *may* be easier than what I was trying to do before.

Unlike prior efforts, I'm not looking for an answer, but am hoping for some guidance (what libraries and which functions to investigate).

Here's the problem I'm trying to solve:
I have two CSV files with different row counts. I have deleted all of the columns to remove the complexity of dealing with a grid or table or whatever it may be called in Python. So I'm left with two files each with a single column of data. I have truncated, capitalized, ensured that the data are all alpha only strings, de-duped, and removed all non printable characters to avoid receipt of confusing error messages. I THINK my data is as clean and simple as I can make it.

I would like to end up with 3 things (but I'd be thrilled if I can make any one of them happen) in an output file (which can be a text file or csv or anything else I can open and read or send to a printer):

1) What is in File A that is not in File B
2) What is in File B that is not in File A
3) What is common to both File A and File B

I copied some code that I found in a colleague's old repository, but it ouputs an empty file. I'll add it here in case the best approach is to simply modify her script.

And if it matters, I believe I'm using 3.5.1 and PyCharm as the editor.

import pandas
with open('C:\Temp\INDU.csv', 'r') as file1:
    with open('C:\Temp\MOV.csv', 'r') as file2:
        same = set(file1).intersection(file2)

same.discard('\t')

with open('C:\Temp\CompareResults.csv', 'w') as file_out:
    for line in same:
        file_out.write(line)

Thank you for any suggestions!

***Ofnuts*** · Jun-15-2017, 09:30 AM

You won't go very far if you don't try to understand what is going on in the code.

The code above use sets, on which you can perform intersection and difference.

There is one trick in the code above, you can construct a set from a sequence, and an open file object being a sequence of all the lines, set(file) creates a set with every line in the file. However, these lines include or not the \n delimiter so reading them directly in the set without sanitizing them first can be dangerous. The output file is also created with the hope that the line delimiters are all there already.

JP_ROMANO · Jun-15-2017, 12:48 PM

Ofnuts - thank you for the reply and the explanation - really appreciate it.

Yes, I've tried to understand the code. In all honesty, I've done codeacademy and treehouse tutorials, watched dozens of youtube videos, have gone through page after page of "help" on different python.org sites and reached out to people in Stackoverflow.com. I'm currently midway through another series of tutorials by O'Reilly. I find Python horrifically unintuitive and the supporting documentation written by developers for developers, not people without PhD's. I build desktop solutions using VBA and SQL, mostly, so it could be that I've been spoiled by the simplicity of the syntax over the years.

In any event, while I'm grateful for your time and guidance, I'm not quite sure if you're saying that the above code can be made to work, or if I'm barking up the wrong tree. I'll try to figure out what the difference is between sequences and lists, and what intersection and differences are, and what library I have to import to use them (if any).

Thank you again and have a great day.

JP_ROMANO · Jun-15-2017, 02:37 PM

Okay - I got it mostly working

#Tell Python what files you want to compare
FILE1 = open('C:\\TEMP\\INDU.csv').readlines()
FILE2 = open('C:\\TEMP\\MOV.csv').readlines()

#Create a 'LIST' for each file, and strip out the new line breaks (\n)
list1 = []
for t in FILE1:
    t1 = t.rstrip('\n')
    list1.append(t1)

list2=[]
for tickers in FILE2:
    t2 = tickers.rstrip('\n')
    list2.append(t2)

#Now that there are two lists, just remove one from the other, in either direction

# What is in list 1 that is not in list 2
print(set(list1)- set(list2))
# What is in list 2 that is not in list 1
print(set(list2)- set(list1))
# What is common to both lists
print(set(list1)& set(list2))

But now I'm trying to get it to write out to a csv file - and am getting this error:
TypeError: write() argument must be str, not set

So I added bytes(x,'UTF-8') but to no avail.

Any guidance?
THANKS

**nilamo** · Jun-15-2017, 04:50 PM

Sure, instead of writing the whole set, iterate over the set and write it one-by-one.

>>> list1 = ["spam", "eggs"]
>>> list2 = ["beef", "chicken", "spam"]
>>> left_only = set(list1) - set(list2)
>>> left_only
{'eggs'}
>>> with open("left_only.csv", "w") as out:
...    for item in left_only:
...       print(item, file=out)

While we're at it, there's no reason to build a whole list of the file if you're going to only use a set (which contains only unique values, so it uses less memory if there's duplicates).

left = set()

#Tell Python what files you want to compare
with open('C:\\TEMP\\INDU.csv') as file1:
   for row in file1:
       # strip() removes all whitespace, you don't need to specify specifically only newlines
       value = row.strip()
       left.add(value)

JP_ROMANO · Jun-15-2017, 05:34 PM

nilamo - works beautifully, thank you so much!

File Content comparison-csv

User Panel Messages

Announcements