Python Forum

Full Version: finding dupes (with a twist)
OK, new to coding and python. I figure if I have a real live application I will learn quicker. So here is the problem I'm trying to solve. Let me know if you think I am on the right track.

I have an excel spreadsheet with two columns: Division and Email address. There are three possible values for Division: Electrical, Automotive, or Fuses. I want to find all duplicate email addresses. EXCEPT that if one of the dupes is in the Electrical division AND the other is in the Fuses division, that's OK, not classified as a dupe.

So not knowing the terminology yet, here's what I think my approach will be:

sort the excel file by email address first
compare email address x+1 to x
if it matches (dupe), then compare Division(x+1) to Division(x)
if the Divisions match, it is a dupe and send email address to Dupe file
Increment x and loop back

I think my logic is correct, I guess my question is this logic and process suitable for a python program or am I barking up the wrong tree?
You have the right idea. Python's a general-purpose language, so it can do pretty much anything a computer is capable of doing. The logic is sound, though you'll need at least one try...except block for indexing errors.

If your interest is cleaning out the duplicates, you could use a dictionary of sets to eliminate duplicates and then rebuild the Excel sheet with the clean data. You'd only have to iterate once and sorting would not be required.
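A minimal sketch of that dictionary-of-sets idea (the sample rows and column layout here are made up for illustration): one pass over the data, no sorting needed.

```python
# Sketch: dictionary of sets, keyed by division.
# Values are the (lower-cased) emails already seen in that division.
rows = [
    ("Electrical", "Alice@example.com"),
    ("Automotive", "bob@example.com"),
    ("Electrical", "alice@example.com"),  # duplicate within Electrical
]

seen = {}            # division -> set of emails seen so far
clean, dupes = [], []
for division, email in rows:
    key = email.lower()
    if key in seen.setdefault(division, set()):
        dupes.append((division, email))   # second and later occurrences
    else:
        seen[division].add(key)
        clean.append((division, email))

print(dupes)  # [('Electrical', 'alice@example.com')]
```

The `clean` list could then be written back out as the rebuilt sheet.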
Excellent, thanks for the reply! I'll definitely research the try...except block. And indexing errors.

The excel data is an output from another program (CRM). Once the dupes are identified, they will need to be researched before action is taken, so a separate excel with just the dupes would be ideal. Could I still accomplish that with the dictionary concept? I like the idea of not having to sort first.
You can read the Excel file into a list of tuples (division, email), pass it to collections.Counter, and get how many times each pair appears in the list.
Of course that is just one way to do it.
OK, cool. So then I guess there is a way I can return a list of every pair that occurs more than once?
Haven't come across collections.Counter in my studies yet... more research required!
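For the record, Counter makes the "occurs more than once" part a one-liner (the sample pairs below are made up):

```python
from collections import Counter

pairs = [
    ("Electrical", "alice@example.com"),
    ("Automotive", "bob@example.com"),
    ("Electrical", "alice@example.com"),
]

counts = Counter(pairs)  # maps each (division, email) pair to its count
dupes = [pair for pair, n in counts.items() if n > 1]
print(dupes)  # [('Electrical', 'alice@example.com')]
```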
To make it simpler, I would save the Excel file as CSV. Then you can use just the standard library; no need to install a third-party package for reading the Excel file.
https://docs.python.org/3.8/library/coll...ns.Counter

Of course, your problem of finding the dupes can be solved with a simple pivot table in Excel, no code at all.
Oh, wow. I would never have thought of using a pivot table. That solved my problem in less than 5 minutes.

Thank you!

And damn you! Now I have to find another problem to solve! LOL! It must be hard to find data problems to solve when excel is so darn powerful.
Still decided to pursue this. My thinking is that if I'm going to do it once a month, a script will be more efficient than going through the multiple steps required of the pivot table method. Plus, with the script I can hand the task over to my assistant, who doesn't have the pivot table knowledge.

Started learning how to import a csv and using tuples and collections. I'm finding it confusing, but I'll stick with it.

In the meanwhile, started looking into pandas, which rocked my world. I created my program with two lines of code!

df = pd.read_excel(r'contacts.xlsx')
dupes = df[df.duplicated(subset=['Division', 'EmailAddress'], keep=False)]

Wow!

But I still want to go back and do it old school so I can understand first principles. Wish me luck!
OK, super-pumped! I wrote code without Pandas. In writing this code I realized an error with my Pandas code: the dupe search was case sensitive, so I missed a bunch of dupes. Discovered .lower() in the code below and now properly capture them all. (I'm sure there's a way to do this in Pandas, I just don't know what that is yet.)

I am impressed with how far I have come in just a few weeks and really see the power of python to help me in data analysis.

I'm sure the code below looks quite clunky to you, so please let me know how to improve. Obviously the big weakness here is that the imported file MUST be sorted by email address. I would like to overcome that shortcoming!

I did try buran's suggestion above, "You can read the excel into a list of tuples (division, e-mail) and pass it to collections.Counter and get how many times each pair is in the list.."

The problem is that there are 5 other fields in each row that are required in the duplicate output in order to be able to do the followup research. After I found my division-email duplicates, I couldn't figure out how to repatriate the rest of the fields in those rows.
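One way to get the full rows back (a sketch; it assumes Division is column 0 and Email is column 4, as in the code below, and the sample rows are made up): count the (division, email) keys in a first pass, then keep any complete row whose key occurs more than once.

```python
from collections import Counter

# Each row carries extra fields that must survive into the dupes output.
rows = [
    ["Electrical", "f2", "f3", "f4", "Alice@example.com"],
    ["Automotive", "f2", "f3", "f4", "bob@example.com"],
    ["Electrical", "f2", "f3", "f4", "alice@example.com"],
]

def key(row):
    # The duplicate test only looks at division and lower-cased email.
    return (row[0], row[4].lower())

counts = Counter(key(row) for row in rows)                 # pass 1: count keys
dupe_rows = [row for row in rows if counts[key(row)] > 1]  # pass 2: keep whole rows
print(dupe_rows)  # both Electrical rows, with all their fields intact
```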

***EDIT: I see the tabs aren't showing up in my code. Sorry, I don't know how to fix that!***

CODE:
import csv

with open('Tdf contacts.csv', 'r') as read_obj:
    csv_reader = csv.reader(read_obj)
    list_of_rows = list(csv_reader)

email1 = ""
division1 = ""
dupeslist = []

for row in list_of_rows:
    email2 = row[4].lower()
    division2 = row[0]
    if email2 == email1 and division2 == division1:
        dupeslist.append(row)
    email1 = email2
    division1 = division2

with open('new_dupes_file.csv', 'w', newline='') as f:
    write = csv.writer(f)
    write.writerows(dupeslist)
Your code skips the first occurrence of each (division, email) combination.
I assume your csv file has a header, e.g.
Output:
Division,field2,field3,field4,Email
Electrical,field2,field3,field4,[email protected]
Automotive,field2,field3,field4,[email protected]
Fuses,field2,field3,field4,[email protected]
Electrical,field2,field3,field4,[email protected]
Automotive,field2,field3,field4,[email protected]
Fuses,field2,field3,field4,[email protected]
Electrical,field2,field3,field4,[email protected]
Automotive,field2,field3,field4,[email protected]
Fuses,field2,field3,field4,[email protected]
using csv module
import csv
from collections import defaultdict

dupes = defaultdict(list)
with open('Tdf contacts.csv', 'r') as read_obj:
    csv_reader = csv.DictReader(read_obj)
    fieldnames = csv_reader.fieldnames
    for line in csv_reader:
        key = (line['Email'].lower(), line['Division'])
        dupes[key].append(line)

with open('new_dupes_file.csv', 'w', newline='') as f:
    wrtr = csv.DictWriter(f, fieldnames=fieldnames)
    wrtr.writeheader()
    for key, values in dupes.items():
        if len(values) > 1:
            wrtr.writerows(values)
using pandas
import pandas as pd 

df = pd.read_csv('Tdf contacts.csv')
df['Email'] = df['Email'].str.lower()
df['Duplicated'] = df.duplicated(subset=['Division', 'Email'], keep=False)
df_dupes = df[df['Duplicated']].sort_values(by=['Division', 'Email'])
df_dupes.drop(['Duplicated'], axis=1).to_csv('dupes.csv', index=False)