finding dupes (with a twist) - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: finding dupes (with a twist) (/thread-26340.html)
finding dupes (with a twist) - Lyle - Apr-28-2020

OK, new to coding and Python. I figure if I have a real live application I will learn quicker. So here is the problem I'm trying to solve. Let me know if you think I am on the right track.

I have an Excel spreadsheet with two columns: Division and Email address. There are three possible values for Division: Electrical, Automotive, or Fuses. I want to find all duplicate email addresses, EXCEPT that if one of the dupes is in the Electrical division AND the other is in the Fuses division, that's OK, not classified as a dupe.

Not knowing the terminology yet, here's what I think my approach will be:
- sort the Excel file by email address first
- compare email address x+1 to x
- if it matches (a dupe), then compare Division(x+1) to Division(x)
- if the Divisions match, it is a dupe; send the email address to a dupes file
- increment x and loop back

I think my logic is correct. I guess my question is: is this logic and process suitable for a Python program, or am I barking up the wrong tree?

RE: finding dupes (with a twist) - stullis - Apr-28-2020

You have the right idea. Python is a general-purpose language, so it can do pretty much anything a computer is capable of doing. The logic is sound, though you'll need at least one try...except block for indexing errors.

If your interest is cleaning out the duplicates, you could use a dictionary of sets to eliminate duplicates and then rebuild the Excel sheet with the clean data. You'd only have to iterate once, and sorting would not be required.

RE: finding dupes (with a twist) - Lyle - Apr-28-2020

Excellent, thanks for the reply! I'll definitely research the try...except block. And indexing errors. The Excel data is an output from another program (CRM). Once the dupes are identified, they will need to be researched before action is taken, so a separate Excel file with just the dupes would be ideal. Could I still accomplish that with the dictionary concept? I like the idea of not having to sort first.
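[A minimal sketch of the dictionary-of-sets idea stullis describes: one pass, no sorting, keeping at most one row per (division, email) combination. The `dedupe` function name, the two-column row layout, and the sample data are all made up for illustration.]

```python
from collections import defaultdict

def dedupe(rows):
    """Keep the first row for each (division, lowercased email) pair."""
    seen = defaultdict(set)   # email -> set of divisions already seen
    clean = []
    for division, email in rows:
        key = email.lower()
        if division not in seen[key]:
            clean.append((division, email))
        seen[key].add(division)
    return clean

rows = [
    ("Electrical", "a@x.com"),
    ("Fuses", "A@x.com"),        # kept: same email, different division
    ("Electrical", "a@X.com"),   # dropped: Electrical/a@x.com already seen
]
print(dedupe(rows))
```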
RE: finding dupes (with a twist) - buran - Apr-29-2020

You can read the Excel file into a list of (division, email) tuples and pass it to collections.Counter to get how many times each pair appears in the list. Of course, that is just one way to do it.

RE: finding dupes (with a twist) - Lyle - Apr-29-2020

OK, cool. So then I guess there is a way I can return a list of every pair that occurs more than once? I haven't come across collections.Counter in my studies yet... more research required!

RE: finding dupes (with a twist) - buran - Apr-30-2020

To make it simpler, I would save the Excel file as CSV. Then you can just use the standard library; no need to install a third-party package for reading the Excel file. https://docs.python.org/3.8/library/collections.html#collections.Counter

Of course, your problem of finding the dupes can be solved with a simple pivot table in Excel, no code at all.

RE: finding dupes (with a twist) - Lyle - Apr-30-2020

Oh, wow. I would never have thought of using a pivot table. That solved my problem in less than 5 minutes. Thank you! And damn you! Now I have to find another problem to solve! LOL! It must be hard to find data problems to solve when Excel is so darn powerful.

RE: finding dupes (with a twist) - Lyle - May-09-2020

Still decided to pursue this. My thinking is that if I'm going to do it once a month, a script will be more efficient than going through the multiple steps required by the pivot table method. Plus, with the script I can hand the task over to my assistant, who doesn't have the pivot table knowledge.

I started learning how to import a CSV and use tuples and collections. I'm finding it confusing, but I'll stick with it. In the meanwhile, I started looking into pandas, which rocked my world. I created my program with two lines of code!

df = pd.read_excel(r'contacts.xlsx')
dupes = df[df.duplicated(subset=['Division', 'EmailAddress'], keep=False)]

Wow! But I still want to go back and do it old school so I can understand first principles. Wish me luck!
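[Returning to buran's collections.Counter suggestion, here is a minimal sketch of counting (division, email) pairs and keeping those that appear more than once. The sample pairs are invented for illustration.]

```python
from collections import Counter

pairs = [
    ("Electrical", "a@x.com"),
    ("Electrical", "a@x.com"),
    ("Fuses", "a@x.com"),        # same email, different division: not a dupe
    ("Automotive", "b@x.com"),
]

counts = Counter(pairs)                              # pair -> occurrence count
dupes = [pair for pair, n in counts.items() if n > 1]
print(dupes)  # [('Electrical', 'a@x.com')]
```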
RE: finding dupes (with a twist) - Lyle - May-11-2020

OK, super-pumped! I wrote code without pandas. In writing this code I realized an error with my pandas code: the dupe search was case sensitive, so I missed a bunch of dupes. I discovered .lower() in the code below and now properly capture them all. (I'm sure there's a way to do this in pandas, I just don't know what that is yet.)

I am impressed with how far I have come in just a few weeks and really see the power of Python to help me in data analysis. I'm sure the code below looks quite clunky to you, so please let me know how to improve. Obviously the big weakness here is that the imported file MUST be sorted by email address. I would like to overcome that shortcoming!

I did try buran's suggestion above, "You can read the excel into a list of tuples (division, e-mail) and pass it to collections.Counter and get how many times each pair is in the list...". The problem is that there are 5 other fields in each row that are required in the duplicate output in order to be able to do the follow-up research. After I found my division-email duplicates, I couldn't figure out how to repatriate the rest of the fields in those rows.

***EDIT - I see the tabs aren't showing up in my code. Sorry, don't know how to fix that!***

CODE:

import csv

with open('Tdf contacts.csv', 'r') as read_obj:
    csv_reader = csv.reader(read_obj)
    list_of_rows = list(csv_reader)

email1 = ""
email2 = ""
division1 = ""
division2 = ""
dupeslist = []

for row in list_of_rows:
    email2 = row[4].lower()
    division2 = row[0]
    if email2 == email1:
        if division2 == division1:
            dupeslist.append(row)
        email1 = email2
        division1 = division2
    else:
        email1 = email2
        division1 = division2

file = open('new_dupes_file.csv', 'w', newline='')
with file:
    write = csv.writer(file)
    write.writerows(dupeslist)

RE: finding dupes (with a twist) - buran - May-12-2020

Your code skips the first occurrence of a division, email combination. I assume your csv file has a header, e.g.
using csv module:

import csv
from collections import defaultdict

dupes = defaultdict(list)
with open('Tdf contacts.csv', 'r') as read_obj:
    csv_reader = csv.DictReader(read_obj)
    fieldnames = csv_reader.fieldnames
    for line in csv_reader:
        key = (line['Email'].lower(), line['Division'])
        dupes[key].append(line)

with open('new_dupes_file.csv', 'w') as f:
    wrtr = csv.DictWriter(f, fieldnames=fieldnames)
    wrtr.writeheader()
    for key, values in dupes.items():
        if len(values) > 1:
            wrtr.writerows(values)

using pandas:

import pandas as pd

df = pd.read_csv('Tdf contacts.csv')
df['Email'] = df['Email'].apply(lambda x: x.lower())
df['Duplicated'] = df.duplicated(subset=['Division', 'Email'], keep=False)
df_dupes = df[df['Duplicated']].sort_values(by=['Division', 'Email'])
df_dupes.drop(['Duplicated'], axis=1).to_csv('dupes.csv', index=False)
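[A small self-contained check of the pandas idea, using the vectorized `str.lower()` string accessor as an alternative to `apply(lambda x: x.lower())` for the case-insensitivity Lyle wanted. The tiny in-memory DataFrame is made up so this runs without the CSV file.]

```python
import pandas as pd

df = pd.DataFrame({
    "Division": ["Electrical", "Electrical", "Fuses"],
    "Email": ["A@x.com", "a@X.com", "a@x.com"],
})

df["Email"] = df["Email"].str.lower()   # normalize case before comparing
mask = df.duplicated(subset=["Division", "Email"], keep=False)
print(df[mask])   # the two Electrical rows; the Fuses row is not a dupe
```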