Python Forum
finding dupes (with a twist)
#11
Wow! My mind is officially blown! Thanks so much for this. Not only can I use this, but I have so much I can learn from it.

Any tips on learning resources? I have a couple books I'm plowing through, and watching some online tutorials.

My next step with this particular project is to incorporate an Exceptions list. After the research on the dupes is done, we will allow some dupes to stay in the system. So my idea is to keep an Exceptions list; after the program finds the dupes, it omits the Exceptions before writing the dupes_final file.
#12
I'm still at it folks! So I completed the second step by comparing an exceptions file to the dupes file and deleting those exceptions from the main dupes list to create dupes_cleaned.
It's taken two weeks of intense study and multiple, multiple iterations to finally figure out that it takes only two lines of code!!! BUT, that intense study means that 1) these lines are MINE and MINE alone! and 2) I understand the theory behind it much better than if I'd just copied someone else's code.
Please let me know if I missed anything or if you would have taken a different approach.
# load the management-defined exceptions and normalise the emails
exceptions = pd.read_csv('exceptions.csv')
exceptions['EmailAddress'] = exceptions['EmailAddress'].str.lower()
except_emails = exceptions['EmailAddress']

# drop every dupe whose email appears in the exceptions list
for e in except_emails:
    df_dupes = df_dupes[df_dupes['EmailAddress'] != e]

df_dupes.to_excel('dupes cleaned.xlsx', index=False)
#13
Actually, you can use pandas.DataFrame.drop_duplicates().
I will leave it to you to figure out how to use it.
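For readers following along, here is a minimal sketch of the call on toy data (the column name and emails below are made up for illustration, not taken from the thread's files):

import pandas as pd

# toy data: 'a@example.com' appears twice
df = pd.DataFrame({'EmailAddress': ['a@example.com', 'b@example.com', 'a@example.com']})

# keep=False drops *every* row that has a duplicate;
# keep='first' (the default) would keep one copy of each
print(df.drop_duplicates(subset=['EmailAddress'], keep=False))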
#14
Interesting! So, I should be able to do it with one line of code and no for loop?

From the documentation, I was hoping that this would work:

df_dupes.drop_duplicates(subset=exceptions["EmailAddress"], keep=False)
But it was a fail. I may need another hint!
#15
It's not working because you are trying to refer to another DataFrame's columns: subset expects labels of columns in the DataFrame you call it on, not values from a different frame.
Output:
Division,field2,field3,field4,Email
Electrical,field2,field3,field4,[email protected]
Automotive,field2,field3,field4,[email protected]
Fuses,field2,field3,field4,[email protected]
Electrical,field2,field3,field4,[email protected]
Automotive,field2,field3,field4,[email protected]
Fuses,field2,field3,field4,[email protected]
Electrical,field2,field3,field4,[email protected]
Automotive,field2,field3,field4,[email protected]
Fuses,field2,field3,field4,[email protected]
import pandas as pd

df = pd.read_csv('Tdf contacts.csv')
# normalise case before comparing
df['Email'] = df['Email'].apply(lambda x: x.lower())
# keep=False drops every row that is duplicated on Division + Email
df.drop_duplicates(subset=['Division', 'Email'], keep=False).to_csv('dupes_cleaned.csv', index=False)
Cleaned:
Output:
Division,field2,field3,field4,Email
Automotive,field2,field3,field4,[email protected]
Automotive,field2,field3,field4,[email protected]
Electrical,field2,field3,field4,[email protected]
Automotive,field2,field3,field4,[email protected]
Fuses,field2,field3,field4,[email protected]
#16
I might not have explained my new goal clearly enough.
I already found my dupes contained in Tdf_contacts.csv using the code you gave me in post #10 (repeated in post #15).
NOW I have a list of "exceptions" - that is, certain dupes that I am allowing to stay.
So my 2nd goal is to delete every row in dupes where dupes['EmailAddress'] matches an email in exceptions['EmailAddress'].
(The exceptions df is a management-defined subset of Tdf_contacts and will have all the same fields.)
dupes_cleaned will be a subset of dupes with the exceptions removed.
My code with the for loop in post #12 accomplished that, but I was wondering if there is another way?
#17
Ah, my mistake

import pandas as pd 
 
df_dupes = pd.read_csv('dupes.csv') # dupes
dfe = pd.read_csv('exceptions.csv') # the exceptions
# make sure email is lower case
df_dupes['Email'] = df_dupes['Email'].apply(lambda x: x.lower())
dfe['Email'] = dfe['Email'].apply(lambda x: x.lower())

# get just the records from dupes where email not in exceptions
df_clean = df_dupes[~df_dupes['Email'].isin(dfe['Email'])]
print(df_clean)
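If you want the result on disk as in post #12 rather than printed, a write at the end would do it (the filename here is just an example):

df_clean.to_excel('dupes_cleaned.xlsx', index=False)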
#18
"isin"? damn, who knew? wow! clearly my journey has barely begun. Thanks so much for this!
#19
pandas.DataFrame.isin()

The ~ in front reverses the boolean mask - i.e. "not in".
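A tiny self-contained illustration (toy emails of my own, not from the thread's files):

import pandas as pd

df_dupes = pd.DataFrame({'Email': ['a@example.com', 'b@example.com', 'c@example.com']})
allowed = pd.Series(['b@example.com'])

mask = df_dupes['Email'].isin(allowed)  # True where the email is in the exceptions
print(df_dupes[~mask])                  # ~ flips the mask: keep rows NOT in the exceptions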
#20
Hi, buran.

Still on my to do list is to dissect this line from your code:
df['Email'] = df['Email'].apply(lambda x: x.lower())
However, in my travels I have recently come across this approach:
df['Email'] = df['Email'].str.lower()
which is much easier for me to grasp.

What would you say are the relative pros & cons of each approach?
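For anyone weighing the two, here is a small sketch of one practical difference (my own toy example; the NaN behaviour shown is general pandas behaviour, not something discussed in this thread): .str.lower() passes missing values through, while the lambda version blows up on them.

import pandas as pd

df = pd.DataFrame({'Email': ['A@Example.COM', None]})

# .str.lower() is NaN-safe: the missing value passes through as NaN
print(df['Email'].str.lower())

# .apply(lambda x: x.lower()) gives the same result on clean string data,
# but raises AttributeError here because None has no .lower()
try:
    df['Email'].apply(lambda x: x.lower())
except AttributeError as err:
    print('apply failed:', err)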