Python Forum

jupyter pandas remove duplicates help
Hi there

Why is the number of rows still the same after I drop the duplicates?


data.shape
output:(15631, 12)

data1 = data
data1.sort_values(by=['MACHINERYSTATUS','DATECREATED'])
data1.drop_duplicates(['CNTRNO'], keep='last')
data1.shape
output: (15631, 12)
Try checking which rows pandas treats as duplicates:

data1.duplicated()
or

print(data1.duplicated())
I have played around with this function myself now, and it's easy to confuse rows and columns.
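
To illustrate the rows-versus-columns point, here is a minimal sketch on made-up data (only the column names come from the thread; the values are hypothetical). With no arguments, duplicated() compares whole rows, so it can report fewer duplicates than a check restricted to CNTRNO.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the thread.
data1 = pd.DataFrame({
    "CNTRNO": ["A1", "A1", "B2", "C3", "C3"],
    "MACHINERYSTATUS": ["ON", "OFF", "ON", "ON", "ON"],
})

# With no arguments, a row counts as a duplicate only if every column
# matches an earlier row.
full_row_dupes = data1.duplicated()

# Restricting the check to one column flags repeated CNTRNO values.
cntrno_dupes = data1.duplicated(subset=["CNTRNO"])

# keep=False marks every member of a duplicate group, which helps when
# inspecting the affected rows.
print(data1[data1.duplicated(subset=["CNTRNO"], keep=False)])
```

If the whole-row check shows no duplicates but the CNTRNO check does, the rows differ in some other column.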
Hi

After dropping the duplicates, they are still there:

data.sort_values(by=['CNTRNO','DATECREATED'])
data.drop_duplicates(['CNTRNO'], keep='last')
data.duplicated('CNTRNO')
Output:
0     False
1     False
2      True
3      True
4      True
5     False
6      True
7      True
8      True
9      True
10     True
11     True
12     True
Hi again

Try and change your second line:
data.drop_duplicates(['CNTRNO'], keep='last')
To:
data3 = data.drop_duplicates(['CNTRNO'], keep='last')
And see how the new dataframe, a modified copy of 'data', behaves.
drop_duplicates does not change 'data' itself. By default it returns a new dataframe, so the result has to be assigned to a name (or you can pass inplace=True). The same goes for sort_values.
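
A minimal sketch of that suggestion on made-up data (the values and the names data_sorted and data3 are only for illustration). It assumes the goal is to keep the latest DATECREATED row for each CNTRNO.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the thread.
data = pd.DataFrame({
    "CNTRNO": ["A1", "B2", "A1", "B2", "C3"],
    "DATECREATED": pd.to_datetime(
        ["2020-01-02", "2020-01-01", "2020-01-01", "2020-01-03", "2020-01-01"]
    ),
})

# Both calls return new DataFrames, so the results have to be assigned.
data_sorted = data.sort_values(by=["CNTRNO", "DATECREATED"])
data3 = data_sorted.drop_duplicates(subset=["CNTRNO"], keep="last")

print(data.shape)   # the original is unchanged
print(data3.shape)  # one row per CNTRNO
```

Note also that data1 = data only gives the same dataframe a second name. Use data.copy() if an independent copy is wanted.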

Next time, consider posting a small subset of your df so that we can see what the data looks like.