Python Forum
How to mark duplicate rows in pandas



How to mark duplicate rows in pandas - Mekala - Sep-14-2020

Hi,
I have the following pandas DataFrame:

ID   Pop  SG   time                 Stg  Rank  Name
A.1  T1   A.0  2020-08-01 10:45:00  VG   1     LA
A.2  T1   A.0  2020-08-02 10:45:34  VG   3     NT
K.6  T1   K.0  2020-08-03 10:45:20  BN   5     PX
A.2  T1   A.0  2020-08-04 13:03:55  VG   8     BN
K.3  T1   K.0  2020-08-05 14:45:13  BN   1     LA
K.7  T1   K.0  2020-08-06 15:45:43  BN   0     NN
K.3  T1   K.0  2020-08-07 15:45:34  BN   3     CK
A.2  T1   H.0  2020-08-08 16:45:00  PP   8     BN
I want to mark a row as DUP if its ID, Pop, SG, and Stg values all match an earlier row (time and the other columns may differ), and as NOR otherwise.

Desired output:

ID   Pop  SG   time                 Stg  Rank  Name  Status
A.1  T1   A.0  2020-08-01 10:45:00  VG   1     LA    NOR
A.2  T1   A.0  2020-08-02 10:45:34  VG   3     NT    NOR
K.6  T1   K.0  2020-08-03 10:45:20  BN   5     PX    NOR
A.2  T1   A.0  2020-08-04 13:03:55  VG   8     BN    DUP
K.3  T1   K.0  2020-08-05 14:45:13  BN   1     LA    NOR
K.7  T1   K.0  2020-08-06 15:45:43  BN   0     NN    NOR
K.3  T1   K.0  2020-08-07 15:45:34  BN   3     CK    DUP
A.2  T1   H.0  2020-08-08 16:45:00  PP   8     BN    NOR
Is there a DataFrame method for this? Please help.


RE: How to mark duplicate rows in pandas - scidam - Sep-15-2020

What about df.duplicated(['ID', 'Pop', 'SG', 'Stg'])?
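
To illustrate what that call returns, here is a minimal sketch on the key columns of the sample data from the first post (time, Rank, and Name are left out for brevity; the variable name mask is just for illustration). With the default keep='first', the first row of each repeated group is False and later repeats are True.

```python
import pandas as pd

# Key columns from the sample data in the first post (time, Rank, Name omitted)
df = pd.DataFrame({
    "ID":  ["A.1", "A.2", "K.6", "A.2", "K.3", "K.7", "K.3", "A.2"],
    "Pop": ["T1"] * 8,
    "SG":  ["A.0", "A.0", "K.0", "A.0", "K.0", "K.0", "K.0", "H.0"],
    "Stg": ["VG", "VG", "BN", "VG", "BN", "BN", "BN", "PP"],
})

# Default keep='first': first occurrence is False, later repeats are True
mask = df.duplicated(subset=["ID", "Pop", "SG", "Stg"])
print(mask.tolist())
# [False, False, False, True, False, False, True, False]
```

The True positions (rows 3 and 6) line up with the rows marked DUP in the desired output.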


RE: How to mark duplicate rows in pandas - Mekala - Sep-17-2020

I tried the following:

idx= df.duplicated(['ID', 'Pop', 'SG','Stg']).tolist()
indexes = [n for n,x in enumerate(idx) if x==True]
df['new_col']='NOR'
df['new_col'].iloc[indexes]='DUP'
But I get the following warning:

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
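
The warning is triggered by chained indexing: df['new_col'] is selected first and .iloc then assigns into that intermediate object, so pandas cannot guarantee the write reaches df. A minimal sketch of one way to keep the positional approach while assigning in a single indexing call on df itself (the toy data here is made up for illustration):

```python
import pandas as pd

# Toy data for illustration only; row 2 repeats row 1
df = pd.DataFrame({"ID": ["A.1", "A.2", "A.2"], "Stg": ["VG", "VG", "VG"]})

# Positions of rows flagged as duplicates, built as in the attempt above
indexes = [n for n, x in enumerate(df.duplicated(["ID", "Stg"])) if x]

df["new_col"] = "NOR"
# Single .iloc call on df: row positions plus the integer position of the column
df.iloc[indexes, df.columns.get_loc("new_col")] = "DUP"
print(df["new_col"].tolist())
# ['NOR', 'NOR', 'DUP']
```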


RE: How to mark duplicate rows in pandas - scidam - Sep-17-2020

# keep='first' (the default) leaves the first occurrence unflagged, as in the desired output
is_dup = df.duplicated(subset=['ID', 'Pop', 'SG', 'Stg'], keep='first')
df.loc[is_dup, 'new_col'] = 'DUP'
df.loc[~is_dup, 'new_col'] = 'NOR'
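
As a possible alternative, the whole column can be built in one step with numpy.where. This is only a sketch on the key columns of the sample data; the column name Status is taken from the desired output and keys is an illustrative variable name. With keep=False, duplicated flags every row of a repeated group; with the default keep='first', the first occurrence stays unflagged, which is what the desired output in the first post shows.

```python
import numpy as np
import pandas as pd

# Key columns from the sample data in the first post (time, Rank, Name omitted)
df = pd.DataFrame({
    "ID":  ["A.1", "A.2", "K.6", "A.2", "K.3", "K.7", "K.3", "A.2"],
    "Pop": ["T1"] * 8,
    "SG":  ["A.0", "A.0", "K.0", "A.0", "K.0", "K.0", "K.0", "H.0"],
    "Stg": ["VG", "VG", "BN", "VG", "BN", "BN", "BN", "PP"],
})

keys = ["ID", "Pop", "SG", "Stg"]
# DUP where the row repeats an earlier row on the key columns, NOR otherwise
df["Status"] = np.where(df.duplicated(subset=keys), "DUP", "NOR")
print(df["Status"].tolist())
# ['NOR', 'NOR', 'NOR', 'DUP', 'NOR', 'NOR', 'DUP', 'NOR']
```

Because the assignment goes straight to df["Status"], there is no chained indexing and no copy warning.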