Python Forum

Full Version: [Pandas] Help with finding only non-matching rows
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Hello, new here and new to python. I am trying to write a function which will display only non-matched and/or missing rows between two datasets. Some of the datasets will be in the thousands of rows. Closest I can get thus far is with two pandas dataframes using merge, which shows all rows but any advice to only get non-matched or a better approach?

Example:

import pandas as pd
data1=[('id1','desc1'),('id2','desc2'),('id3','desc3')]
data2=[('id1','wrong description'),('id2','desc2')]
df1=pd.DataFrame(data1,columns=['ID','DESCRIPTION'])
df2=pd.DataFrame(data2,columns=['ID','DESCRIPTION'])
diff=pd.merge(df1,df2,how='outer',on=['ID'],suffixes=('_SRC','_TGT'),indicator=True)
print(diff)
Output:
ID DESCRIPTION_SRC DESCRIPTION_TGT 0 id1 desc1 wrong description 1 id2 desc2 desc2 2 id3 desc3 NaN
How would I get similar result with only row 0 and row 2?

Thank you.
import pandas as pd
data1=[('id1','desc1'),('id2','desc2'),('id3','desc3')]
data2=[('id1','wrong description'),('id2','desc2')]
df1=pd.DataFrame(data1,columns=['ID','DESCRIPTION'])
df2=pd.DataFrame(data2,columns=['ID','DESCRIPTION'])
print(pd.merge(df1,df2, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))
Output:
ID DESCRIPTION 0 id1 desc1 2 id3 desc3
Does this work?
Thank you so much! It is closer, but I am wanting to capture the differences between source and target. I am wanting output something like:

Output:
ID DESCRIPTION_SRC DESCRIPTION_TGT 0 id1 desc1 wrong description 2 id3 desc3 NaN
import pandas as pd
data1=[('id1','desc1'),('id2','desc2'),('id3','desc3')]
data2=[('id1','wrong description'),('id2','desc2')]
df1=pd.DataFrame(data1,columns=['ID','DESCRIPTION'])
df2=pd.DataFrame(data2,columns=['ID','DESCRIPTION'])
df1a = df1.copy()
df1a['Desc2']=df2['DESCRIPTION']
df1a['Different'] = (df1a['DESCRIPTION'] != df1a['Desc2'])
df1a = df1a[df1a.Different == True]
df1a
Output:
ID DESCRIPTION Desc2 Different 0 id1 desc1 wrong description True 2 id3 desc3 NaN True
You can then strip away the Different column.