Dec-21-2020, 04:49 PM
Hello,
Hope someone will be able to help me here, it’s been days Im trying to merge two dataframe with value that are not exactly the same.
To be able to do that Im using recordlinkage:
df1 : 8411 rows, 15 columns
df2 : 79 rows, 8columns
I have the invoice number in both dataframe but not written exactly the same way on both Df.
I want to add information from df1 to df2 using the column INV_NO but the information is not written properly in both columns :
for example in df2 you have T49528 and in df1 you have ST49528
an other example df2 you have 2587 and df1 you have C5635-002587.
my code so far :
I really don’t understand why. If someone can help me please.
Is there a way to merge information between those 2 dataframes with a value that is not exactly the same in both dataframe ?
Also, I know there is something wrong in my merge because I have duplicate rows in the file.
What is the proper way to merge please ?
Thanks in advance
(Is there a way to add files to the post ?)
Hope someone will be able to help me here, it’s been days Im trying to merge two dataframe with value that are not exactly the same.
To be able to do that Im using recordlinkage:
df1 : 8411 rows, 15 columns
df2 : 79 rows, 8columns
I have the invoice number in both dataframe but not written exactly the same way on both Df.
I want to add information from df1 to df2 using the column INV_NO but the information is not written properly in both columns :
for example in df2 you have T49528 and in df1 you have ST49528
an other example df2 you have 2587 and df1 you have C5635-002587.
my code so far :
import pandas as pd import recordlinkage df1 = pd.read_excel(r’C:…\File.xlsx’) df2 = pd.read_excel(r’C:…\File2.xlsx’) #to make sure all lines are at string format df2[‘INV_NO’] = df2[‘INV_NO’].astype(str) df1[‘INV_NO’] = df1[‘INV_NO’].astype(str) indexer = recordlinkage.Index() indexer.full() candidates = indexer.index(df2,df1) compare = recordlinkage.Compare() compare.string(‘INV_NO’,‘INV_NO’,threshold=0.85,label=‘inv’) features = compare.compute(candidates,df2,df1) #merge result with df2 matches = features[features['inv']==1] duplicate_rows = matches.index.get_level_values(1) df3=df1[df1.index.isin(duplicate_rows)] final_merge = df2.merge(df3,how='left',on='INV_NO') final_merge.to_csv('final_merge.csv')When I look at the result (features df) I can see the line with value 2587 is not there that means it did not find a match on df1 but I know there is one (C5635-002587).
I really don’t understand why. If someone can help me please.
Is there a way to merge information between those 2 dataframes with a value that is not exactly the same in both dataframe ?
Also, I know there is something wrong in my merge because I have duplicate rows in the file.
What is the proper way to merge please ?
Thanks in advance
(Is there a way to add files to the post ?)