Python Forum
Recordlinkage merge dataframe non exact value
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Recordlinkage merge dataframe non exact value
#1
Hello,

Hope someone will be able to help me here, it’s been days Im trying to merge two dataframe with value that are not exactly the same.
To be able to do that Im using recordlinkage:
df1 : 8411 rows, 15 columns
df2 : 79 rows, 8columns
I have the invoice number in both dataframe but not written exactly the same way on both Df.
I want to add information from df1 to df2 using the column INV_NO but the information is not written properly in both columns :
for example in df2 you have T49528 and in df1 you have ST49528
an other example df2 you have 2587 and df1 you have C5635-002587.
my code so far :

import pandas as pd
import recordlinkage
df1 = pd.read_excel(r’C:…\File.xlsx’)
df2 = pd.read_excel(r’C:…\File2.xlsx’)

#to make sure all lines are at string format
df2[‘INV_NO’] = df2[‘INV_NO’].astype(str)
df1[‘INV_NO’] = df1[‘INV_NO’].astype(str)

indexer = recordlinkage.Index()
indexer.full()
candidates = indexer.index(df2,df1)
compare = recordlinkage.Compare()
compare.string(‘INV_NO’,‘INV_NO’,threshold=0.85,label=‘inv’)
features = compare.compute(candidates,df2,df1)

#merge result with df2
matches = features[features['inv']==1]
duplicate_rows = matches.index.get_level_values(1)
df3=df1[df1.index.isin(duplicate_rows)]
final_merge = df2.merge(df3,how='left',on='INV_NO')
final_merge.to_csv('final_merge.csv')
When I look at the result (features df) I can see the line with value 2587 is not there that means it did not find a match on df1 but I know there is one (C5635-002587).

I really don’t understand why. If someone can help me please.
Is there a way to merge information between those 2 dataframes with a value that is not exactly the same in both dataframe ?

Also, I know there is something wrong in my merge because I have duplicate rows in the file.
What is the proper way to merge please ?

Thanks in advance
(Is there a way to add files to the post ?)
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Pandas dataframe merge snmmat 1 2,097 Mar-09-2020, 06:56 PM
Last Post: jefsummers
  tell exact difference between xlim() and xticks() function ift38375 3 5,182 Jul-12-2019, 12:04 AM
Last Post: scidam
  finding exact and similar matches from pandas dataframe? PrateekG 0 4,201 Apr-22-2018, 01:22 PM
Last Post: PrateekG

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020