Fuzzy match on text columns within dataframe

Nsaibot · Aug-27-2018, 10:52 PM

I have two datasets df1 and df2, then I did a merge on StateC and CountyC (df1) to STATE_NUMERIC and COUNTY_NUMERIC (df2) for the end result of df3. These four columns are state and county codes. I know my code is not elegant in any way.

Within df3 there are 30 columns that are included which is what I want. I am wanting to do a fuzzy logic match/merge on two columns: Community and FEATURE_NAME. These two columns are text columns that correspond to locations in the United States and I would like a fuzzy match or merge because there may be slight differences between the text. When I do a merge many locations are excluded. If you know of a way that I can do a fuzzy logic match that would be extremely helpful.

df1 is an excel spreadsheet with columns: 'LookupCode', 'Code', 'State', 'StateC', 'CountyC', 'County','Community', 'ASU Code', 'Status', 'Incits']

df1=pd.read_excel('Incits to Community Table.xlsx', dtype= 
{'CountyC':'str', 'StateC':'str', 'Community':'str'})
df1.applymap(lambda x: x.strip() if type(x) is str else x)

df1['Community']=df1['Community'].str.lower()
df1['State']=df1['State'].str.lower()
df1['County']=df1['County'].str.lower()

df1['Community']=df1['Community'].str.strip()
df1['State']=df1['State'].str.strip()
df1['County']=df1['County'].str.strip()
df1['CountyC']=df1['CountyC'].str.strip()
df1['StateC']=df1['StateC'].str.strip()
df1.Community = df1.Community.replace('\s+', ' ', regex=True)
df1.Community.apply(''.join).str.replace('[^A-Za-z\s\n\r]+', '') \
.str.split(expand=True)

df2 is a national location text file with columns: 'FEATURE_ID', 'FEATURE_NAME', 'FEATURE_CLASS', 'STATE_ALPHA','STATE_NUMERIC', 'COUNTY_NAME', 'COUNTY_NUMERIC', 'PRIMARY_LAT_DMS', 'PRIM_LONG_DMS', 'PRIM_LAT_DEC', 'PRIM_LONG_DEC', 'SOURCE_LAT_DMS', 'SOURCE_LONG_DMS', 'SOURCE_LAT_DEC', 'SOURCE_LONG_DEC', 'ELEV_IN_M','ELEV_IN_FT', 'MAP_NAME', 'DATE_CREATED', 'DATE_EDITED']

#Read in National File and place into dataframe 2
df2=pd.read_csv('NationalFile_20180801.txt', "|", dtype= 
{'COUNTY_NUMERIC':'str', 'STATE_NUMERIC':'str', 'FEATURE_NAME':'str'}

df2['FEATURE_NAME']=df2['FEATURE_NAME'].str.strip()
df2['STATE_ALPHA']=df2['STATE_ALPHA'].str.strip()
df2['COUNTY_NAME']=df2['COUNTY_NAME'].str.strip()
df2['COUNTY_NUMERIC']=df2['COUNTY_NUMERIC'].str.strip()
df2['STATE_NUMERIC']=df2['STATE_NUMERIC'].str.strip()

df2['FEATURE_NAME']=df2['FEATURE_NAME'].str.lower()
df2['STATE_ALPHA']=df2['STATE_ALPHA'].str.lower()
df2['COUNTY_NAME']=df2['COUNTY_NAME'].str.lower()
df2.FEATURE_NAME = df2.FEATURE_NAME.replace('\s+', ' ', regex=True)

df3 is a merged dataset between state and county codes in df1 and df2

df3=pd.merge(df1,df2, how='inner', left_on=['StateC', 
'CountyC'],right_on=['STATE_NUMERIC', 'COUNTY_NUMERIC'])
df3.shape

Would like to create dataset (df4) that contains the results from DF3 and a fuzzy match between the two text columns 'Community' and 'FEATURE_NAME'. I have tried different codes for the fuzzy matching and fuzzymatcher does not process (spins forever) and difflib returns an index error.

[python]df3=fuzzymatcher.fuzzy_left_join(df1, df2, left_on="Community", 
right_on="FEATURE_NAME")

df1['Community'] = df1['Community'].apply(lambda x: 
difflib.get_close_matches(x, df2['FEATURE_NAME'])[0])
df3=df1.merge(df2)

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Fuzzy Wuzzy how to get my fuzzyratio displayed	windingsareuseful	3	215	Apr-04-2024, 05:38 PM Last Post: deanhystad
	Converting a json file to a dataframe with rows and columns	eyavuz21	13	4,364	Jan-29-2023, 03:59 PM Last Post: eyavuz21
	Nested for loops: Iterating over columns of a DataFrame to plot on subplots	dm222	0	1,690	Aug-19-2022, 11:07 AM Last Post: dm222
	Apply fillna to multiple columns in dataframe	rraillon	2	2,410	Aug-05-2021, 01:11 PM Last Post: rraillon
	Open and read multiple text files and match words	kozaizsvemira	3	6,724	Jul-07-2021, 11:27 AM Last Post: Larz60+
	How to rename dataframe columns based on the content in an index?	ar_mahdavi	2	2,460	Jun-07-2021, 06:09 AM Last Post: ricslato
	How to sum across variable columns in a dataframe	rennerom	2	2,742	Jan-31-2021, 05:44 PM Last Post: rennerom
	How to split dataframe object rows to columns	Mekala	1	2,487	Nov-12-2020, 04:18 PM Last Post: michael1789
	convert list to five columns dataframe in sequence	tonycat	2	2,463	Sep-29-2020, 06:47 AM Last Post: tonycat
	How to melt dataframe multiple columns to one column	Mekala	1	2,859	Sep-24-2020, 08:32 PM Last Post: scidam

Fuzzy match on text columns within dataframe

User Panel Messages

Announcements