Python Forum
Fuzzy match on text columns within dataframe
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Fuzzy match on text columns within dataframe
#1
I have two datasets df1 and df2, then I did a merge on StateC and CountyC (df1) to STATE_NUMERIC and COUNTY_NUMERIC (df2) for the end result of df3. These four columns are state and county codes. I know my code is not elegant in any way.

Within df3 there are 30 columns that are included which is what I want. I am wanting to do a fuzzy logic match/merge on two columns: Community and FEATURE_NAME. These two columns are text columns that correspond to locations in the United States and I would like a fuzzy match or merge because there may be slight differences between the text. When I do a merge many locations are excluded. If you know of a way that I can do a fuzzy logic match that would be extremely helpful.

df1 is an excel spreadsheet with columns: 'LookupCode', 'Code', 'State', 'StateC', 'CountyC', 'County','Community', 'ASU Code', 'Status', 'Incits']

df1=pd.read_excel('Incits to Community Table.xlsx', dtype= 
{'CountyC':'str', 'StateC':'str', 'Community':'str'})
df1.applymap(lambda x: x.strip() if type(x) is str else x)

df1['Community']=df1['Community'].str.lower()
df1['State']=df1['State'].str.lower()
df1['County']=df1['County'].str.lower()

df1['Community']=df1['Community'].str.strip()
df1['State']=df1['State'].str.strip()
df1['County']=df1['County'].str.strip()
df1['CountyC']=df1['CountyC'].str.strip()
df1['StateC']=df1['StateC'].str.strip()
df1.Community = df1.Community.replace('\s+', ' ', regex=True)
df1.Community.apply(''.join).str.replace('[^A-Za-z\s\n\r]+', '') \
.str.split(expand=True)
df2 is a national location text file with columns: 'FEATURE_ID', 'FEATURE_NAME', 'FEATURE_CLASS', 'STATE_ALPHA','STATE_NUMERIC', 'COUNTY_NAME', 'COUNTY_NUMERIC', 'PRIMARY_LAT_DMS', 'PRIM_LONG_DMS', 'PRIM_LAT_DEC', 'PRIM_LONG_DEC', 'SOURCE_LAT_DMS', 'SOURCE_LONG_DMS', 'SOURCE_LAT_DEC', 'SOURCE_LONG_DEC', 'ELEV_IN_M','ELEV_IN_FT', 'MAP_NAME', 'DATE_CREATED', 'DATE_EDITED']

#Read in National File and place into dataframe 2
df2=pd.read_csv('NationalFile_20180801.txt', "|", dtype= 
{'COUNTY_NUMERIC':'str', 'STATE_NUMERIC':'str', 'FEATURE_NAME':'str'}

df2['FEATURE_NAME']=df2['FEATURE_NAME'].str.strip()
df2['STATE_ALPHA']=df2['STATE_ALPHA'].str.strip()
df2['COUNTY_NAME']=df2['COUNTY_NAME'].str.strip()
df2['COUNTY_NUMERIC']=df2['COUNTY_NUMERIC'].str.strip()
df2['STATE_NUMERIC']=df2['STATE_NUMERIC'].str.strip()

df2['FEATURE_NAME']=df2['FEATURE_NAME'].str.lower()
df2['STATE_ALPHA']=df2['STATE_ALPHA'].str.lower()
df2['COUNTY_NAME']=df2['COUNTY_NAME'].str.lower()
df2.FEATURE_NAME = df2.FEATURE_NAME.replace('\s+', ' ', regex=True)
df3 is a merged dataset between state and county codes in df1 and df2

df3=pd.merge(df1,df2, how='inner', left_on=['StateC', 
'CountyC'],right_on=['STATE_NUMERIC', 'COUNTY_NUMERIC'])
df3.shape
Would like to create dataset (df4) that contains the results from DF3 and a fuzzy match between the two text columns 'Community' and 'FEATURE_NAME'. I have tried different codes for the fuzzy matching and fuzzymatcher does not process (spins forever) and difflib returns an index error.

[python]df3=fuzzymatcher.fuzzy_left_join(df1, df2, left_on="Community", 
right_on="FEATURE_NAME")
df1['Community'] = df1['Community'].apply(lambda x: 
difflib.get_close_matches(x, df2['FEATURE_NAME'])[0])
df3=df1.merge(df2)
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Fuzzy Wuzzy how to get my fuzzyratio displayed windingsareuseful 3 215 Apr-04-2024, 05:38 PM
Last Post: deanhystad
  Converting a json file to a dataframe with rows and columns eyavuz21 13 4,364 Jan-29-2023, 03:59 PM
Last Post: eyavuz21
  Nested for loops: Iterating over columns of a DataFrame to plot on subplots dm222 0 1,690 Aug-19-2022, 11:07 AM
Last Post: dm222
  Apply fillna to multiple columns in dataframe rraillon 2 2,410 Aug-05-2021, 01:11 PM
Last Post: rraillon
  Open and read multiple text files and match words kozaizsvemira 3 6,724 Jul-07-2021, 11:27 AM
Last Post: Larz60+
  How to rename dataframe columns based on the content in an index? ar_mahdavi 2 2,460 Jun-07-2021, 06:09 AM
Last Post: ricslato
  How to sum across variable columns in a dataframe rennerom 2 2,742 Jan-31-2021, 05:44 PM
Last Post: rennerom
  How to split dataframe object rows to columns Mekala 1 2,487 Nov-12-2020, 04:18 PM
Last Post: michael1789
  convert list to five columns dataframe in sequence tonycat 2 2,463 Sep-29-2020, 06:47 AM
Last Post: tonycat
  How to melt dataframe multiple columns to one column Mekala 1 2,859 Sep-24-2020, 08:32 PM
Last Post: scidam

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020