I have the following dataframe:
I created a code for the logic above:
Code generates expected result:
I can see that
Also I was looking for pandas
d_test = { 'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish'], 'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2] } df_test = pd.DataFrame(d_test)I want to identify similar names in
name
column if those names belong to one cluster number and create unique id for them. For example South Beach
and Beach
belong to cluster number 1
and their similarity score is pretty high. So we associate it with unique id, say 1
. Next cluster is number 2
and three entities from name
column belong to this cluster: Dog
, Big Dog
, Cat
, 'Fish' and 'Dry Fish'. Dog
and Big Dog
have high similarity score and their unique id will be, say 2
. For Cat
unique id will be, say 3
. Finally for 'Fish' and 'Dry Fish' unique id will be, say 4
. And so on.I created a code for the logic above:
# pip install thefuzz from thefuzz import fuzz df_test = df_test.sort_values(['cluster_number', 'name']) df_test.reset_index(drop=True, inplace=True) df_test['id'] = 0 i = 1 for index, row in df_test.iterrows(): row_ = row index_ = index while index_ < len(df_test) and df_test.loc[index, 'cluster_number'] == df_test.loc[index_, 'cluster_number'] and df_test.loc[index_, 'id'] == 0: if row['name'] == df_test.loc[index_, 'name'] or fuzz.ratio(row['name'], df_test.loc[index_, 'name']) > 50: df_test.loc[index_,'id'] = i is_i_used = True index_ += 1 if is_i_used == True: i += 1 is_i_used = False
Code generates expected result:
Output: name cluster_number id
0 Beach 1 1
1 South Beach 1 1
2 Big Dog 2 2
3 Cat 2 3
4 Dog 2 2
5 Dry Fish 2 4
6 Fish 2 4
7 Ant 3 5
8 Bird 3 6
9 Dear 4 7
Computation runs for 210 seconds for dataframe with 1 million rows where in average each cluster has about 10 rows and max cluster size is about 200 rows. I am trying to understand how to vectorize the code.I can see that
thefuzz
module has process
function and it allows to process data at once:from thefuzz import process out = process.extract("Beach", df_test['name'], limit=len(df_test))But I don't see if it can help with speeding up the code.
Also I was looking for pandas
.apply()
method but I do not see how I may use it for such complicated algorithm with for
and while
statements.