Make unique id in vectorized way based on text data column with similarity scoring

ill8 · (This post was last modified: Dec-12-2022, 03:23 AM by ill8.)

I have the following dataframe:

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2]
}
df_test = pd.DataFrame(d_test)

I want to identify similar names in name column if those names belong to one cluster number and create unique id for them. For example South Beach and Beach belong to cluster number 1 and their similarity score is pretty high. So we associate it with unique id, say 1. Next cluster is number 2 and three entities from name column belong to this cluster: Dog, Big Dog, Cat, 'Fish' and 'Dry Fish'. Dog and Big Dog have high similarity score and their unique id will be, say 2. For Cat unique id will be, say 3. Finally for 'Fish' and 'Dry Fish' unique id will be, say 4. And so on.

I created a code for the logic above:

 # pip install thefuzz
 from thefuzz import fuzz
 
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

df_test['id'] = 0

i = 1
for index, row in df_test.iterrows():
    row_ = row
    index_ = index
    
    while index_ < len(df_test) and df_test.loc[index, 'cluster_number'] == df_test.loc[index_, 'cluster_number'] and df_test.loc[index_, 'id'] == 0:
        if row['name'] == df_test.loc[index_, 'name'] or fuzz.ratio(row['name'], df_test.loc[index_, 'name']) > 50:
            df_test.loc[index_,'id'] = i
            is_i_used = True
        index_ += 1
        
    if is_i_used == True:
        i += 1
        is_i_used = False

Code generates expected result:

Output:	name	cluster_number	id
0	Beach		1	1
1	South Beach	1	1
2	Big Dog		2	2
3	Cat		2	3
4	Dog		2	2
5	Dry Fish	2	4
6	Fish		2	4
7	Ant		3	5
8	Bird		3	6
9	Dear		4	7

Computation runs for 210 seconds for dataframe with 1 million rows where in average each cluster has about 10 rows and max cluster size is about 200 rows. I am trying to understand how to vectorize the code.

I can see that thefuzz module has process function and it allows to process data at once:

    from thefuzz import process
    out = process.extract("Beach", df_test['name'], limit=len(df_test))

But I don't see if it can help with speeding up the code.

Also I was looking for pandas .apply() method but I do not see how I may use it for such complicated algorithm with for and while statements.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Create dataframe from the unique data of two dataframes	Calab	6	850	Mar-02-2025, 01:51 PM Last Post: Pedroski55
	Supervised learning, tree based model - problems splitting data	Pixel	0	1,271	May-16-2023, 05:25 PM Last Post: Pixel
	Grouping Data based on 30% bracket	purnima1	0	1,386	Feb-16-2023, 07:14 PM Last Post: purnima1
	how to expand each unique value in another column and fill zero if no match	SriRajesh	0	1,355	Jul-10-2022, 09:21 AM Last Post: SriRajesh
	New Dataframe Column Based on Several Conditions	nb1214	1	2,494	Nov-16-2021, 10:52 PM Last Post: jefsummers
	pandas: Compute the % of the unique values in a column	JaneTan	1	2,399	Oct-25-2021, 07:55 PM Last Post: jefsummers
	What if a column has about 90% of data as outliers?	Asahavey17	1	2,366	Aug-23-2021, 04:55 PM Last Post: jefsummers
	Pandas Data frame column condition check based on length of the value	aditi06	1	3,709	Jul-28-2021, 11:08 AM Last Post: jefsummers
	Adding a new column to a Panda Data Frame	rsherry8	2	2,937	Jun-06-2021, 06:49 PM Last Post: jefsummers
	[Solved] How to refer to dataframe column name based on a list	lorensa74	1	2,984	May-17-2021, 07:02 AM Last Post: lorensa74

Make unique id in vectorized way based on text data column with similarity scoring

User Panel Messages

Announcements