Make unique id in vectorized way based on text data column with similarity scoring
I have the following dataframe:

import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2]
}
df_test = pd.DataFrame(d_test)
I want to identify similar names in the name column when those names belong to the same cluster number, and create a unique id for them. For example, South Beach and Beach belong to cluster number 1 and their similarity score is pretty high, so we associate them with a unique id, say 1. The next cluster is number 2, and five entries from the name column belong to it: Dog, Big Dog, Cat, Fish and Dry Fish. Dog and Big Dog have a high similarity score and their unique id will be, say, 2. For Cat the unique id will be, say, 3. Finally, for Fish and Dry Fish the unique id will be, say, 4. And so on.
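For a concrete sense of "pretty high", the scores below come from thefuzz's fuzz.ratio, the same scorer my code below uses; exact numbers may vary a little between versions, but similar names score well above the 50 cutoff I use:

from thefuzz import fuzz

# pairwise similarity scores (0-100) that drive the grouping described above
print(fuzz.ratio('South Beach', 'Beach'))  # high score -> same id
print(fuzz.ratio('Dog', 'Big Dog'))        # high score -> same id
print(fuzz.ratio('Dog', 'Cat'))            # low score  -> different id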


I wrote the following code for the logic above:

# pip install thefuzz
from thefuzz import fuzz

df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

df_test['id'] = 0

i = 1
is_i_used = False  # becomes True once the current id has been assigned to a row

for index, row in df_test.iterrows():
    index_ = index

    # scan forward through the rest of the current cluster and give the
    # current id to every not-yet-assigned name similar to this row's name
    while index_ < len(df_test) and df_test.loc[index, 'cluster_number'] == df_test.loc[index_, 'cluster_number'] and df_test.loc[index_, 'id'] == 0:
        if row['name'] == df_test.loc[index_, 'name'] or fuzz.ratio(row['name'], df_test.loc[index_, 'name']) > 50:
            df_test.loc[index_, 'id'] = i
            is_i_used = True
        index_ += 1

    if is_i_used:
        i += 1
        is_i_used = False

The code generates the expected result:


Output:
          name  cluster_number  id
0        Beach               1   1
1  South Beach               1   1
2      Big Dog               2   2
3          Cat               2   3
4          Dog               2   2
5     Dry Fish               2   4
6         Fish               2   4
7          Ant               3   5
8         Bird               3   6
9         Dear               4   7
The computation takes about 210 seconds on a dataframe with 1 million rows, where on average each cluster has about 10 rows and the largest cluster has about 200 rows. I am trying to understand how to vectorize this code.

I can see that the thefuzz module has a process function that can score one query against many choices at once:
    from thefuzz import process
    out = process.extract("Beach", df_test['name'], limit=len(df_test))
But I don't see how it can help speed up the code.
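The most I can picture is replacing the inner while loop with one extract call per name against the names of its cluster, roughly like the sketch below (I am assuming extract accepts a plain list of choices and a scorer argument, and returns (name, score) pairs for a list):

from thefuzz import fuzz, process

# rough sketch: score 'Beach' against every name in its cluster in one call,
# keeping the same fuzz.ratio scorer and 50 cutoff as the loop above
cluster_names = df_test.loc[df_test['cluster_number'] == 1, 'name'].tolist()
matches = process.extract('Beach', cluster_names,
                          scorer=fuzz.ratio, limit=len(cluster_names))
similar = [name for name, score in matches if score > 50]

But the outer loop over names and clusters would still be plain Python.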

Also, I was looking at the pandas .apply() method, but I do not see how to use it for such a complicated algorithm with for and while loops.
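The closest I can get is pushing the per-cluster work into groupby().apply(), roughly like the sketch below, but the matching inside each group is still a plain Python double loop, so it does not look vectorized to me:

from thefuzz import fuzz

def assign_group_ids(group):
    # greedy rule similar to the loop above, but local to one cluster:
    # the first unassigned name opens a new local id, and later names
    # scoring > 50 against it get the same local id
    ids = [0] * len(group)
    names = group['name'].tolist()
    local_id = 0
    for i, name_i in enumerate(names):
        if ids[i] == 0:
            local_id += 1
            ids[i] = local_id
            for j in range(i + 1, len(names)):
                if ids[j] == 0 and fuzz.ratio(name_i, names[j]) > 50:
                    ids[j] = local_id
    group = group.copy()
    group['local_id'] = ids
    return group

out = df_test.groupby('cluster_number', group_keys=False).apply(assign_group_ids)
# local_id would still need to be combined with cluster_number to get a
# globally unique id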