Python Forum

How do I improve string similarity in my current code?
Here’s my current code:
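import pandas as pd
from fuzzywuzzy import fuzz  # assumption: fuzz comes from the fuzzywuzzy package

new_list = []

# Compare every Amazon title against every Google title.
for i in range(len(title1)):
    for j in range(len(title2)):
        title_distance = fuzz.token_sort_ratio(title1[i], title2[j])
        # Keep the ID pair when the similarity score exceeds the threshold.
        if title_distance > threshold:
            new_list.append([amazon_s['idAmazon'][i], google_s['idGoogleBase'][j]])

# Write the matched ID pairs to a CSV file.
df = pd.DataFrame(new_list)
df.to_csv('task.csv')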
title1 is a list of product titles from the Amazon website.
title2 is a list of product titles from the Google website.

id1 is the ID of the corresponding Amazon product (amazon_s['idAmazon'] in the code).
id2 is the ID of the corresponding Google product (google_s['idGoogleBase'] in the code).

Both sites have a list of product titles, and I want to match them in two cases:
Case 1: The titles are the same
Case 2: The titles are similar

After matching them,
I would like to output the results to an Excel file using pandas,
containing id1 and id2 whenever Case 1 or Case 2 is satisfied.

Any thoughts on this problem?
You will need to import the pandas module first, like this: import pandas as pd
(May-27-2020, 08:50 AM) Calli wrote: You will need to import the pandas module first, like this: import pandas as pd

It's not the code that I need to improve, but the algorithm for solving this problem.
Right now:

Recall: 0.72307 out of 1
Precision: 0.77049 out of 1

recall = tp/(tp+fn)
precision = tp/(tp+fp)

True Positive Count: 94
False Positive Count: 28
False Negative Count: 36
True Negative Count: 22042
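
For anyone checking the numbers, the two figures follow directly from the counts listed above. This is just a minimal sketch in plain Python; the variable names are mine and not taken from the original script:

```python
# Confusion-matrix counts as reported in the post above.
tp, fp, fn, tn = 94, 28, 36, 22042

recall = tp / (tp + fn)      # share of the true matches that were found
precision = tp / (tp + fp)   # share of the reported matches that are correct

print(f"Recall:    {recall:.5f}")
print(f"Precision: {precision:.5f}")
```

Note that formatting with :.5f rounds, so the recall prints with a last digit one higher than the truncated 0.72307 quoted above; the underlying value is the same.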
Nice description of the fuzz functions here:

https://stackoverflow.com/questions/3180...-2-strings
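
As a rough illustration of the idea behind a token-sort comparison, here is a sketch that uses only the standard library. It is not the library's implementation: token_sort_similarity is a hypothetical helper, difflib.SequenceMatcher is a stand-in for the library's scorer, and the real fuzz.token_sort_ratio may normalize the text differently, so the scores will not match exactly. The product titles are made up for the example.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> int:
    """Approximate a token-sort score: 0 (nothing shared) to 100 (identical)."""
    # Lowercase, split on whitespace, and sort the tokens so word order is ignored.
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


# Hypothetical titles, for illustration only.
same_words = token_sort_similarity("Adobe Photoshop Elements 6",
                                   "photoshop elements 6 adobe")
different = token_sort_similarity("Adobe Photoshop Elements 6",
                                  "Microsoft Office 2007")
print(same_words, different)
```

Because the tokens are sorted before comparing, two titles that contain the same words in a different order get the maximum score, which is why this kind of scorer suits product titles where the word order varies between sites.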