Python Forum

String comparison in a csv file in Python Pandas
I am currently working on a huge CSV file in which I need to select a row and compare it with every other row. The comparison should tell me how many times the words of the first string appear in the second string. For example, if the first string is "Card" and the second string is "Credit Card Debit Card", it should return 2. The find_similar function is meant to do that, but it doesn't work the way I want it to. What I currently have is this:
def find_similar(a,b):
    return a & b


def similarity_test(a):
    strword1 = set(dff['Product'].str.split().iloc[a])
    for j in range(0,100):
        strword2 = set(dff['Product'].str.split().iloc[j])
        lenofWord1 = len(strword1)
        lenofWord2 = len(strword2)
        samewords = find_similar(strword1,strword2)
        samelen = len(samewords)
        if samelen == 0:
            print("Not alike")
        else:
            if lenofWord1 > lenofWord2:
                length = lenofWord1
                print("%", (samelen / length) * 100)
            elif lenofWord1 < lenofWord2:
                length = lenofWord2
                print("%", (samelen / length) * 100)
            elif lenofWord1 == lenofWord2:
                length = lenofWord1
                print("%", (samelen / length) * 100)

data = int(input("Which index should be tested:"))
similarity_test(data)
It works well when the strings are 100% similar or when no word of string1 is repeated in string2. What it should do is shown in the picture, but it gives me 12.5% instead of 25%. Any help on how I can solve this? I've included everything I use in the code except the dataframe. Thanks in advance. [Image: E3m4J.png]
Every element in a set is unique, so a set built from Card, Card, Cardy, Card contains Card only once.
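To make that concrete, here is a minimal sketch using the "Credit Card Debit Card" string from the question. The variable names are only for illustration. It shows that converting the word list to a set drops the repeated word, while the list keeps it.

```python
# Split the example string from the question into words.
words = "Credit Card Debit Card".split()

# A set keeps each distinct word only once, so the second "Card" is lost.
unique_words = set(words)

print(len(words))           # 4 words in the list
print(len(unique_words))    # 3 distinct words in the set
print(words.count("Card"))  # 2, because the list still has both copies
```

This is why len(a & b) can never report a repeated word more than once.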

I don't understand your example. I think the correct answer is 25%.

The only word that appears in both sentences is "credit".
There are 12 words in total, 3 of which are "credit". 3/12 * 100 = 25%
There are 8 words in the first sentence, two of which are "credit". 2/8 * 100 = 25%
There are 4 words in the 2nd sentence, one of which is "credit". 1/4 * 100 = 25%

How about this example:
1. This is a game I like
2. Are you game to play a game

Total words = 13
Count of words that appear in both sentences: a:2, game:3
match = 5/13 = 38%

If that is correct, I would write this using a counting dictionary (collections.Counter).
from collections import Counter

def similarity_test(a, b):
    a = a.lower().split()
    b = b.lower().split()
    word_count = len(a) + len(b)

    a = Counter(a)
    b = Counter(b)
    match_count = 0
    for match in set(a) & set(b):
        match_count += (a[match] + b[match])
    return match_count / word_count


a = input("Line 1: ")
b = input("Line 2: ")
print(similarity_test(a, b) * 100)
Another way to look at this is to count the number of distinct words that appear in both sentences and compare that to the number of distinct words that appear in either sentence (a ratio known as the Jaccard index). This can be done with sets.
def similarity_test(a, b):
    a = set(a.lower().split())
    b = set(b.lower().split())
    return len(a & b) / len(a | b)


a = input("Line 1: ")
b = input("Line 2: ")
print(similarity_test(a, b) * 100)
When I run your example through this test I get 11.11%. There is 1 common word (credit) and there are 9 unique words.
(Nov-18-2022, 08:57 PM)deanhystad Wrote: Every element in a set is unique, so a set of Card, Card, Cardy, Card only contains 1 card.

Thank you for the reply! What I really want is to count how many times a word appears in a sentence. How can I do that? By using lists?
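One possible way to do this, sketched under a few assumptions: words are separated by whitespace, the comparison ignores case, and each distinct word of the first string is looked up once in the second string. The function name count_occurrences is hypothetical and not part of the code posted above. It relies on collections.Counter, which returns 0 for words it has not seen.

```python
from collections import Counter


def count_occurrences(first, second):
    """Count how often the words of `first` occur in `second` (hypothetical helper)."""
    # Count every word of the second string, keeping duplicates.
    second_counts = Counter(second.lower().split())
    # Look up each distinct word of the first string; missing words count as 0.
    return sum(second_counts[word] for word in set(first.lower().split()))


print(count_occurrences("Card", "Credit Card Debit Card"))  # 2
```

A plain list works too: "Credit Card Debit Card".split().count("Card") also gives 2, but a Counter avoids rescanning the list for every word.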