String comparison in a csv file in Python Pandas

fleafy · Nov-18-2022, 08:05 PM

I am currently working on a huge CSV file in which I need to select a row and compare it with every other row. The comparison should return how much of the first string is in the second string. For example, if the first string is "Card" and the second string is "Credit Card Debit Card", it should return 2. The find_similar function is meant to do that, but it doesn't work the way I want it to. What I currently have is this:

def find_similar(a,b):
    return a & b


def similarity_test(a):
    strword1 = set(dff['Product'].str.split().iloc[a])
    for j in range(0,100):
        strword2 = set(dff['Product'].str.split().iloc[j])
        lenofWord1 = len(strword1)
        lenofWord2 = len(strword2)
        samewords = find_similar(strword1,strword2)
        samelen = len(samewords)
        if samelen == 0:
            print("Not alike")
        else:
            # Compare against the larger of the two word sets.
            length = max(lenofWord1, lenofWord2)
            print("%", (samelen / length) * 100)

data = int(input("Which index should be tested:"))
similarity_test(data)

It works well when the strings are 100% similar or when no word of string1 repeats in string2. What it should do is shown in the picture, but it gives me 12.5% instead of 25%. Any help on how I can solve this? I've included everything I use in the code except the dataframe. Thanks in advance. [Image: E3m4J.png]

**deanhystad** · (This post was last modified: Nov-19-2022, 05:47 AM by deanhystad.)

Every element in a set is unique, so a set of Card, Card, Cardy, Card only contains 1 card.
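
To make the point concrete, here is a minimal sketch (the sample string is made up for illustration) of how converting a word list to a set discards the repeat counts, while the list itself keeps them:

```python
words = "Card Card Cardy Card".split()

# A list keeps every occurrence, so repeated words can still be counted.
print(len(words))            # 4
print(words.count("Card"))   # 3

# A set keeps each distinct word once, so the repeat count is lost.
unique_words = set(words)
print(len(unique_words))     # 2
print(sorted(unique_words))  # ['Card', 'Cardy']
```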

I don't understand your example. I think the correct answer is 25%.

The only word that appears in both sentences is "credit".
There are 12 words, 3 of which are "credit". 3/12 * 100 = 25%
There are 8 words in the first sentence, two of which are "credit". 2/8 * 100 = 25%
There are 4 words in the 2nd sentence, one of which is "credit". 1/4 * 100 = 25%

How about this example:
1. This is a game I like
2. Are you game to play a game

Total words = 13
Count of words that appear in both sentences: a:2, game:3
match = 5/13 = 38%

If that is correct, I would write this using a counting dictionary (collections.Counter).

from collections import Counter

def similarity_test(a, b):
    a = a.lower().split()
    b = b.lower().split()
    word_count = len(a) + len(b)

    a = Counter(a)
    b = Counter(b)
    match_count = 0
    for match in set(a) & set(b):
        match_count += (a[match] + b[match])
    return match_count / word_count


a = input("Line 1: ")
b = input("Line 2: ")
print(similarity_test(a, b) * 100)

Another way to look at this is to count the number of distinct words that appear in both sentences and compare that to the number of distinct words that appear in either sentence. This can be done with sets.

def similarity_test(a, b):
    a = set(a.lower().split())
    b = set(b.lower().split())
    return len(a & b) / len(a | b)


a = input("Line 1: ")
b = input("Line 2: ")
print(similarity_test(a, b) * 100)

When I run your example in this test I get 11.11%. There is 1 common word (credit) and there are 9 distinct words across both sentences.
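
To score one row against every other row, as in the original question, something like the following sketch might work. It assumes the rows are available as an iterable of strings (for example the dff['Product'] column from the question, with no missing values). The names counted_similarity and compare_to_all and the sample product names are hypothetical; the counting score is repeated here under a new name only so the block runs on its own.

```python
from collections import Counter


def counted_similarity(a, b):
    """Share of all words (repeats included) that are words found in both strings."""
    a_words = a.lower().split()
    b_words = b.lower().split()
    total = len(a_words) + len(b_words)
    if total == 0:
        return 0.0  # two empty strings: nothing to compare
    a_counts = Counter(a_words)
    b_counts = Counter(b_words)
    shared = set(a_counts) & set(b_counts)
    return sum(a_counts[w] + b_counts[w] for w in shared) / total


def compare_to_all(rows, index):
    """Score the row at position `index` against every row.

    Returns a list of (row position, percent) pairs.
    """
    rows = list(rows)
    target = rows[index]
    return [(i, counted_similarity(target, row) * 100) for i, row in enumerate(rows)]


# Hypothetical sample data standing in for the 'Product' column.
products = ["Credit card", "Credit card debit card", "Mortgage"]
for position, percent in compare_to_all(products, 0):
    print(position, f"{percent:.1f}%")
```

Returning the scores instead of printing them inside the loop makes it easier to sort or filter the results afterwards.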

fleafy · Nov-18-2022, 09:38 PM

(Nov-18-2022, 08:57 PM)deanhystad Wrote: Every element in a set is unique, so a set of Card, Card, Cardy, Card only contains 1 card.

Thank you for the reply! So the thing I really want is to test how many times a word appears in a sentence. How can I do that? By using lists?
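
One possible way to count repeats is sketched below, assuming words are separated by whitespace and case should be ignored: split the second string into a list and tally it with collections.Counter (a plain list.count call would also work). The function name count_occurrences is made up for this example.

```python
from collections import Counter


def count_occurrences(first, second):
    """Count how often the words of `first` occur in `second`, repeats included."""
    second_counts = Counter(second.lower().split())
    # Look at each distinct word of `first` once; Counter returns 0 for missing words.
    return sum(second_counts[word] for word in set(first.lower().split()))


print(count_occurrences("Card", "Credit Card Debit Card"))  # 2
```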
