Python Forum
String comparison in a csv file in Python Pandas
#1
I am working on a large CSV file in which I need to select one row and compare it with every other row. The comparison should tell me how much of the first string appears in the second string. For example, if the first string is "Card" and the second string is "Credit Card Debit Card", it should return 2. The find_similar function is meant to do that, but it doesn't work the way I want. This is what I currently have:
def find_similar(a,b):
    return a & b


def similarity_test(a):
    strword1 = set(dff['Product'].str.split().iloc[a])
    for j in range(0,100):
        strword2 = set(dff['Product'].str.split().iloc[j])
        lenofWord1 = len(strword1)
        lenofWord2 = len(strword2)
        samewords = find_similar(strword1,strword2)
        samelen = len(samewords)
        if samelen == 0:
            print("Not alike")
        else:
            if lenofWord1 > lenofWord2:
                length = lenofWord1
                print("%", (samelen / length) * 100)
            elif lenofWord1 < lenofWord2:
                length = lenofWord2
                print("%", (samelen / length) * 100)
            elif lenofWord1 == lenofWord2:
                length = lenofWord1
                print("%", (samelen / length) * 100)

data = int(input("Which index should be tested:"))
similarity_test(data)
It works well when the strings are 100% similar or when no word of string1 is repeated in string2. What it should do is shown in the picture, but it gives me 12.5% instead of 25%. Any help on how I can solve this? I've included everything I use in the code except the dataframe. Thanks in advance.
[Image: E3m4J.png]
#2
Every element in a set is unique, so a set built from Card, Card, Cardy, Card contains "Card" only once.
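
A minimal sketch of that point, using the four words from the sentence above as made-up sample data: the list keeps every occurrence, while the set collapses the repeats.

```python
# Sample words taken from the sentence above (illustrative data only).
words = ["Card", "Card", "Cardy", "Card"]

unique_words = set(words)       # duplicates collapse: {"Card", "Cardy"}

print(len(words))               # 4 words in the list
print(len(unique_words))        # 2 distinct words in the set
print(words.count("Card"))      # 3 occurrences are still visible in the list
```

This is why taking len() of a set intersection can never report that a word matched more than once.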

I don't understand your example. I think the correct answer is 25%.

The only word that appears in both sentences is "credit".
There are 12 words in total, 3 of which are "credit": 3/12 * 100 = 25%
There are 8 words in the first sentence, 2 of which are "credit": 2/8 * 100 = 25%
There are 4 words in the second sentence, 1 of which is "credit": 1/4 * 100 = 25%

How about this example:
1. This is a game I like
2. Are you game to play a game

Total words = 13
Count of words that appear in both sentences: a:2, game:3
match = 5/13 = 38%

If that is correct, I would write this using a Counter (a counting dictionary) from the collections module.
from collections import Counter

def similarity_test(a, b):
    a = a.lower().split()
    b = b.lower().split()
    word_count = len(a) + len(b)

    a = Counter(a)
    b = Counter(b)
    match_count = 0
    for match in set(a) & set(b):
        match_count += (a[match] + b[match])
    return match_count / word_count


a = input("Line 1: ")
b = input("Line 2: ")
print(similarity_test(a, b) * 100)
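
As a sanity check, the counting approach can be run on the two "game" sentences from the worked example instead of on typed input. This sketch repeats the same logic with the input() calls removed; the name counted_similarity is only chosen here for illustration.

```python
from collections import Counter

def counted_similarity(a, b):
    """Share of all words (both sentences) that belong to a shared word."""
    a_words = a.lower().split()
    b_words = b.lower().split()
    word_count = len(a_words) + len(b_words)

    a_counts = Counter(a_words)
    b_counts = Counter(b_words)
    # Add up every occurrence, in either sentence, of each shared word.
    match_count = sum(a_counts[w] + b_counts[w] for w in set(a_counts) & set(b_counts))
    return match_count / word_count

score = counted_similarity("This is a game I like", "Are you game to play a game")
print(round(score * 100, 2))  # 38.46, i.e. 5 matching occurrences out of 13 words
```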
Another way to look at this is to count the number of distinct words that appear in both sentences and compare that to the number of distinct words that appear in either sentence. This can be done with sets.
def similarity_test(a, b):
    a = set(a.lower().split())
    b = set(b.lower().split())
    return len(a & b) / len(a | b)


a = input("Line 1: ")
b = input("Line 2: ")
print(similarity_test(a, b) * 100)
When I run your example through this test I get 11.11%: there is 1 common word (credit) and 9 unique words.
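
For comparison, here is a sketch of the same set-based ratio applied to the two "game" sentences from earlier. The name set_similarity is only used here for illustration, and the guard for two empty strings is an added assumption, not part of the function above.

```python
def set_similarity(a, b):
    """Distinct shared words divided by distinct words overall."""
    a_set = set(a.lower().split())
    b_set = set(b.lower().split())
    union = a_set | b_set
    if not union:          # assumption: treat two empty strings as 0 similarity
        return 0.0
    return len(a_set & b_set) / len(union)

score = set_similarity("This is a game I like", "Are you game to play a game")
print(round(score * 100, 2))  # 20.0: 2 shared words ("a", "game") out of 10 distinct words
```

Because repeats are discarded, the second "game" in sentence 2 has no effect on this result, whereas it raises the counted version to 38%.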
#3
(Nov-18-2022, 08:57 PM)deanhystad Wrote: Every element in a set is unique, so a set of Card, Card, Cardy, Card only contains 1 card.

Thank you for the reply! What I really want is to count how many times a word appears in a sentence. How can I do that? By using lists?
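
One possible way to get such a count, sketched under the assumptions that words are separated by whitespace and that matching should ignore case. The helper name count_occurrences is hypothetical and does not come from the thread.

```python
from collections import Counter

def count_occurrences(first, second):
    """Count how often the words of `first` occur in `second`."""
    second_counts = Counter(second.lower().split())
    # Look up each distinct word of `first`; a Counter returns 0 for missing words.
    return sum(second_counts[word] for word in set(first.lower().split()))

print(count_occurrences("Card", "Credit Card Debit Card"))  # 2
```

A plain list works as well for a single word, since "Credit Card Debit Card".split().count("Card") also gives 2; the Counter only avoids rescanning the list when the first string has several words.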