Python Forum

Full Version: Find closest string pattern
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
I need a way to identify the closest matching list based on the string based on the sequence of words.

for example.
my_list = ['one', 'two', 'three' , 'four']

option_one = ['two', 'three', 'one', 'four']
option_two = ['four', 'three', 'two', 'one']

The closest sequence of the two options to my_list is option_one. The order of the positions are not the same as my_list but they are closer than option_two.
edit: to clarify, I'm looking for the closest match reading left to right in sequence. Though it doesn't equal my_list, it closer that option_two.

And that's about where my brain stops. I'm not sure of an elegant way to solve this. I'm not looking for a coded solution, but if anyone has an idea on how to go about solving it, I'm all ears.

Thanks!
"closest matching list" seems undefined here. Is this homework, where you have thorough instructions, or is the requirement coming from somewhere else?
How do you know option_one is closer? You can't put that into code before you can put it into words.
(Dec-12-2018, 07:37 PM)nilamo Wrote: [ -> ]How do you know option_one is closer? You can't put that into code before you can put it into words.

Though the string values don't line up exactly with my_list, the sequence is closer reading from left to right than option two.
Every element of option_two is only one element away from my_list. option_one's "one" is two elements away. So why isn't option_two the closer match?
(Dec-12-2018, 07:09 PM)micseydel Wrote: [ -> ]"closest matching list" seems undefined here. Is this homework, where you have thorough instructions, or is the requirement coming from somewhere else?
nah, not homework. I was listening to a lecture about NLP and the difference between identifying words in a string, identifying the sequence of words in a string and intent. It just had me wondering how to find the closest matching sentence based on matching word order from left to right.
ie.
my_string = "I that the dead beat cat"

option_one = "Is the cat dead"
option_two = "I beat the dead cat"

both contain some/all of the words as my_string. But one is closer to matching my_string on the sequence of words from left to right.
You could use a metric similar to the edit distance, ie the minimal number of operations required to transform one list into another.
(Dec-12-2018, 07:52 PM)nilamo Wrote: [ -> ]Every element of option_two is only one element away from my_list. option_one's "one" is two elements away. So why isn't option_two the closer match?

thank you for that perspective. I added an edit to help clarify that.

my_list = red, red, red, blue, black
option1 = blue, red
option2 = red, black

option2 shares the closest order as my_list reading left to right.
Using the normalized compression distance, option 1 is better in the first example and option 2 is better in the second example
from zlib import compress

def ncd(a, b):
    """Normalized compression distance between two byte strings"""
    u =len(compress(a))
    v =len(compress(b))
    u, v =min(u, v), max (u, v)
    w =len(compress(a + b))
    return float (w - u) / v 

def l2b(alist):
    return ",".join(alist).encode()

def option_distances(alist, *options):
    b = l2b(alist)
    return [(ncd(b, l2b(option)), option)
            for option in options]


if __name__ == '__main__':
    my_list = ['one', 'two', 'three' , 'four']
    option_one = ['two', 'three', 'one', 'four']
    option_two = ['four', 'three', 'two', 'one']
    
    print(my_list)
    for d, option in option_distances(my_list, option_one, option_two):
        print(d, option)
    
    my_list = "red red red blue black".split()
    option1 = "blue red".split()
    option2 = "red black".split()
    
    print(my_list)
    for d, option in option_distances(my_list, option1, option2):
        print(d, option)
Output:
['one', 'two', 'three', 'four'] 0.19230769230769232 ['two', 'three', 'one', 'four'] 0.3076923076923077 ['four', 'three', 'two', 'one'] ['red', 'red', 'red', 'blue', 'black'] 0.43478260869565216 ['blue', 'red'] 0.391304347826087 ['red', 'black']
(Dec-12-2018, 08:54 PM)Gribouillis Wrote: [ -> ]Using the normalized compression distance, option 1 is better in the first example and option 2 is better in the second example
from zlib import compress

def ncd(a, b):
    """Normalized compression distance between two byte strings"""
    u =len(compress(a))
    v =len(compress(b))
    u, v =min(u, v), max (u, v)
    w =len(compress(a + b))
    return float (w - u) / v 

def l2b(alist):
    return ",".join(alist).encode()

def option_distances(alist, *options):
    b = l2b(alist)
    return [(ncd(b, l2b(option)), option)
            for option in options]


if __name__ == '__main__':
    my_list = ['one', 'two', 'three' , 'four']
    option_one = ['two', 'three', 'one', 'four']
    option_two = ['four', 'three', 'two', 'one']
    
    print(my_list)
    for d, option in option_distances(my_list, option_one, option_two):
        print(d, option)
    
    my_list = "red red red blue black".split()
    option1 = "blue red".split()
    option2 = "red black".split()
    
    print(my_list)
    for d, option in option_distances(my_list, option1, option2):
        print(d, option)
Output:
['one', 'two', 'three', 'four'] 0.19230769230769232 ['two', 'three', 'one', 'four'] 0.3076923076923077 ['four', 'three', 'two', 'one'] ['red', 'red', 'red', 'blue', 'black'] 0.43478260869565216 ['blue', 'red'] 0.391304347826087 ['red', 'black']

Perfect, thanks. I'll look at that. Appreciate the direction.