Find closest string pattern

jmair · (This post was last modified: Dec-12-2018, 08:10 PM by jmair.)

I need a way to identify which of several lists most closely matches a given list, based on the sequence of the words in it.

For example:
my_list = ['one', 'two', 'three', 'four']

option_one = ['two', 'three', 'one', 'four']
option_two = ['four', 'three', 'two', 'one']

The closest sequence of the two options to my_list is option_one. The order of its elements is not the same as in my_list, but it is closer than option_two's.
Edit: to clarify, I'm looking for the closest match reading left to right in sequence. Though option_one doesn't equal my_list, it is closer than option_two.

And that's about where my brain stops. I'm not sure of an elegant way to solve this. I'm not looking for a coded solution, but if anyone has an idea on how to go about solving it, I'm all ears.

Thanks!

***micseydel*** · Dec-12-2018, 07:09 PM

"closest matching list" seems undefined here. Is this homework, where you have thorough instructions, or is the requirement coming from somewhere else?

**nilamo** · Dec-12-2018, 07:37 PM

How do you know option_one is closer? You can't put that into code before you can put it into words.

jmair · Dec-12-2018, 07:49 PM

(Dec-12-2018, 07:37 PM)nilamo Wrote: How do you know option_one is closer? You can't put that into code before you can put it into words.

Though the string values don't line up exactly with my_list, the sequence of option_one is closer, reading from left to right, than that of option_two.

**nilamo** · Dec-12-2018, 07:52 PM

Every element of option_two is only one element away from my_list. option_one's "one" is two elements away. So why isn't option_two the closer match?

jmair · Dec-12-2018, 08:05 PM

(Dec-12-2018, 07:09 PM)micseydel Wrote: "closest matching list" seems undefined here. Is this homework, where you have thorough instructions, or is the requirement coming from somewhere else?

Nah, not homework. I was listening to a lecture about NLP and the difference between identifying words in a string, identifying the sequence of words in a string, and intent. It just had me wondering how to find the closest matching sentence based on matching word order from left to right.
For example:
my_string = "I that the dead beat cat"

option_one = "Is the cat dead"
option_two = "I beat the dead cat"

Both contain some or all of the words in my_string, but one is closer to matching my_string in the sequence of words from left to right.

**Gribouillis** · Dec-12-2018, 08:11 PM

You could use a metric similar to the edit distance, i.e. the minimal number of operations required to transform one list into another.
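
A minimal sketch of that suggestion, assuming a word-level Levenshtein distance in which inserting, deleting, or substituting one list element each cost 1. The function name edit_distance is hypothetical and the cost model is an assumption; the thread does not specify either.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two lists (assumed unit costs)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty list takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


my_list = ['one', 'two', 'three', 'four']
option_one = ['two', 'three', 'one', 'four']
option_two = ['four', 'three', 'two', 'one']

print(edit_distance(my_list, option_one))  # 2
print(edit_distance(my_list, option_two))  # 4
```

Under these assumed costs the smaller value marks the closer list, so option_one comes out ahead of option_two for this example.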

jmair · Dec-12-2018, 08:14 PM

(Dec-12-2018, 07:52 PM)nilamo Wrote: Every element of option_two is only one element away from my_list. option_one's "one" is two elements away. So why isn't option_two the closer match?

Thank you for that perspective. I added an edit to help clarify that. Here is another example:

my_list = ['red', 'red', 'red', 'blue', 'black']
option1 = ['blue', 'red']
option2 = ['red', 'black']

option2 is the closest to the order of my_list, reading left to right.
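
One possible way to put "closest order reading left to right" into numbers is the length of the longest common subsequence: the most elements that appear in both lists in the same relative order. This is only a sketch under that assumption; lcs_length is a hypothetical helper, not something defined in the thread.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j]
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


colors = ['red', 'red', 'red', 'blue', 'black']
print(lcs_length(colors, ['blue', 'red']))   # 1
print(lcs_length(colors, ['red', 'black']))  # 2

numbers = ['one', 'two', 'three', 'four']
print(lcs_length(numbers, ['two', 'three', 'one', 'four']))  # 3
print(lcs_length(numbers, ['four', 'three', 'two', 'one']))  # 1
```

Here a larger value means more of the order is preserved, which agrees with the picks in both examples: option2 for the colors and option_one for the numbers.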

**Gribouillis** · (This post was last modified: Dec-12-2018, 08:54 PM by Gribouillis.)

Using the normalized compression distance, option 1 is better in the first example and option 2 is better in the second example:

from zlib import compress

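def ncd(a, b):
    """Normalized compression distance between two byte strings"""
    u = len(compress(a))
    v = len(compress(b))
    # Order the two compressed sizes so that u is the smaller and v the larger
    u, v = min(u, v), max(u, v)
    w = len(compress(a + b))  # compressed size of the concatenation
    return float(w - u) / v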

def l2b(alist):
    return ",".join(alist).encode()

def option_distances(alist, *options):
    b = l2b(alist)
    return [(ncd(b, l2b(option)), option)
            for option in options]


if __name__ == '__main__':
    my_list = ['one', 'two', 'three', 'four']
    option_one = ['two', 'three', 'one', 'four']
    option_two = ['four', 'three', 'two', 'one']
    
    print(my_list)
    for d, option in option_distances(my_list, option_one, option_two):
        print(d, option)
    
    my_list = "red red red blue black".split()
    option1 = "blue red".split()
    option2 = "red black".split()
    
    print(my_list)
    for d, option in option_distances(my_list, option1, option2):
        print(d, option)

Output:
['one', 'two', 'three', 'four']
0.19230769230769232 ['two', 'three', 'one', 'four']
0.3076923076923077 ['four', 'three', 'two', 'one']
['red', 'red', 'red', 'blue', 'black']
0.43478260869565216 ['blue', 'red']
0.391304347826087 ['red', 'black']

jmair · Dec-12-2018, 09:17 PM

(Dec-12-2018, 08:54 PM)Gribouillis Wrote: Using the normalized compression distance, option 1 is better in the first example and option 2 is better in the second example

Perfect, thanks. I'll look at that. Appreciate the direction.
