find and group similar words with re?

cartonics · Oct-27-2023, 06:55 AM

If I have two lists of names, for example:
list1 ="Augsburg II, Turkgucu Munchen,Bayern II, Burghausen, Memmingen, Wurzburger Kickers, Ansbach, Buchbach,Aschaffenburg, Schweinfurt ,Illertissen, Bamberg,Schalding, Bayreuth ,Aubstadt, Furth II ,Vilzing, Nurnberg II "

list2 ="Augsburg II,Turkgucu Munich,Bayern Munich II,Wacker Burghausen,Memmingen ,Kickers Würzburg,Aubstadt ,SpVgg Greuther Furth II,FV Illertissen, Eintracht Bamberg 2010,Schalding-Heining Passau, SpVgg Bayreuth,SpVgg Ansbach,TSV Buchbach,Viktoria Aschaffenburg , 1. FC Schweinfurt,Vilzing , 1. FC Norimberga II"

Is there some re command that outputs the names that are similar but not equal?

list3 ="Turkgucu Munchen = Turkgucu Munich, Bayern II =Bayern Munich II, Wurzburger Kickers= Kickers Würzburg ... and so on "

I was looking at these functions:

re.search(pattern, string, flags=0)

re.search(pattern, sequence).group()

**Gribouillis** · Oct-27-2023, 07:52 AM

The re module cannot do that. You could perhaps find specialized modules on PyPI that help, such as textdistance (untested).
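As a point of comparison, the standard library's difflib module can also score how similar two strings are, without installing anything. The following is a minimal sketch, not something posted in this thread: the helper name `similarity` and the choice to lowercase and strip the names first are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0.

    Hypothetical helper: names are lowercased and stripped first so that
    case and stray spaces do not lower the score.
    """
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


# Near-identical names score high, unrelated names score low.
print(similarity("Turkgucu Munchen", "Turkgucu Munich"))
print(similarity("Memmingen", "Memmingen "))
print(similarity("Ansbach", "Aubstadt"))
```

A score like this only measures character overlap, so a name that gains a long prefix (for example "SpVgg Greuther Furth II" versus "Furth II") will still score fairly low.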

***snippsat*** · Oct-27-2023, 01:36 PM

A library similar to the one Gribouillis posted is TheFuzz (earlier called fuzzywuzzy).
A quick test:

from thefuzz import fuzz

list1 = ["Augsburg II", "Turkgucu Munchen", "Bayern II"]
list2 = ["Augburg II", "Turkgucu Munich", "Baye II"]

>>> fuzz.ratio(list1[0], list2[0])
95
>>> fuzz.ratio(list1[1], list2[1])
90
>>> fuzz.ratio(list1[2], list2[2])
88

Then you can decide what ratio is high enough to count as similar; let's say you choose 90.

from thefuzz import fuzz

list1 = ["Augsburg II", "Turkgucu Munchen", "Bayern II"]
list2 = ["Augburg II", "Turkgucu Munich", "Baye II"]


list3 = []
for l1, l2 in zip(list1, list2):
    if fuzz.ratio(l1, l2) >= 90:
        #print(f'{l1} = {l2}')
        list3.append(f'{l1} = {l2}')

print(list3)

Output:
['Augsburg II = Augburg II', 'Turkgucu Munchen = Turkgucu Munich']

cartonics · Oct-27-2023, 02:38 PM

from thefuzz import fuzz
 
list1 = ["Augsburg II","Turkgucu Munchen","Bayern II","Burghausen","Memmingen","Wurzburger Kickers","Ansbach","Buchbach","Aschaffenburg","Schweinfurt","Illertissen","Bamberg","Schalding"," Bayreuth","Aubstadt","Furth II","Vilzing","Nurnberg II"]
list2 = ["Augsburg II","Turkgucu Munich","Bayern Munich II","Wacker Burghausen","Memmingen ","Kickers Würzburg","Aubstadt","SpVgg Greuther Furth II","FV Illertissen","Eintracht Bamberg 2010","Schalding-Heining Passau","SpVgg Bayreuth","SpVgg Ansbach","TSV Buchbach","Viktoria Aschaffenburg","1. FC Schweinfurt 05","Vilzing","1. FC Norimberga II"]
 
 
list3 = []
for l1, l2 in zip(list1, list2):
    if fuzz.ratio(l1, l2) >= 35:
        #print(f'{l1} = {l2}')
        list3.append(f'{l1} = {l2}')
 
print(list3)

Output:
'Ansbach = Aubstadt', 'Buchbach = SpVgg Greuther Furth II', 'Aschaffenburg = FV Illertissen', 'Schweinfurt = Eintracht Bamberg 2010', 'Illertissen = Schalding-Heining Passau', 'Bamberg = SpVgg Bayreuth', 'Schalding = SpVgg Ansbach', ' Bayreuth = TSV Buchbach', 'Aubstadt = Viktoria Aschaffenburg', 'Furth II = 1. FC Schweinfurt 05'

Can something be done to get a better output? With a high threshold (>= 75) I get no pairs, while if I decrease it to 35 it fails for these ones!

**deanhystad** · (This post was last modified: Oct-27-2023, 05:36 PM by deanhystad.)

There are lots of matches with a score > 75. My guess is that you are not comparing list1 and list2 as posted: you have different lists, and they are not ordered like the lists in your example. You probably cannot use zip, but rather need to compare all words in list1 to all words in list2. Like this:

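import random
from thefuzz import fuzz
# list1 and list2 as defined in the previous post; shuffle so the order no longer lines up
random.shuffle(list2)
scores = [(fuzz.ratio(w1, w2), w1, w2) for w2 in list2 for w1 in list1]
print(*sorted(scores, reverse=True)[:30], sep="\n")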

Output:
(100, 'Vilzing', 'Vilzing')
(100, 'Augsburg II', 'Augsburg II')
(100, 'Aubstadt', 'Aubstadt')
(95, 'Memmingen', 'Memmingen ')
(90, 'Turkgucu Munchen', 'Turkgucu Munich')
(88, 'Illertissen', 'FV Illertissen')
(80, 'Buchbach', 'TSV Buchbach')
(78, ' Bayreuth', 'SpVgg Bayreuth')
(74, 'Burghausen', 'Wacker Burghausen')
(74, 'Aschaffenburg', 'Viktoria Aschaffenburg')
(72, 'Bayern II', 'Bayern Munich II')
(71, 'Schweinfurt', '1. FC Schweinfurt 05')
(70, 'Ansbach', 'SpVgg Ansbach')
(64, 'Nurnberg II', 'Augsburg II')

I saved the match score along with the words and sorted the list. Do something like this to help you set your threshold value.

Matching words like this will never be perfect.
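Building on the all-pairs comparison above, one way to get output in the "name = name" form asked for in the first post is to keep only the best-scoring candidate for each name. This is a rough sketch under stated assumptions: it uses difflib from the standard library instead of thefuzz so that it runs without extra installs, the helper names `score` and `best_matches` are made up for illustration, and the threshold of 70 is just a starting value to tune against your own data.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Similarity on a 0 to 100 scale (hypothetical helper, difflib-based)."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return round(100 * ratio)


def best_matches(names, candidates, threshold=70):
    """For each name, keep its best-scoring candidate if it is similar but not equal."""
    pairs = []
    for name in names:
        best = max(candidates, key=lambda c: score(name, c))
        s = score(name, best)
        # Skip exact matches: only "similar and not equal" pairs were requested.
        if s >= threshold and name.strip() != best.strip():
            pairs.append((name.strip(), best.strip(), s))
    return pairs


# Shortened sample lists, deliberately in a different order.
list1 = ["Augsburg II", "Turkgucu Munchen", "Buchbach", "Vilzing"]
list2 = ["TSV Buchbach", "Vilzing", "Turkgucu Munich", "Augsburg II"]

for name, match, s in best_matches(list1, list2):
    print(f"{name} = {match} ({s})")
```

Nothing here stops two names from picking the same candidate, and names that differ a lot (such as "Nurnberg II" and "1. FC Norimberga II") will likely still need a manual mapping.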
