Python Forum
find and group similar words with re?
#1
If I have two lists of names, for example:
list1 ="Augsburg II, Turkgucu Munchen,Bayern II, Burghausen, Memmingen, Wurzburger Kickers, Ansbach, Buchbach,Aschaffenburg, Schweinfurt ,Illertissen, Bamberg,Schalding, Bayreuth ,Aubstadt, Furth II ,Vilzing, Nurnberg II "

list2 ="Augsburg II,Turkgucu Munich,Bayern Munich II,Wacker Burghausen,Memmingen ,Kickers Würzburg,Aubstadt ,SpVgg Greuther Furth II,FV Illertissen, Eintracht Bamberg 2010,Schalding-Heining Passau, SpVgg Bayreuth,SpVgg Ansbach,TSV Buchbach,Viktoria Aschaffenburg , 1. FC Schweinfurt,Vilzing , 1. FC Norimberga II"

Is there an re function that outputs the names that are similar but not identical, like this?

list3 ="Turkgucu Munchen = Turkgucu Munich, Bayern II =Bayern Munich II, Wurzburger Kickers= Kickers Würzburg ... and so on "

I was looking at these functions:
re.search(pattern, string, flags=0)
re.search(pattern, sequence).group()
#2
The re module cannot do that. You could perhaps find specialized modules on PyPI that help, such as textdistance (untested).
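For a quick experiment without installing anything, the standard library's difflib can give a rough similarity score. This is only a sketch of the idea, not the textdistance module mentioned above; the helper name `similarity` and the 0-100 scaling are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a rough 0-100 similarity score for two strings (hypothetical helper)."""
    # ratio() is 2 * matches / (len(a) + len(b)), a float between 0.0 and 1.0
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(similarity("Turkgucu Munchen", "Turkgucu Munich"))
print(similarity("Turkgucu Munchen", "Wacker Burghausen"))
```

Identical strings score 100, and near-duplicates score noticeably higher than unrelated names, which is the property a regular expression cannot provide.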
#3
A library similar to the one Gribouillis posted is TheFuzz (earlier called fuzzywuzzy).
A quick test:
>>> from thefuzz import fuzz
>>> list1 = ["Augsburg II", "Turkgucu Munchen", "Bayern II"]
>>> list2 = ["Augburg II", "Turkgucu Munich", "Baye II"]
>>> fuzz.ratio(list1[0], list2[0])
95
>>> fuzz.ratio(list1[1], list2[1])
90
>>> fuzz.ratio(list1[2], list2[2])
88
Then you can decide what ratio is good enough to count as similar; let's say you choose 90.
from thefuzz import fuzz

list1 = ["Augsburg II", "Turkgucu Munchen", "Bayern II"]
list2 = ["Augburg II", "Turkgucu Munich", "Baye II"]


list3 = []
for l1, l2 in zip(list1, list2):
    if fuzz.ratio(l1, l2) >= 90:
        #print(f'{l1} = {l2}')
        list3.append(f'{l1} = {l2}')

print(list3)
Output:
['Augsburg II = Augburg II', 'Turkgucu Munchen = Turkgucu Munich']
#4
from thefuzz import fuzz
 
list1 = ["Augsburg II","Turkgucu Munchen","Bayern II","Burghausen","Memmingen","Wurzburger Kickers","Ansbach","Buchbach","Aschaffenburg","Schweinfurt","Illertissen","Bamberg","Schalding"," Bayreuth","Aubstadt","Furth II","Vilzing","Nurnberg II"]
list2 = ["Augsburg II","Turkgucu Munich","Bayern Munich II","Wacker Burghausen","Memmingen ","Kickers Würzburg","Aubstadt","SpVgg Greuther Furth II","FV Illertissen","Eintracht Bamberg 2010","Schalding-Heining Passau","SpVgg Bayreuth","SpVgg Ansbach","TSV Buchbach","Viktoria Aschaffenburg","1. FC Schweinfurt 05","Vilzing","1. FC Norimberga II"]
 
 
list3 = []
for l1, l2 in zip(list1, list2):
    if fuzz.ratio(l1, l2) >= 35:
        #print(f'{l1} = {l2}')
        list3.append(f'{l1} = {l2}')
 
print(list3)
Output:
'Ansbach = Aubstadt', 'Buchbach = SpVgg Greuther Furth II', 'Aschaffenburg = FV Illertissen', 'Schweinfurt = Eintracht Bamberg 2010', 'Illertissen = Schalding-Heining Passau', 'Bamberg = SpVgg Bayreuth', 'Schalding = SpVgg Ansbach', ' Bayreuth = TSV Buchbach', 'Aubstadt = Viktoria Aschaffenburg', 'Furth II = 1. FC Schweinfurt 05'
Can something be done to get better output? With a high threshold (>= 75) I get no pairs at all, while if I decrease it to 35 it fails for these ones!
#5
There are lots of matches with a score > 75. My guess is that you are not comparing the list1 and list2 from your example, but different lists that are not ordered like the ones in your example. You probably cannot use zip, because zip only pairs items that sit at the same position; instead you need to compare every word in list1 to every word in list2. Like this:
import random
from thefuzz import fuzz
# list1 and list2 as defined in post #4
random.shuffle(list2)
scores = [(fuzz.ratio(w1, w2), w1, w2) for w2 in list2 for w1 in list1]
print(*sorted(scores, reverse=True)[:30], sep="\n")
Output:
(100, 'Vilzing', 'Vilzing')
(100, 'Augsburg II', 'Augsburg II')
(100, 'Aubstadt', 'Aubstadt')
(95, 'Memmingen', 'Memmingen ')
(90, 'Turkgucu Munchen', 'Turkgucu Munich')
(88, 'Illertissen', 'FV Illertissen')
(80, 'Buchbach', 'TSV Buchbach')
(78, ' Bayreuth', 'SpVgg Bayreuth')
(74, 'Burghausen', 'Wacker Burghausen')
(74, 'Aschaffenburg', 'Viktoria Aschaffenburg')
(72, 'Bayern II', 'Bayern Munich II')
(71, 'Schweinfurt', '1. FC Schweinfurt 05')
(70, 'Ansbach', 'SpVgg Ansbach')
(64, 'Nurnberg II', 'Augsburg II')
I saved the match score along with the words and sorted the list. Do something like this to help you set your threshold value.
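Once a threshold is chosen, the all-pairs scores can be reduced to one partner per name. The following is a minimal sketch of that step, assuming you only want pairs that are similar but not identical. It uses difflib from the standard library as a stand-in scorer so it runs without extra installs; fuzz.ratio from thefuzz could be dropped in instead. The names `score` and `best_matches` and the default threshold of 70 are made up for this example.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """0-100 similarity score (stand-in for fuzz.ratio in this sketch)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_matches(names1, names2, threshold=70):
    """For each name in names1, keep its highest-scoring partner in names2."""
    pairs = []
    for n1 in names1:
        best = max(names2, key=lambda n2: score(n1, n2))
        s = score(n1, best)
        # Keep pairs that are similar (>= threshold) but not identical (< 100)
        if threshold <= s < 100:
            pairs.append((s, n1, best))
    return sorted(pairs, reverse=True)


list1 = ["Augsburg II", "Turkgucu Munchen", "Buchbach", "Vilzing"]
list2 = ["TSV Buchbach", "Vilzing", "Turkgucu Munich", "Augsburg II"]
for s, n1, n2 in best_matches(list1, list2):
    print(f"{n1} = {n2} ({s})")
```

Because every name is compared against the whole second list, the order of the two lists no longer matters. A name whose best partner still scores below the threshold is simply left out.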

Matching words like this will never be perfect.
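Some of the imperfection comes from the data itself: in the output above, 'Memmingen' and 'Memmingen ' score 95 instead of 100 only because of a trailing space, and list2 contains 'Würzburg' where list1 has no umlaut. A small clean-up step before scoring may help. This is a sketch under those assumptions; `normalize` is a hypothetical helper, not part of thefuzz.

```python
import unicodedata


def normalize(name: str) -> str:
    """Strip surrounding whitespace, fold case, and drop accents (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", name.strip().casefold())
    # Drop the combining marks, keeping only the base letters
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize("Memmingen "))
print(normalize("Kickers Würzburg"))
```

Score the normalized strings, but keep the original names around for display.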