Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Brute Forcing Anagrams
#1
Greetings,

I am part the way through my Online Python course. I started learning Python for one very specific use-case and I thought maybe it would be smart to ask the forums about my use-case so I can focus my Python attentions more strategically.

My use case is this...

I have a scrambled code made up of 300 English letters. It takes the form of a Word Search puzzle, so it's format is 20 columns by 15 rows. However, it is more than just a simple wordsearch. Hidden in this 300 letters are presumed to be various anagrams that produce an English message.

Now, given the number of characters we're working with, it is very easy to find LOTS of words in each row of letters or column of letters.

However, I believe the words I am looking for will be specific . So, I have spent time building myself a 'Word-List'. Right now it has approximately 1,000 words on it.

Here is what I want to code... I want to be able to tell my code that [it] can arrange each row of letters (15 rows total) of 20 letters each(20 columns total) in any order that it wants - in an attempt to match a word on my Word-List. Check horizontal rows for the word, then check vertical columns for the word. IF a match is found, make a note. ELSE, no match is found, move on to the next word on the word list.

So it would ultimately check the horizontal lines h1-h15, then check the vertical lines v1-v20. Once all rows and all columns have been checked for that word, move on to the next word on the list.

I believe my list will continue to grow, but shouldn't exceed more than a few thousand words on my dictionary word-list. Again, these are specific words for specific topics that I believe are relevant to the puzzle I'm solving.

So since I am learning Python, how should I go about coding this, and what 'Python-coding-topics' should I focus my attentions on.

Appreciate you all for reading this far and look forward to your Yoda guidance!

Thank You.
Reply
#2
[Image: 1Qb1300]
Here is an example of my scrambled word search. As an example, one of the words on my word-list is QUARTER.
So while examining horizontal line 13, it would find a match for that word, make a note, and continue on.
Hopefully this task makes sense as I could really use some guidance about coding this.
Reply
#3
You are probably not the first person to ever have wanted to do this. There are, as I see it, 2 stages:

1. Making
2. Solving

If you grasp the making, the solving will come easier, a kind of reverse-engineering of the creation. That said, resolving the words you found into sensible English may be tricky. I don't think word-search puzzles usually contain a readable text, unless you have a sensible criterion for the word order.

A very quick search found this page about how to create such puzzles. There are probably many more examples out there.
Reply
#4
(Jan-07-2025, 10:38 PM)Anorak Wrote: Hidden in this 300 letters are presumed to be various anagrams that produce an English message.

However, I believe the words I am looking for will be specific . So, I have spent time building myself a 'Word-List'. Right now it has approximately 1,000 words on it.

Here is what I want to code... I want to be able to tell my code that [it] can arrange each row of letters (15 rows total) of 20 letters each(20 columns total) in any order that it wants - in an attempt to match a word on my Word-List. Check horizontal rows for the word, then check vertical columns for the word. IF a match is found, make a note. ELSE, no match is found, move on to the next word on the word list.

Too much ambiguity. If you can't describe problem it's very seldom get solved. Conditions like "Presumed to be various anagrams", "I believe the words I am looking for will be specific" are very hard to put into any code.

Take your time and write what you want to achieve. If it's defined only then it's time to start to think how to do it. Based on information provided I can assume that one example could be:

- I want to find anagrams of words in my words dictionary
- I want find them from rows and columns of matrix
- Anagrams of words can be overlapping
- No check for anagrams in diagonals
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#5
Maybe this helps. I'm unsure if is_anagram is right. Use sorted to sort the words by length. Then use itertools.groupby to group words with equal length. Then make the product of it and with repeat=2 you get two items of the product. Of course this can take very long, if the wordlist is too long. I think the complexity is quadratic.

from pathlib import Path
from itertools import groupby, product


def is_anagram(word_a, word_b):
    """
    Should return True, if the word_a is an anagram of word_b.
    """
    # strip word_a and word_b and convert to lowercase 
    word_a = word_a.strip().lower()
    word_b = word_b.strip().lower()

    # compare word length and if all characters are the same set
    # if this is not the case, it isn't an anagram
    if len(word_a) != len(word_b) or set(word_a) != set(word_b):
        return False
    
    # distance must be 2, if it's an anagram
    distance = 0

    # checking char by char and if they are different, then add 1 to difference
    for char_a, char_b in zip(word_a, word_b):
        if char_a != char_b:
            distance += 1

            # early exit to save time
            if distance > 2:
                return False
    
    return distance == 2


def get_words(file):
    with open(file) as fd:
        # stripping whitespaces and newline
        # iterating line by line over the file
        # sort the words by length
        by_len = sorted(map(str.strip, fd), key=len)

        # this part takes very long, if you have a big dictionary
        for group, items in groupby(by_len, key=len):
            for word_a, word_b in product(items, repeat=2):
                if is_anagram(word_a, word_b):
                    yield (word_a, word_b)



# german collection of words
file = Path.home().joinpath("Desktop", "wörter.txt")


for anagram in get_words(file):
    print(anagram)
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Reply
#6
I'm going to dig into this and see if I can find useful pieces from the code your provided.

Right now my best idea is to convert each row or each column of characters into a length count. So horizontal line 1 (h1) might be read as h1= 2a, 1b, 1c, 4e, etc.. Counting each letter and removing the nulls. WOULD I SAVE THESE AS LISTS?

Then I need to utilize the same function on each next word on the Word-List, so the word QUARTER might be read as word_1 = 1a, 1e, 1q, 2r, 1t, 1u.

Then I need this to be compared to each line (FOR EACH?) h1-h15, and v1-v20. Looking for any lines that can accommodate the word_1 character count in full.

If my word_1 has 2 R's, and the v3 line for example, has 3 R's, I obviously want to consider that a match, So I'm assuming I need to tell it 'equal to or greater than' when comparing word_1 letter counts to each subsequent line.

Sorry to the replies that said I'm being too vague. Hopefully this additional explanation is more clear what my initial game plan is looking like.
Reply
#7
As a dictionary {'a': 1, 'e': 1, 'q': 1, 'r': 2, 't': 1, 'u': 1}

You could have additional information:
{'a': 1, 'e': 1, 'q': 1, 'r': 2, 't': 1, 'u': 1, 'len': 7, 'word': 'quarter'}
Anorak likes this post
Reply
#8
(Jan-08-2025, 05:02 PM)Anorak Wrote: Looking for any lines that can accommodate the word_1 character count in full.
/.../
Sorry to the replies that said I'm being too vague. Hopefully this additional explanation is more clear what my initial game plan is looking like.

Does it mean that "anagrams" are not from consecutive characters and you look for them in the characters of the whole row/column?

I decided to find "classic anagrams" (i.e. sequence of continuous characters) with brute force. For that:

- I grabbed initial data
- downloaded list of english words from Github (containing 370 104 words)
- made indices starting from length of two (i.e ignoring one letter words) for row length of 20 characters for making slices
- made slices using indices and made sorted strings out of them (for comparison later)
- read data from file to dictionary where key is string with sorted characters and values are words which can be made from key
- iterated over strings made from slices and and added values from dictionary if matched with key

Actual code has probably less characters than the description of it. Didn't do any tests so no guarantees that result is correct. Just to demonstrate one possible brute force approach (only for rows):

data = ("qvucalilenineteenbma", "cretehtreaoucsifevle", "oroundhnimihstoliaro",
       "uhtronroeocehsrardwo", "noubtdeloeltlntretaw", "tetatseonyneoaolineo",
       "pmuehttsewodahreeret", "eonskaeptoooilpasiec", "lesdbnooraooevlewtsr",
       "entufivetntncoiujtso", "vrniemaklawescodtrid", "eadtteyossquareeerht",
       "nmtopleceohetquarryo", "nsiosmixzfdqbccasliy", "apsorezngqzpsapphire")

indices = [(start, end) for start in range(20) for end in range(start+2, 21)]

slices = [''.join(sorted(row[begin:stop])) for row in data for (begin, stop) in indices]

mapping = dict()

with open("words_alpha.txt", "r") as f:
    for row in f:
        word = row.strip()
        value = ''.join(sorted(word))
        mapping.setdefault(value, set()).add(word) 

texts = [mapping[item] for item in slices if item in mapping]
anagrams = set(word for text in texts for word in text)

# found 1407 words which can be made from consecutive characters in rows
# took around 1 sec on my machine
# longest word is 14 characters - heterochronous - meaning occurring at different times or stages
# words longer than 7 characters below, BTW meaning of first word is "mineral consisting of a vermiculit" :-); 
 
lennilite 
chaetiferous
haliotis
outreach
unbolted
nineteen
doubleton
complete
outraces
reheater
cordites
etherate
junction
papisher
sapphire
alkamine
horsecar
cosharer
feracious
heterochronous
lotharios
rheotron
decorist
isolator
outreaches
drawhorse
historial
doblones
rechoose
racehorse
bradoons
homilist
sectroid
othinism
ostiolar
trawlnet
chresard
ostentate
cotunnite
doubletone
leninite
outbleed
leucaniline
shoreward
readhere
isoapiole
contemple
shadower
outsearch
heterodera
towheads
andorobo
emplecton
bandolero
Anorak likes this post
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#9
(Jan-09-2025, 06:09 PM)perfringo Wrote:
(Jan-08-2025, 05:02 PM)Anorak Wrote: Looking for any lines that can accommodate the word_1 character count in full.
/.../
Sorry to the replies that said I'm being too vague. Hopefully this additional explanation is more clear what my initial game plan is looking like.

Does it mean that "anagrams" are not from consecutive characters and you look for them in the characters of the whole row/column?

I decided to find "classic anagrams" (i.e. sequence of continuous characters) with brute force. For that:

- I grabbed initial data
- downloaded list of english words from Github (containing 370 104 words)
- made indices starting from length of two (i.e ignoring one letter words) for row length of 20 characters for making slices
- made slices using indices and made sorted strings out of them (for comparison later)
- read data from file to dictionary where key is string with sorted characters and values are words which can be made from key
- iterated over strings made from slices and and added values from dictionary if matched with key

Actual code has probably less characters than the description of it. Didn't do any tests so no guarantees that result is correct. Just to demonstrate one possible brute force approach (only for rows):

data = ("qvucalilenineteenbma", "cretehtreaoucsifevle", "oroundhnimihstoliaro",
       "uhtronroeocehsrardwo", "noubtdeloeltlntretaw", "tetatseonyneoaolineo",
       "pmuehttsewodahreeret", "eonskaeptoooilpasiec", "lesdbnooraooevlewtsr",
       "entufivetntncoiujtso", "vrniemaklawescodtrid", "eadtteyossquareeerht",
       "nmtopleceohetquarryo", "nsiosmixzfdqbccasliy", "apsorezngqzpsapphire")

indices = [(start, end) for start in range(20) for end in range(start+2, 21)]

slices = [''.join(sorted(row[begin:stop])) for row in data for (begin, stop) in indices]

mapping = dict()

with open("words_alpha.txt", "r") as f:
    for row in f:
        word = row.strip()
        value = ''.join(sorted(word))
        mapping.setdefault(value, set()).add(word) 

texts = [mapping[item] for item in slices if item in mapping]
anagrams = set(word for text in texts for word in text)

# found 1407 words which can be made from consecutive characters in rows
# took around 1 sec on my machine
# longest word is 14 characters - heterochronous - meaning occurring at different times or stages
# words longer than 7 characters below, BTW meaning of first word is "mineral consisting of a vermiculit" :-); 
 
lennilite 
chaetiferous
haliotis
outreach
unbolted
nineteen
doubleton
complete
outraces
reheater
cordites
etherate
junction
papisher
sapphire
alkamine
horsecar
cosharer
feracious
heterochronous
lotharios
rheotron
decorist
isolator
outreaches
drawhorse
historial
doblones
rechoose
racehorse
bradoons
homilist
sectroid
othinism
ostiolar
trawlnet
chresard
ostentate
cotunnite
doubletone
leninite
outbleed
leucaniline
shoreward
readhere
isoapiole
contemple
shadower
outsearch
heterodera
towheads
andorobo
emplecton
bandolero

Where are the matches being output? I ran this code on pycharm, but I can't find a list of matches like those you've provided. I'm sure I'm missing something... When I run it Pycham simply returns "Process finished with exit code 0" which I think just means it completed its task.
Reply
#10
(Jan-09-2025, 09:26 PM)Anorak Wrote: Where are the matches being output?

There is no print in this code so nothing is outputted. However, anagrams found by this code are stored in helpfully named anagrams. You can print them out or do whatever you want with them.
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Need an alternative to brute force optimization loop jmbonni 5 3,103 Jun-05-2024, 02:21 AM
Last Post: stevejobb
  Forcing matplotlib to NOT use scientific notation when graphing sawtooth500 4 5,015 Mar-25-2024, 03:00 AM
Last Post: sawtooth500
  Solving an equation by brute force within a range alexfrol86 3 4,183 Aug-09-2022, 09:44 AM
Last Post: Gribouillis
Question Numeric Anagrams - Count Occurances monty024 2 2,226 Nov-13-2021, 05:05 PM
Last Post: monty024
  How to use scipy.optimization.brute for multivariable function Shiladitya 9 8,844 Oct-28-2020, 10:40 PM
Last Post: scidam
  I need advise with developing a brute forcing script fatjuicypython 11 8,390 Aug-21-2020, 09:20 PM
Last Post: Marbelous
  Forcing input from pre-defined list. scotty501 11 8,182 Jun-18-2019, 01:49 PM
Last Post: scotty501
  Why doesn't gc delete an object without forcing a garbage collection call? AlekseyPython 5 4,973 Mar-19-2019, 02:10 AM
Last Post: micseydel
  Password Brute Force 2skywalkers 9 7,385 Oct-18-2018, 02:35 PM
Last Post: buran
  Brute Force Password Guesser 2skywalkers 1 3,921 Oct-05-2018, 08:04 PM
Last Post: ichabod801

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020