Efficiently find words

sumithar · (This post was last modified: Feb-07-2017, 04:36 PM by snippsat.)

This is my problem statement. American states are identified by 2 letter abbreviations. There are 50 such. Find the combination of abbreviations that form valid English words.
Find valid 4 letter, 6 letter, 8 letter and 10 letter words. No abbreviation should be used more than once in a word.

Below is the code I used for finding 10 letter words. The others are all very similar. The script ran pretty quick for 4 and 6 letter words, took almost 40 minutes for 8 letter words and for 10 letter words I let it run for 2 hours and it still didn't complete.

words.txt is a file I downloaded from here

svnwebdotfreebsddotorgslashcsrgslashshareslashdictslashwordsquestionmarkrevision=61569ampersandview=co

Would appreciate your insight on how to speed up. I think it's almost all CPU and not much of i/o- it is executing the 2 if's 50^5 times, right?

        
          
          
              
              states1 = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]
 
states2 = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]
states3 = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]
states4 = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]
states5 = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]
 
 
wordFile = open('words.txt')
temp = wordFile.read().splitlines()
outputFile = open('outfile10LetterWords.txt', 'w')
 
for state1 in states1:
    for state2 in states2:
        for state3 in states3:
            for state4 in states4:
                for state5 in states5:
                    if (state5 != state4 and state5 != state3 and state5 !=state2 and state5 != state1   
                        and state4 != state3 and state4 !=state2 and state4 != state1
                        and state3 != state2 and state3 !=state1
                        and state2 != state1):
                        word = state1 + state2 + state3 + state4 + state5
                        if word.lower() in temp:
                            outputFile.write(word + '\n')
outputFile.close()

            

        
      

**buran** · (This post was last modified: Feb-07-2017, 04:48 PM by buran.)

to generate 10 letter words (254,251,200):

        
              import itertools
 
states =["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA",
         "HI","ID","IL", "IN","IA","KS","KY","LA","ME","MD",
         "MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ",
         "NM","NY","NC","ND","OH","OK","OR","PA","RI","SC",
         "SD","TN","TX", "UT","VT","VA","WA","WV","WI","WY"]
 
for item in itertools.permutations(states,5):
    print ''.join(item)

However I would take different approach - assuming the words in the list file are much less - I will take the 10-letter words from the list, split it in 2-letters and check that (i) they don't repeat, (ii) all of them are present in the states list

sumithar · Feb-07-2017, 07:16 PM

(Feb-07-2017, 04:41 PM)buran Wrote: to generate 10 letter words (254,251,200):

1
2
3
4
5
6
7
8
9
10

import itertools

states =["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA",
         "HI","ID","IL", "IN","IA","KS","KY","LA","ME","MD",
         "MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ",
         "NM","NY","NC","ND","OH","OK","OR","PA","RI","SC",
         "SD","TN","TX", "UT","VT","VA","WA","WV","WI","WY"]

for item in itertools.permutations(states,5):
    print ''.join(item)

However I would take different approach - assuming the words in the list file are much less - I will take the 10-letter words from the list, split it in 2-letters and check that (i) they don't repeat, (ii) all of them are present in the states list

I had no idea of itertools but that code while simpler didn't perform all that well but that different approach worked like a charm- it runs in fractions of a second now.

**buran** · Feb-07-2017, 07:34 PM

(Feb-07-2017, 07:16 PM)sumithar Wrote: that code while simpler didn't perform all that well

well - the itertools is optimized and provides functions for creating iterators for efficient looping. I don't see how it could be less efficient than your code. my snippets runs 15 mins to generate all permutations.

***Ofnuts*** · Feb-07-2017, 10:15 PM

In an English words list I have there are 161972 10-letter words. There are 50!/45!=254251200 permutations of 5 states names among 50. So its is faster to check that words in the list are made of state names than to check all permutations against the word list.

sumithar · Feb-08-2017, 01:29 AM

(Feb-07-2017, 07:34 PM)buran Wrote:
(Feb-07-2017, 07:16 PM)sumithar Wrote: that code while simpler didn't perform all that well
well - the itertools is optimized and provides functions for creating iterators for efficient looping. I don't see how it could be less efficient than your code. my snippets runs 15 mins to generate all permutations.

Sorry, didn't meant to say it was not as good or worse than my original effort! Just that it was still taking time. But as Ofnuts clarifies, your 2nd suggestion where I check to see if the 10 letter words in the words file (which in my case happens to be a mere 2000 or so) are made up of the same letters as the states is waaay more efficient. That's basically 2000 comparisons as it were.
My original effort and your modified approach using itertools were still doing 46x47x48x49x50 comparisons weren't they?

Efficiently find words

User Panel Messages

Announcements