Given a list of words, some words are double words, such as "cheesecloth". Your program should output every double word it encounters that is made of two words from that same list. The double words must be output in the same order as they appear in the list, which may not be sorted order. So "cheesecloth" is a double word only if the list also contains "cheese" and "cloth".
Is 'foofoo' a double word if 'foo' is in the list?
(Dec-10-2016, 06:03 AM)ichabod801 Wrote: Is 'foofoo' a double word if 'foo' is in the list?
Good question.
I would say that fits the definition.
No sample list was given, nor any indication of its size; this works with my small sample list.
import itertools

words = ('board', 'shotglass', 'cheese', 'glass', 'cabin', 'cloth', 'in',
         'cup', 'cheesecloth', 'shot', 'cab', 'cupboard')

# Join every ordered pair of words; keep the joins that are themselves in the list.
double_words = []
for word in itertools.permutations(words, 2):
    new_word = ''.join(word)
    if new_word in words:
        double_words.append(new_word)

# Report matches in the original list order.
for word in words:
    if word in double_words:
        print('{} is a double word'.format(word))
Output:
shotglass is a double word
cabin is a double word
cheesecloth is a double word
cupboard is a double word
Mine is similar to Yoriz's, but using sets:
import itertools

def double_words(word_list):
    word_set = set(word_list)  # set for O(1) membership tests
    # Every ordered pair whose concatenation is itself a word is a double word.
    doubles = set(''.join(pair) for pair in itertools.permutations(word_list, 2)
                  if ''.join(pair) in word_set)
    return [word for word in word_list if word in doubles]
It works on Yoriz's word list. I ran it on the full word list I normally use. It ran quite a while and found 15,876 double words.
Now, what if the word list is a huge file that you can only read one word at a time? What if memory cannot hold the whole list in any form? Can you handle that case with a fixed number of passes? Can you make it scale as O(number_of_words)?
(Dec-10-2016, 02:20 PM)ichabod801 Wrote: Mine is similar to Yoriz's, but using sets:

def double_words(word_list):
    word_set = set(word_list)
    doubles = set(''.join(pair) for pair in itertools.permutations(word_list, 2)
                  if ''.join(pair) in word_set)
    return [word for word in word_list if word in doubles]

It works on Yoriz's word list. I ran it on the full word list I normally use. It ran quite a while and found 15,876 double words.
No time for code, but:
- build a set of all the words
- scan the list again, and for each word iterate over x, checking whether word[:x] is in the set and, if so, whether word[x:] is also in the set
No need for the N**2 set. Complexity is O(N*L), where L is the average half-length, but since that is practically a constant for large sets of random words, the result must be O(N).
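A minimal sketch of that split-based check (my reading of Ofnuts' outline, not his code; the word list used below is just Yoriz's sample):

def find_double_words(words):
    """Return double words in list order by trying every prefix/suffix split."""
    word_set = set(words)
    result = []
    for word in words:
        # Both halves must be non-empty words from the same list.
        if any(word[:x] in word_set and word[x:] in word_set
               for x in range(1, len(word))):
            result.append(word)
    return result

print(find_double_words(['board', 'shotglass', 'cheese', 'glass', 'cabin', 'cloth',
                         'in', 'cup', 'cheesecloth', 'shot', 'cab', 'cupboard']))
# -> ['shotglass', 'cabin', 'cheesecloth', 'cupboard']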
What I am looking for is how to handle cases where reading the dictionary file into a dict, list, or set runs out of memory. Sure, it is easy if you can have a set of all the words in memory. I tried it, writing some quick code, and it goes very fast. The memory efficiency is just a personal interest; these days memory is cheap. My laptop has 16GB and my server 64GB, which is way more than enough to do this, even with a dozen VMs running (none running Windows).
There are some possible solutions here.
Personally, I would have tended toward Ofnuts' solution. With the memory constraint, things get much trickier. I don't believe it is possible in linear time without sufficient memory, and I'm even more confident it's impossible if you add a single-pass constraint; trying to do our best would require further clarification.
Assuming we can write to disk, I'd
- use some external sorting method on tuples of the form (the_word, its_line_number) to produce a new file sorted_words.txt
- iterate through that, keeping track of the "previous" word(s) which are a prefix of the current line; a prefix candidate can be dropped as soon as the current iterated word doesn't have it as a prefix
- while iterating, generate a list of suffixes (tuples of the form (suffix, line_number)) which I would dump to a file, suffixes.txt, if memory is not enough for it
- sort that suffixes list (externally if needed, in-memory if possible)
- iterate through the suffix list[1] and sorted_words.txt in parallel, but not with zip, advancing the one which has the "older" word and storing line_number whenever a suffix is seen to be in the sorted words (a merge-style walk; see the sketch after this list). Same deal with the list of line numbers: store in memory if possible, go to disk when needed, and ultimately sort it
- again, do the parallel iteration thing with the input list and line numbers, printing out the "double words"
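A rough sketch of just that merge-style parallel walk, assuming both inputs are already sorted by word and arrive as (word, line_number) tuples (the names and layout are my own, taken from the outline above rather than from working code):

def merge_scan(sorted_suffixes, sorted_words):
    """Yield the line numbers of suffixes that also occur as whole words.
    Both iterables must be sorted by their first element."""
    suffixes = iter(sorted_suffixes)
    words = iter(sorted_words)
    suffix = next(suffixes, None)
    word = next(words, None)
    while suffix is not None and word is not None:
        if suffix[0] < word[0]:      # suffix stream is behind: advance it
            suffix = next(suffixes, None)
        elif suffix[0] > word[0]:    # word stream is behind: advance it
            word = next(words, None)
        else:                        # the suffix is a real word: remember its origin
            yield suffix[1]
            suffix = next(suffixes, None)

Because each stream only ever moves forward, the walk reads both files contiguously, which fits the disk-access assumptions listed below.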
Here are some of my assumptions:
- Disk accesses are expensive
- Disk writes cost more than disk reads
- Accessing a file on disk is cheaper when done contiguously
- Python or the OS, when iterating through a file, will do smart things about reading ahead and such, to take advantage of the assumptions above; if not, I'd have to write something to do that myself
There are probably better ways; a Bloom filter might be used, for example, but this is how I would initially think to solve it... it certainly wouldn't be trivial to code up, especially with each little improvement. You could also try to squeeze memory further: if you know the input is ASCII, for example, you could reduce memory usage by not storing the unused eighth bit of each byte.
I might be over-complicating things here, but I think (not having done any math on it) that this would use fewer disk accesses than something like building an on-disk set for Ofnuts' solution.
[1] If the suffixes fit in memory, since I was adding them to a list, I'd sort it in reverse order and pop() from the end, thereby decreasing memory use while iterating. If it was in a file, I'd already be using minimal memory, so I wouldn't do anything special there.
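For the Bloom filter idea mentioned above, here is a minimal hand-rolled sketch (the sizes and hash scheme are illustrative assumptions, not part of the plan): it answers membership queries in a small, fixed amount of memory at the cost of occasional false positives, so any hit would still need to be confirmed against the real word list.

import hashlib

class BloomFilter:
    """Fixed-size bit array probed with several salted hashes.
    Membership tests can return false positives, never false negatives."""
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256('{}:{}'.format(i, item).encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

Filling it with every word in one pass over the file, then testing word[:x] / word[x:] candidates against it, would cut down the number of candidates that need an exact, disk-backed check.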