Posts: 4,560
Threads: 1,464
Joined: Sep 2016
Dec-10-2016, 05:57 AM
(This post was last modified: Dec-10-2016, 05:57 AM by Skaperen.)
given a list of words, some words are double words, such as "cheesecloth". your program is to output every double word it encounters that is made of two words from that same list. the double words are to be output in the same order as they appear in the list, which may not be sorted order. so cheesecloth is a double word only if that list also contains cheese and cloth.
Tradition is peer pressure from dead people
What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Posts: 4,229
Threads: 97
Joined: Sep 2016
Is 'foofoo' a double word if 'foo' is in the list?
Posts: 4,560
Threads: 1,464
Joined: Sep 2016
(Dec-10-2016, 06:03 AM)ichabod801 Wrote: Is 'foofoo' a double word if 'foo' is in the list?
good question. i would say that fits the definition.
Posts: 2,164
Threads: 35
Joined: Sep 2016
No sample list was given, nor any indication of its size; this works with my small sample list.
import itertools

words = ('board', 'shotglass', 'cheese', 'glass', 'cabin', 'cloth', 'in',
         'cup', 'cheesecloth', 'shot', 'cab', 'cupboard')
double_words = []
for pair in itertools.permutations(words, 2):
    new_word = ''.join(pair)
    if new_word in words:
        double_words.append(new_word)
for word in words:
    if word in double_words:
        print('{} is a double word'.format(word))
Output:
shotglass is a double word
cabin is a double word
cheesecloth is a double word
cupboard is a double word
Posts: 4,229
Threads: 97
Joined: Sep 2016
Mine is similar to Yoriz's, but using sets:
import itertools

def double_words(word_list):
    word_set = set(word_list)
    doubles = set(''.join(pair)
                  for pair in itertools.permutations(word_list, 2)
                  if ''.join(pair) in word_set)
    return [word for word in word_list if word in doubles]
It works on Yoriz's word list. I ran it on the full word list I normally use. It ran quite a while and found 15,876 double words.
Posts: 4,560
Threads: 1,464
Joined: Sep 2016
now, what if the word list is a huge file you can only read one word at a time? what if memory cannot hold the whole list in any form? can you handle that case with a fixed number of passes? can you make it scale O(number_of_words)?
Posts: 687
Threads: 37
Joined: Sep 2016
(Dec-10-2016, 02:20 PM)ichabod801 Wrote: Mine is similar to Yoriz's, but using sets:
import itertools

def double_words(word_list):
    word_set = set(word_list)
    doubles = set(''.join(pair)
                  for pair in itertools.permutations(word_list, 2)
                  if ''.join(pair) in word_set)
    return [word for word in word_list if word in doubles]
It works on Yoriz's word list. I ran it on the full word list I normally use. It ran quite a while and found 15,876 double words.
No time for code, but:
- build a set of all the words
- scan the list again, and for each word iterate over split positions x to see if word[:x] is in the set, and if so whether word[x:] is also in the set.
No need for the N**2 set. Complexity is O(N*L) where L is the average word length, but since that is practically a constant for large sets of random words, the result is effectively O(N).
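That split-scan approach might be sketched like this (a minimal version, reusing Yoriz's sample list from above):

```python
def find_double_words(words):
    """Return double words in their original order by trying every
    split point of each word against a set of all the words."""
    word_set = set(words)
    doubles = []
    for word in words:
        # both halves must be non-empty, so x runs from 1 to len(word) - 1
        for x in range(1, len(word)):
            if word[:x] in word_set and word[x:] in word_set:
                doubles.append(word)
                break
    return doubles

words = ['board', 'shotglass', 'cheese', 'glass', 'cabin', 'cloth', 'in',
         'cup', 'cheesecloth', 'shot', 'cab', 'cupboard']
print(find_double_words(words))
# → ['shotglass', 'cabin', 'cheesecloth', 'cupboard']
```

This does N iterations of at most L-1 membership tests each, matching the O(N*L) bound, and avoids generating the N**2 permutations entirely.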
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
Posts: 4,560
Threads: 1,464
Joined: Sep 2016
what i am looking for is how to handle cases where reading the dictionary file into a dict, list, or set runs out of memory. sure, it is easy if you can keep a set of all the words in memory. i tried it with some quick code and it runs very fast. the memory efficiency is just a personal interest. these days memory is cheap; my laptop has 16GB and my server 64GB, which is way more than enough to do this, even with a dozen VMs running (none running Windows).
Posts: 11,875
Threads: 474
Joined: Sep 2016
There are some possible solutions here
Posts: 2,344
Threads: 62
Joined: Sep 2016
Personally, I would have tended toward Ofnuts' solution. With the memory constraint, things get much trickier. I don't believe it is possible in linear time without sufficient memory, I'm even more confident it's impossible with a single-pass constraint added, and doing our best within those limits would require further clarification.
Assuming we can write to disk, I'd:
- use some external sorting method on tuples of the form (the_word, its_line_number) to produce a new file sorted_words.txt
- iterate through that, keeping track of the "previous" word(s) which are a prefix of the current line; a prefix candidate can be dropped as soon as the current iterated word doesn't have it as a prefix
- while iterating, generate a list of suffixes (tuples of the form (suffix, line_number)) which I would dump to a file, suffixes.txt, if memory is not enough for it
- sort that suffixes list (externally if needed, in-memory if possible)
- iterate through that suffix list[1] and the sorted_words.txt file in parallel (but not with zip), advancing whichever one has the "older" word and storing line_number whenever a suffix is seen to be among the sorted words. Same deal with the list of line numbers: store in memory if possible, go to disk when needed, and ultimately sort it
- again, do the parallel iteration thing with the input list and line numbers, printing out the "double words"
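The parallel merge-style iteration in the steps above might look roughly like this sketch (the function name and sample data are hypothetical, and both inputs are assumed already sorted):

```python
def matching_line_numbers(sorted_suffixes, sorted_words):
    """Merge-walk two sorted streams: (suffix, line_number) tuples and
    plain words. Yield line_number whenever a suffix equals a word,
    advancing only the stream holding the 'older' (smaller) key."""
    suffixes = iter(sorted_suffixes)
    words = iter(sorted_words)
    suf = next(suffixes, None)
    word = next(words, None)
    while suf is not None and word is not None:
        if suf[0] < word:
            suf = next(suffixes, None)      # suffix stream is behind
        elif suf[0] > word:
            word = next(words, None)        # word stream is behind
        else:
            yield suf[1]                    # match: record line number
            suf = next(suffixes, None)

# hypothetical sample data: each half of a would-be double word,
# tagged with the line number of the word it was split from
suffixes = sorted([('cloth', 8), ('board', 11), ('zzz', 3)])
words = sorted(['board', 'cabin', 'cloth', 'cup'])
print(sorted(matching_line_numbers(suffixes, words)))
# → [8, 11]
```

Because each stream only ever moves forward, both files are read contiguously exactly once, which fits the disk-access assumptions below.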
Here are some of my assumptions:
- Disk accesses are expensive
- Disk writes cost more than disk reads
- Accessing a file on disk is cheaper when done contiguously
- Python or the OS, when iterating through a file, will do smart things about reading ahead and such, to take advantage of the assumptions above; if not, I'd have to write something to do that myself
There are probably better ways (a Bloom filter might be used, for example), but this is how I would initially try to solve it... it certainly wouldn't be trivial to code up, especially with each little improvement. You could also try to squeeze memory further: if you know the words are ASCII, for example, you could reduce memory usage by dropping the unused eighth bit of each character.
I might be over-complicating things here, but I think (not having done any math on it) that this would use fewer disk accesses than something like building an on-disk set for Ofnuts' solution.
[1] If the suffixes fit in memory, since I was adding them to a list, I'd sort it in reverse and pop() from the end while iterating, decreasing memory use as I go. If it was in a file, I'd already be using minimal memory, so nothing super special is needed there.
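As a rough illustration of the Bloom filter idea mentioned above (the bit-array size and hash scheme here are arbitrary choices for a toy sketch, not a tuned design), membership of word halves could be pre-screened in a fixed memory budget, at the cost of occasional false positives:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: fixed bit array, k positions carved out of
    one SHA-256 digest. Membership tests can yield false positives
    but never false negatives."""
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode('utf-8')).digest()
        for i in range(self.num_hashes):
            # use 4 bytes of the 32-byte digest per hash position
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, 'big') % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
for w in ('cheese', 'cloth'):
    bf.add(w)
print('cheese' in bf)   # always True: no false negatives
print('board' in bf)    # almost certainly False at this fill level
```

With the filter sized to the word count, most non-members are rejected without touching disk, and only the rare positives need a confirming lookup in the sorted file.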