Python Forum
programming challenge: double words
#1
given a list of words, some are double words, such as "cheesecloth".  your program is to output every double word it encounters that is made of two words from that same list.  the double words are to be output in the same order as they appear in the list, which may not be sorted order.  so "cheesecloth" is a double word only if that list also contains "cheese" and "cloth".
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
Is 'foofoo' a double word if 'foo' is in the list?
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
#3
(Dec-10-2016, 06:03 AM)ichabod801 Wrote: Is 'foofoo' a double word if 'foo' is in the list?
good question


i would say that fits the definition.
#4
No sample list was given, nor any indication of its size; this works with my small sample list.
import itertools

words = ('board', 'shotglass', 'cheese', 'glass', 'cabin', 'cloth', 'in',
         'cup', 'cheesecloth', 'shot', 'cab', 'cupboard')
double_words = []

for word in itertools.permutations(words, 2):
    new_word = ''.join(word)
    if new_word in words:
        double_words.append(new_word)

for word in words:
    if word in double_words:
        print('{} is a double word'.format(word))
Output:
shotglass is a double word
cabin is a double word
cheesecloth is a double word
cupboard is a double word
#5
Mine is similar to Yoriz's, but using sets:

import itertools

def double_words(word_list):
    word_set = set(word_list)
    doubles = set(''.join(pair)
                  for pair in itertools.permutations(word_list, 2)
                  if ''.join(pair) in word_set)
    return [word for word in word_list if word in doubles]
It works on Yoriz's word list. I ran it on the full word list I normally use; it ran for quite a while and found 15,876 double words.
#6
now, what if the word list is a huge file that you can read one word at a time? what if memory cannot hold the whole list in any form? can you handle that case with a fixed number of passes? can you make it scale as O(number_of_words)?
#7
(Dec-10-2016, 02:20 PM)ichabod801 Wrote: Mine is similar to Yoriz's, but using sets: [...]

No time for code, but:

- build a set of all the words
- scan the list again, and for each word iterate over split positions x: see if word[:x] is in the set, and if so, whether word[x:] is also in the set.

No need for the N**2 set. Complexity is O(N*L) where L is the average word length, but since that is practically a constant for large sets of random words, the result must be O(N).
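The split-position scan described above might look something like this minimal sketch (assuming the word set fits in memory; the function name and test list are my own, not from the thread):

```python
def find_double_words(words):
    word_set = set(words)                      # first pass: build the set
    doubles = []
    for word in words:                         # second pass: scan again
        # try every split point; word[:x] and word[x:] must both be words
        for x in range(1, len(word)):
            if word[:x] in word_set and word[x:] in word_set:
                doubles.append(word)
                break                          # report each word once
    return doubles

words = ['board', 'shotglass', 'cheese', 'glass', 'cabin', 'cloth', 'in',
         'cup', 'cheesecloth', 'shot', 'cab', 'cupboard']
print(find_double_words(words))
# ['shotglass', 'cabin', 'cheesecloth', 'cupboard']
```

Because it iterates the original list in order, the output order matches the input order, as the challenge requires.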
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
#8
what i am looking for is how to handle cases where reading the dictionary file into a dict, list, or set runs out of memory.  sure, it is easy if you can hold a set of all the words in memory.  i tried it with some quick code and it runs very fast; the memory efficiency is just a personal interest.  these days, memory is cheap; my laptop has 16GB and my server has 64GB, which is way more than enough to do this, even with a dozen VMs running (none running Windows).
#9
There are some possible solutions here
#10
Personally, I would have tended toward Ofnuts' solution. With the memory constraint, things get much trickier. I don't believe it is possible in linear time without sufficient memory; I'm even more confident it's impossible if you add a single-pass constraint, and doing our best within those limits would require further clarification.

Assuming we can write to disk, I'd
  • use some external sorting method on tuples of the form (the_word, its_line_number) to produce a new file sorted_words.txt
  • iterate through that, keeping track of the "previous" word(s) which are a prefix of the current line; a prefix candidate can be dropped as soon as the current iterated word doesn't have it as a prefix
  • while iterating, generate a list of suffixes (tuples of the form (suffix, line_number)) which I would dump to a file, suffixes.txt, if memory is not enough for it
  • sort that suffixes list (externally if needed, in-memory if possible)
  • iterate through the suffix list[1] and the sorted_words.txt file in parallel (but not with zip), advancing whichever one holds the "older" word and storing line_number whenever the suffix is found among the sorted words. Same deal with the list of line numbers: store in memory if possible, go to disk when needed, and ultimately sort it
  • again, do the parallel iteration thing with the input list and line numbers, printing out the "double words"
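The "parallel iteration" in the last two steps is essentially a merge over two sorted sequences. A rough sketch, with names and data shapes that are my own guesses at the intent (not tested against real on-disk files):

```python
def matching_line_numbers(sorted_suffixes, sorted_words):
    """sorted_suffixes: iterable of (suffix, line_number), sorted by suffix.
    sorted_words: iterable of (word, line_number), sorted by word.
    Yields the suffix's line_number whenever the suffix equals a real word."""
    suffixes = iter(sorted_suffixes)
    words = iter(sorted_words)
    suffix = next(suffixes, None)
    word = next(words, None)
    while suffix is not None and word is not None:
        if suffix[0] < word[0]:          # advance whichever side is "older"
            suffix = next(suffixes, None)
        elif suffix[0] > word[0]:
            word = next(words, None)
        else:                            # suffix matches a word in the list
            yield suffix[1]
            suffix = next(suffixes, None)
```

Since both inputs only need to be iterated forward once, each side can be a generator reading lazily from its sorted file, keeping memory use constant.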
Here are some of my assumptions:
  • Disk accesses are expensive
  • Disk writes cost more than disk reads
  • Accessing a file on disk is cheaper when done contiguously
  • Python or the OS, when iterating through a file, will do smart things about reading ahead and such, to take advantage of the assumptions above; if not, I'd have to write something to do that myself

There are probably better ways (a Bloom filter might be used, for example), but this is how I would initially try to solve it... it certainly wouldn't be trivial to code up, especially with each little improvement. You could also try to squeeze memory further: if you know the input is ASCII, for example, you could reduce memory usage by reclaiming that unused eighth bit.
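For illustration, a Bloom filter for membership tests could look something like this toy sketch (the bit count, hash count, and use of sha256 are arbitrary choices of mine, not anything from the thread; real sizing depends on the word count and acceptable false-positive rate):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)   # fixed, small memory footprint

    def _positions(self, item):
        # derive num_hashes bit positions from slices of one sha256 digest
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, 'big') % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The appeal for this problem is the fixed memory footprint, though any "hit" would still need to be confirmed against the real word list because of false positives.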

I might be over-complicating things here, but I think (not having done any math on it) that this would use fewer disk accesses than something like building an on-disk set for Ofnuts' solution.

[1] If the suffixes fit in memory, since I was adding them to a list, when sorting I'd do it reversed and pop() from the list, thereby decreasing the memory use while iterating. If it was in a file, I'd already be using minimal memory so not do anything super special here.

