abstract data structures

Skaperen · (This post was last modified: Apr-18-2017, 04:38 AM by Skaperen.)

in my C programming i implemented a data structure that i regularly used, similar to how a dictionary is used in Python. but it had some additional features and methods. i have seen that in Python there are some more advanced classes are created, some of which are included in the Python package,itself. i am wondering if anything might have been created like what i did in C.

my avlmap project for/in C was basically a mapping on top of an AVL tree. this was implemented as a 2-layer class where a set of functions implemented an AVL tree and a set of functions used the AVL functions and implemented a data mapping on top of that (with some tree aspects remaining). when using the mapping layer, there was still the concept of a current location (kept in the AVL layer) and methods to step forward or reverse in addition to methods to find keys in the mapping (that change the current location). my avlmap code included macros to easily code iterators (this is in C). some additional features included the concept that the current location could be between nodes (data members) which would happen if a search failed ... this represented where the search key would be if it existed. then the caller could step forward or reverse from there. this was useful for many things such as floating point keys. another feature was the ability to insert duplicate data in a tree. among all elements/nodes with the same key. they were represented like a list that was appended to ... stepping forward stepped through duplicate keys in the same order as they were inserted.

i ended up using avlmap both like i would use a dictionary in Python as well as use a list in Python. in one app i even used an avlmap as a queue.

what i would be looking for in Python is a class that works similar to a dictionary, but allows duplicate keys, too. fuzzy searching would be a plus, too (given a key that may not exist get the next lower and higher keys that do exist).

**Larz60+** · Apr-18-2017, 08:36 AM

How can a key be a key if it's duplicated.
There must additional qualifier to make it unique, in which case it's no longer a duplicate,
but rather a complex key.

volcano63 · Apr-18-2017, 08:48 AM

collections.UserDict allows you to sub-class dictionary and override its default behavior. Nevertheless, as @Larz60+ has stipulated - multiple keys is not an option.
Of course, you can simulate that by hacking indices built-in behind the scene, and somehow iterate over those in __getitem__ and __setitem__ methods of your class (and some more) - but it may not be enough, and the resulting behavior may turn out unpredictable.

wavic · (This post was last modified: Apr-18-2017, 10:41 AM by wavic.)

You can subclass the dict type. And build something like this.

In [1]: from collections import defaultdict

In [2]: data = defaultdict(list)

In [3]: data = [[{k: k**2}] for k in range(1, 15)]

In [4]: data
Out[4]: 
[[{1: 1}],
 [{2: 4}],
 [{3: 9}],
 [{4: 16}],
 [{5: 25}],
 [{6: 36}],
 [{7: 49}],
 [{8: 64}],
 [{9: 81}],
 [{10: 100}],
 [{11: 121}],
 [{12: 144}],
 [{13: 169}],
 [{14: 196}]]

In [5]: for element in data:
    ...:     if element[0].get(3): # if there is key == 3
    ...:         element.append({3: 3*2})
    ...: 

In [6]: data
Out[6]: 
[[{1: 1}],
 [{2: 4}],
 [{3: 9}, {3: 6}],
 [{4: 16}],
 [{5: 25}],
 [{6: 36}],
 [{7: 49}],
 [{8: 64}],
 [{9: 81}],
 [{10: 100}],
 [{11: 121}],
 [{12: 144}],
 [{13: 169}],
 [{14: 196}]]

I think Python difflib will do the the fuzzy searching but I am not aware of it and can't write some examples right now.

Also, you may find this interesting too. Defaultdict again: https://gist.github.com/hrldcpr/2012250

***ichabod801*** · (This post was last modified: Apr-19-2017, 01:28 AM by ichabod801.)

I think this would be a subclass of OrderedDict. The AVL tree structure keeps order, but is otherwise only useful for efficient lookup. Dicts provide the efficient lookup, just with a hash table instead of a tree structure. Then you would just need to worry about insertion order for distinguishing duplicate keys. So there would be ostensible keys and actual keys. If 'spam' is the ostensible key, then the actual key would be (1, 'spam'), or maybe (0, 'spam') for the first insertion of that key. You would just need to override the getitem and setitem to handle the ostensible vs. actual keys. Although it's not clear how you would reassign a key. Frex:

mo_dict['spam'] = 'spam'
mo_dict['spam'] = 'eggs'

Would result in something like {(0, 'spam'): 'spam', (1, 'spam'): 'eggs'}. It's not clear how you would change the value for the actual key (0, 'spam'). Maybe that's not something this data structure is meant to do? And what does mo_dict['spam'] return? ['spam', 'eggs']?

If that was all clarified, than the other features could be implemented pretty easily.

Edit: No, OrderedDict wouldn't work, since it uses insertion order. I guess you would have to implement the tree structure natively, with the extra features. Not that trees are hard to implement in Python.

Skaperen · (This post was last modified: Apr-19-2017, 03:33 AM by Skaperen.)

(Apr-18-2017, 08:36 AM)Larz60+ Wrote: How can a key be a key if it's duplicated.
There must additional qualifier to make it unique, in which case it's no longer a duplicate,
but rather a complex key.

i simply made the logic/code add a new node to the AVL tree. a scan of nodes simply went through both nodes. by carefully controlling the order of the tree search i can be sure a search finds the first of the group of nodes with the same key. it is possible to do it to get the last one instead. the forward/reverse steps work strictly on the pointer structure of the tree. once the current position (a node pointer) is at the first of duplicate(s) then stepping forward moves to the next node and it has the same key. implementing duplicate keys can make a hash table based mapping more complex, but not a binary search tree like AVL since the construction of a BST does not involve the keys.

you could make a binary tree (note the lack of "search") with nodes that do not have keys at all. the way a key is used in a BST affects whether the code chooses to go "left" or "right" (lower or higher) when adding new nodes or searching a node. i simply make a duplicate key always go higher which results in the order of the group of duplicates being the order of insert in time sequence. my implementation has different code to do this depending on whether the caller wants to allow duplicates or not (at insertion time) for optimal performance (which is a frequent goal in C programming).

(Apr-18-2017, 08:48 AM)volcano63 Wrote: collections.UserDict allows you to sub-class dictionary and override its default behavior. Nevertheless, as @Larz60+ has stipulated - multiple keys is not an option.
Of course, you can simulate that by hacking indices built-in behind the scene, and somehow iterate over those in __getitem__ and __setitem__ methods of your class (and some more) - but it may not be enough, and the resulting behavior may turn out unpredictable.

its not an option in dict or set to allow implementations (for Python) by hash tree and such. a BST implementation can do it ... i did it with AVL (a kind of BST). but a hash tree can do searches faster than a BST.

simulating it may be how it needs to be done to have it in Python. i'd probably just do a dict of lists.

(Apr-19-2017, 01:28 AM)ichabod801 Wrote: I think this would be a subclass of OrderedDict. The AVL tree structure keeps order, but is otherwise only useful for efficient lookup. Dicts provide the efficient lookup, just with a hash table instead of a tree structure. Then you would just need to worry about insertion order for distinguishing duplicate keys. So there would be ostensible keys and actual keys. If 'spam' is the ostensible key, then the actual key would be (1, 'spam'), or maybe (0, 'spam') for the first insertion of that key. You would just need to override the getitem and setitem to handle the ostensible vs. actual keys. Although it's not clear how you would reassign a key. Frex:
mo_dict['spam'] = 'spam'
mo_dict['spam'] = 'eggs'
Would result in something like {(0, 'spam'): 'spam', (1, 'spam'): 'eggs'}. It's not clear how you would change the value for the actual key (0, 'spam'). Maybe that's not something this data structure is meant to do? And what does mo_dict['spam'] return? ['spam', 'eggs']?

If that was all clarified, than the other features could be implemented pretty easily.

Edit: No, OrderedDict wouldn't work, since it uses insertion order. I guess you would have to implement the tree structure natively, with the extra features. Not that trees are hard to implement in Python.

in my first pass thinking about how to do this in Python i come up with doing a dict of lists. an insert would get a list by dict lookup. if absent, add a new 1-list with the new item into the dict. if present and duplicate not allowed, replace [0] in the found list or make a replacement 1-list (would replace a previous group of duplicates). if duplicates are allowed then append the list. for lookup, if no list found then item not found. if a list is found then get a reference to [0] of the list. if deleting a node, delete or pop it from the list. if the list becomes empty, delete that key from the dict.

what a dict of lists still lacks is keeping a position between keys when a search fails. even if there are no duplicates, a dict does not make it easy to find the "next key ..." up or down from a given key (that may or may not be in the dict). this is what i currently have no solution to in Python (or in any language when using a hash table abstract).

Skaperen · Apr-19-2017, 06:02 AM

my avlmap code was moved into my libh code many years ago. you can download the latest version of libh here: http://stratusrelay.com/free/libh-0.9.8.tar.gz.

you can get a different compression format by changing "gz" to "bz2" or "xz" in the URL if you want to download a smaller file (like over a 300 baud modem).

the sizes and sha256 checksums of the files are:

Output:849357  libh-0.9.8.tar.gz 
520182  libh-0.9.8.tar.bz2
446484  libh-0.9.8.tar.xz
82f25d7297ac5a1cd7fefc9cf55cd1d8dc3d9247761a66e51afc0a5848e4af35  libh-0.9.8.tar.gz
f2fa2d4023f05d144487d18af73fa8c3f92b0a7feb8b012a0d8207d194f4aa1d  libh-0.9.8.tar.bz2
d0ce59810176f0f6a7c41631a7c1296f06929725951156ccea1587609c0524b8  libh-0.9.8.tar.xz

***ichabod801*** · Apr-19-2017, 06:34 AM

If you're dead set on doing this as a dict, rather than just implementing the tree structure, this is still easy:

Subclass UserDict
Add an order attribute.
When assigning to a new key, add the key to order, sort order, and add the value as [value]
When assigning to a current key, append to the list that is the value of that key.

Now you have a dict that knows the sorted order of the keys, and allows for multiple values per key. You can override the standard iterator by combining all of the list values in order, and iterate over that. You could implement next and previous attributes, each storing an index in order (or a tuple of index in order and index within the list value), for stepping forward and backward in the dict.

**Larz60+** · (This post was last modified: Apr-19-2017, 09:20 AM by Larz60+.)

Oh course, hashing uses duplicate keys, didn't think of that but again in reality the hash key is augmented by the data it is carrying with it.

That's only because there is a limited number of cells in the target array.

The way I always designed hash tables was:

1st make sure the length of the hash table is a prime number. This only because it makes for very even distribution. This length was chosen to be close to the number of items you expected to be in the table at any given time.
I didn't worry about using another cell in the table when a collision was encountered, what I did instead was to start a linked list moving laterally from the cell. Then you never had to reallocate the table itself, which is time consuming.
On lookup, if a collision was encountered, it would start a search in the linked list but it would rarely go more than a few nodes before a match was found. This never caused a bottleneck as you would be inclined to think, because the prime number length of the table led to very even distribution (in a large table).

I used this method on many large search algorithms with outstanding success. It is exactly the algorithm I
used for call record processing at LCI International (which became QWest, which is now Century Link).

Skaperen · (This post was last modified: Apr-19-2017, 09:44 AM by Skaperen.)

(Apr-19-2017, 06:34 AM)ichabod801 Wrote: If you're dead set on doing this as a dict, rather than just implementing the tree structure, this is still easy:

Subclass UserDict

Add an order attribute.

When assigning to a new key, add the key to order, sort order, and add the value as [value]

When assigning to a current key, append to the list that is the value of that key.

Now you have a dict that knows the sorted order of the keys, and allows for multiple values per key. You can override the standard iterator by combining all of the list values in order, and iterate over that. You could implement next and previous attributes, each storing an index in order (or a tuple of index in order and index within the list value), for stepping forward and backward in the dict.

i'm not set on anything. i'm just wondering if anyone has done this. if not, maybe i'll try to do it in Python, or maybe not. i may leave it for the future when i learn how to do Python stuff in C.

(Apr-19-2017, 09:20 AM)Larz60+ Wrote: Oh course, hashing uses duplicate keys, didn't think of that but again in reality the hash key is augmented by the data it is carrying with it.

That's only because there is a limited number of cells in the target array.

The way I always designed hash tables was:
1st make sure the length of the hash table is a prime number. This only because it makes for very even distribution. This length was chosen to be close to the number of items you expected to be in the table at any given time.

I didn't worry about using another cell in the table when a collision was encountered, what I did instead was to start a linked list moving laterally from the cell. Then you never had to reallocate the table itself, which is time consuming.

On lookup, if a collision was encountered, it would start a search in the linked list but it would rarely go more than a few nodes before a match was found. This never caused a bottleneck as you would be inclined to think, because the prime number length of the table led to very even distribution (in a large table).

I used this method on many large search algorithms with outstanding success. It is exactly the algorithm I
used for call record processing at LCI International (which became QWest, which is now Century Link).

the way i usually did a hash table was to do another step in the table, reusing the distribution therein. it was unlikely the next step would have a collision. if it did, just keep going. a search had to verify the key. the hash table was basically an optimized cache.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Introducing scalaps: Scala-inspired data structures for Python	matthagy	2	2,302	Mar-03-2019, 01:55 AM Last Post: Larz60+

abstract data structures

User Panel Messages

Announcements