Bottom Page

Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
 Fastest dict/map method when 'key' is already a hash?
#1
I'm implementing deduplication for a new backup tool based on LVM thin-delta where the destination archive consists only of chunks, together with a manifest file listing the chunk filenames and SHA-256 hash for each.

Since my keys for each chunk are already hashes (the value is the chunk filename), is there a more efficient way to store and find chunks in Python than dict()?



For example:

chunkidx = {}
with open("manifest","r") as mf:
  for ln in mf:
    chash, chname = ln.strip().split()
    chunkidx[chash] = chname
The above compiles a dict chunkidx from the manifest file where the SHA-256 hash is automatically re-hashed internally by Python before storing the key-value pair. I assume this costs some amount of overhead.

When using the index:

while True:
  buf    = zlib.compress(volume.read(bufsize))
  chash  = hashlib.sha256(buf).hexdigest()
  if chash in chunkidx:
    send_link_to_existing_chunk(chunkidx[chash], chash)
  else:
    send_new_chunk(chname, chash)
In the above example, the overhead seems to be compounded by the fact that Python must create a temporary internal hash for the chash variable when searching the chunkidx dict.

(I realize there is also the hexdigest (ascii) vs digest (binary) issue, which is a choice I'll make based on overall efficiency; I'm open to suggestions here as well.)

Additional:

My only idea for greater speed (so far) involves storing the first 4 bytes of chash in some sort of bytearray and using that for a pre-search. If there is a match in the bytearray search, then do a lookup and compare of the full hash values. I may be wrong, but this seems like it could enhance overall search speed.
Quote
#2
Quote:I assume this costs some amount of overhead.
This kind of statements need to be measured with a tool such as the profile module before you do anything. At first sight it seems to me that the calls to zlib.compress() and sha256().hexdigest() may cost much much more than the dictionary lookup.
Quote
#3
Thanks for the tip. However, the compression and sha256 overhead is essential functionality and can't be avoided.

My concern was that hashing had already been done once, and using the dict adds two or three more hashing ops on top of that. My understanding of the native Python hashing is that unfortunately it's not fit for strong verification and deduplication, so I don't think I can use the Python dict hashes for those program functions.

I will do some testing to see whether the extra hashing imposed by dict incurs a significant penalty. But consider the range of range of iterations involved: For a 2TB volume, processing with a chunk size of 64KB will result in tens of millions of iterations.
Quote
#4
I've also run into another issue with my indexing dict: Its using a lot of RAM and making the system swap.

The issue appears to stem from the fact that even simple integers are rather weighty objects in Python (although switching from hex strings to integers did reduce size by more than 1/3)... and I'm indexing millions of them.

The usual way to deal with this seems to be using sqlite. But in-memory searches are very desirable in this case, so I will be researching other methods perhaps involving 'ctypes'.
Quote
#5
You could perhaps try alternative container packages such as sortedcontainers
Quote
#6
You can learn here: https://borgbackup.readthedocs.io/en/stable/
They made a tool like yours, but they optimized parts with Cython.

Sortedcontainers may help, because the items are inserted in the right order.

A simple example with Lists in Python:
class SortedList(list):

    def append(self, value):
        idx = bisect.bisect(self, value)
        super().insert(idx, value)

    def extend(self, iterable):
        for element in iterable:
            self.append(element)

    def __setitem__(self, idx, value):
        del self[idx]
        self.append(value)

    def insert(self, idx, value):
        raise NotImplementedError

    def sort(self):
        raise NotImplementedError
To check how much memory a datastructure takes, you should read this article.

Test results with 1e6 elements:
Output:
Dict-Dicts: 361.39 MiB Dict-Namedtuple: 201.17 MiB List-List 192.35 MiB List-Tuple: 177.10 MiB
Test-Code:
# use get_size from example

import os
import sys
from collections import namedtuple

 
def test_dirrent_sizes():
    entry = {'path': 'path'*10, 'size': 1234, 'mtime': 'mtime'}
    st = namedtuple('file', entry.keys())
    length = range(int(1e6))
    print('Dict-Dicts:', end=' ')
    size = get_size({os.urandom(64): entry.copy() for _ in length}) / 1024**2
    print(f'{size:.2f} MiB')
    print('Dict-Namedtuple:', end=' ')
    size = get_size({os.urandom(64): st(*entry.values()) for _ in length}) / 1024**2
    print(f'{size:.2f} MiB')
    print('List-List', end=' ')
    size = get_size([[os.urandom(64), *entry.values()] for _ in length]) / 1024**2
    print(f'{size:.2f} MiB')
    print('List-Tuple:', end=' ')
    size = get_size([(os.urandom(64), *entry.values()) for _ in length]) / 1024**2
    print(f'{size:.2f} MiB')
Making the structures smaller and a automatic sorting by appending a new element, should speed up everything.
Gribouillis likes this post
My code examples are always for Python >=3.6.0
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Quote
#7
@Dead_eye Thanks for the detailed reply.

Some background: My program is intended to address the shortcomings of borg and similar tools, as it can generate incremental deltas instantly from volume snapshot metadata. It also doesn't require a high degree of network interactivity (which creates multiple security issues), and doesn't cache + retransmit data chunks.

Although I've done some tests of borg, I haven't looked into its dedup indexing yet as I thought a simple dict would be worth trying at first:

dedup_idx[hash_i] = (session_obj, address_i)
session_obj is an object reference that holds a chunk's backup session context (like source volume ID, file path, etc.). Typically, there are only about 10-100 session_obj instances so the relationship is many:1.

hash_i and address_i are integers referencing the sha256 hash and 64bit source-volume address for the chunk, respectively. Using integers and object refs this way reduces memory footprint vs strings by almost half.

Before reading your post, I had found a way to poll memory use this way:
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * resource.getpagesize()
Output:
Without index: 65MB With index: 611MB (690135 elements)
Taking your suggestion about sorted collections into account, I'm trying to figure out what advantage they provide for building and searching an index of 'random' hashes. The point is to detect collisions, not to keep elements sorted. What I do get from your size profiling is that lists are more space-efficient than dicts. But I don't know how to efficiently search a key in a list of tuples.
Quote

Top Page

Possibly Related Threads...
Thread Author Replies Views Last Post
  Hash command works differently for me in CMD and Spyder ZweiDCG 3 209 Sep-10-2019, 01:10 PM
Last Post: DeaD_EyE
  length constraint on phrase hash to password javaben 0 260 Aug-21-2019, 05:34 PM
Last Post: javaben
  Create file archive that contains crypto hash ED209 1 253 May-29-2019, 03:05 AM
Last Post: heiner55
  fastest way to record values between quotes paul18fr 5 555 Apr-15-2019, 01:51 PM
Last Post: snippsat
  hash v2 and v3 help Normalitie 7 1,011 Mar-22-2018, 01:57 PM
Last Post: DeaD_EyE
  Fastest Way to declare this list. Kowalski 2 865 Feb-21-2018, 06:26 PM
Last Post: DeaD_EyE
  virtualenv activate.ps1 hash error po20 2 1,252 Jan-13-2018, 09:21 AM
Last Post: po20
  Hash function in Python rachelrosemond 3 1,789 Sep-29-2017, 02:57 PM
Last Post: ichabod801
  Python Hash list check here2learn 4 2,433 Feb-27-2017, 08:35 PM
Last Post: here2learn
  fast hash function verstapp 3 2,638 Dec-13-2016, 07:10 PM
Last Post: verstapp

Forum Jump:


Users browsing this thread: 1 Guest(s)