Fastest dict/map method when 'key' is already a hash?

tasket · Apr-13-2019, 03:35 AM

I'm implementing deduplication for a new backup tool based on LVM thin-delta where the destination archive consists only of chunks, together with a manifest file listing the chunk filenames and SHA-256 hash for each.

Since my keys for each chunk are already hashes (the value is the chunk filename), is there a more efficient way to store and find chunks in Python than dict()?

For example:

chunkidx = {}
with open("manifest","r") as mf:
  for ln in mf:
    chash, chname = ln.strip().split()
    chunkidx[chash] = chname

The above compiles a dict chunkidx from the manifest file where the SHA-256 hash is automatically re-hashed internally by Python before storing the key-value pair. I assume this costs some amount of overhead.

When using the index:

while True:
  buf    = zlib.compress(volume.read(bufsize))
  chash  = hashlib.sha256(buf).hexdigest()
  if chash in chunkidx:
    send_link_to_existing_chunk(chunkidx[chash], chash)
  else:
    send_new_chunk(chname, chash)

In the above example, the overhead seems to be compounded by the fact that Python must create a temporary internal hash for the chash variable when searching the chunkidx dict.

(I realize there is also the hexdigest (ascii) vs digest (binary) issue, which is a choice I'll make based on overall efficiency; I'm open to suggestions here as well.)

Additional:

My only idea for greater speed (so far) involves storing the first 4 bytes of chash in some sort of bytearray and using that for a pre-search. If there is a match in the bytearray search, then do a lookup and compare of the full hash values. I may be wrong, but this seems like it could enhance overall search speed.

**Gribouillis** · (This post was last modified: Apr-13-2019, 08:02 AM by Gribouillis.)

Quote:I assume this costs some amount of overhead.

This kind of statements need to be measured with a tool such as the profile module before you do anything. At first sight it seems to me that the calls to zlib.compress() and sha256().hexdigest() may cost much much more than the dictionary lookup.

tasket · Apr-13-2019, 05:20 PM

Thanks for the tip. However, the compression and sha256 overhead is essential functionality and can't be avoided.

My concern was that hashing had already been done once, and using the dict adds two or three more hashing ops on top of that. My understanding of the native Python hashing is that unfortunately it's not fit for strong verification and deduplication, so I don't think I can use the Python dict hashes for those program functions.

I will do some testing to see whether the extra hashing imposed by dict incurs a significant penalty. But consider the range of range of iterations involved: For a 2TB volume, processing with a chunk size of 64KB will result in tens of millions of iterations.

tasket · Apr-19-2019, 02:14 AM

I've also run into another issue with my indexing dict: Its using a lot of RAM and making the system swap.

The issue appears to stem from the fact that even simple integers are rather weighty objects in Python (although switching from hex strings to integers did reduce size by more than 1/3)... and I'm indexing millions of them.

The usual way to deal with this seems to be using sqlite. But in-memory searches are very desirable in this case, so I will be researching other methods perhaps involving 'ctypes'.

**Gribouillis** · Apr-19-2019, 04:57 AM

You could perhaps try alternative container packages such as sortedcontainers

DeaD_EyE · Apr-19-2019, 08:24 AM

You can learn here: https://borgbackup.readthedocs.io/en/stable/
They made a tool like yours, but they optimized parts with Cython.

Sortedcontainers may help, because the items are inserted in the right order.

A simple example with Lists in Python:

class SortedList(list):

    def append(self, value):
        idx = bisect.bisect(self, value)
        super().insert(idx, value)

    def extend(self, iterable):
        for element in iterable:
            self.append(element)

    def __setitem__(self, idx, value):
        del self[idx]
        self.append(value)

    def insert(self, idx, value):
        raise NotImplementedError

    def sort(self):
        raise NotImplementedError

To check how much memory a datastructure takes, you should read this article.

Test results with 1e6 elements:

Output:Dict-Dicts: 361.39 MiB
Dict-Namedtuple: 201.17 MiB
List-List 192.35 MiB
List-Tuple: 177.10 MiB

Test-Code:

# use get_size from example

import os
import sys
from collections import namedtuple

 
def test_dirrent_sizes():
    entry = {'path': 'path'*10, 'size': 1234, 'mtime': 'mtime'}
    st = namedtuple('file', entry.keys())
    length = range(int(1e6))
    print('Dict-Dicts:', end=' ')
    size = get_size({os.urandom(64): entry.copy() for _ in length}) / 1024**2
    print(f'{size:.2f} MiB')
    print('Dict-Namedtuple:', end=' ')
    size = get_size({os.urandom(64): st(*entry.values()) for _ in length}) / 1024**2
    print(f'{size:.2f} MiB')
    print('List-List', end=' ')
    size = get_size([[os.urandom(64), *entry.values()] for _ in length]) / 1024**2
    print(f'{size:.2f} MiB')
    print('List-Tuple:', end=' ')
    size = get_size([(os.urandom(64), *entry.values()) for _ in length]) / 1024**2
    print(f'{size:.2f} MiB')

Making the structures smaller and a automatic sorting by appending a new element, should speed up everything.

tasket · (This post was last modified: Apr-20-2019, 06:44 PM by tasket.)

@DeaD_EyE Thanks for the detailed reply.

Some background: My program is intended to address the shortcomings of borg and similar tools, as it can generate incremental deltas instantly from volume snapshot metadata. It also doesn't require a high degree of network interactivity (which creates multiple security issues), and doesn't cache + retransmit data chunks.

Although I've done some tests of borg, I haven't looked into its dedup indexing yet as I thought a simple dict would be worth trying at first:

dedup_idx[hash_i] = (session_obj, address_i)

session_obj is an object reference that holds a chunk's backup session context (like source volume ID, file path, etc.). Typically, there are only about 10-100 session_obj instances so the relationship is many:1.

hash_i and address_i are integers referencing the sha256 hash and 64bit source-volume address for the chunk, respectively. Using integers and object refs this way reduces memory footprint vs strings by almost half.

Before reading your post, I had found a way to poll memory use this way:

resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * resource.getpagesize()

Output:Without index:  65MB
With index:    611MB (690135 elements)

Taking your suggestion about sorted collections into account, I'm trying to figure out what advantage they provide for building and searching an index of 'random' hashes. The point is to detect collisions, not to keep elements sorted. What I do get from your size profiling is that lists are more space-efficient than dicts. But I don't know how to efficiently search a key in a list of tuples.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Fastest way tkinter	Quatrixouuu	2	381	Feb-19-2024, 07:20 AM Last Post: Danishhafeez
	What is the fastest way to get all the frames from a video file?	glorsh66	3	1,057	May-26-2023, 04:41 AM Last Post: Gribouillis
	[SOLVED] How to crack hash with hashlib	Milan	0	1,395	Mar-09-2023, 08:25 PM Last Post: Milan
	Fastest Way of Writing/Reading Data	JamesA	1	2,181	Jul-27-2021, 03:52 PM Last Post: Larz60+
	Fastest Method for Querying SQL Server with Python Pandas	BuJayBelvin	7	6,845	Aug-02-2020, 06:21 PM Last Post: jefsummers
	Sort a dict in dict	cherry_cherry	4	72,287	Apr-08-2020, 12:25 PM Last Post: perfringo
	Hash command works differently for me in CMD and Spyder	ZweiDCG	3	2,348	Sep-10-2019, 01:10 PM Last Post: DeaD_EyE
	length constraint on phrase hash to password	javaben	0	1,902	Aug-21-2019, 05:34 PM Last Post: javaben
	Create file archive that contains crypto hash	ED209	1	2,041	May-29-2019, 03:05 AM Last Post: heiner55
	fastest way to record values between quotes	paul18fr	5	3,269	Apr-15-2019, 01:51 PM Last Post: snippsat

Fastest dict/map method when 'key' is already a hash?

User Panel Messages

Announcements