Fastest dict/map method when 'key' is already a hash?
#1
I'm implementing deduplication for a new backup tool based on LVM thin-delta, where the destination archive consists only of chunks plus a manifest file listing each chunk's filename and SHA-256 hash.

Since my keys for each chunk are already hashes (the value is the chunk filename), is there a more efficient way to store and find chunks in Python than dict()?



For example:

chunkidx = {}
with open("manifest", "r") as mf:
  for ln in mf:
    # each manifest line: "<sha256-hex> <chunk-filename>"
    chash, chname = ln.strip().split()
    chunkidx[chash] = chname
The above builds the dict chunkidx from the manifest file; Python internally hashes the SHA-256 hex string again before storing each key-value pair. I assume this costs some amount of overhead.

When using the index:

while True:
  data = volume.read(bufsize)
  if not data:
    break                       # end of volume
  buf   = zlib.compress(data)
  chash = hashlib.sha256(buf).hexdigest()
  if chash in chunkidx:
    send_link_to_existing_chunk(chunkidx[chash], chash)
  else:
    send_new_chunk(buf, chash)  # new chunk: send the data itself
In the above example, the overhead seems to be compounded by the fact that Python must hash the chash string again every time it probes the chunkidx dict.

(I realize there is also the hexdigest (ASCII) vs digest (binary) issue, which is a choice I'll make based on overall efficiency; I'm open to suggestions here as well.)
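For example, digest() returns 32 raw bytes while hexdigest() returns a 64-character string, so binary keys would halve the key storage:

import hashlib

h = hashlib.sha256(b"chunk")
print(len(h.digest()))      # 32 raw bytes
print(len(h.hexdigest()))   # 64 hex characters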

Additional:

My only idea for greater speed (so far) involves storing the first 4 bytes of chash in some sort of bytearray and using that for a pre-search. If there is a match in the bytearray search, then do a lookup and compare of the full hash values. I may be wrong, but this seems like it could enhance overall search speed.
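A minimal sketch of that idea, reusing the chunkidx dict from above (with hex keys, the first 8 characters correspond to the first 4 digest bytes; prefilter and lookup are names I made up):

prefilter = {h[:8] for h in chunkidx}   # 4-byte (8 hex char) digest prefixes

def lookup(chash):
  if chash[:8] not in prefilter:        # cheap rejection for most new chunks
    return None
  return chunkidx.get(chash)            # full key check on a prefix hit

Caveat: a set probe still hashes its argument just as a dict probe does, so this mostly saves the full-key comparison, not the re-hashing.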
#2
Quote:I assume this costs some amount of overhead.
Claims like this need to be measured with a tool such as the profile or cProfile module before you do anything. At first sight it seems to me that the calls to zlib.compress() and sha256().hexdigest() may cost much more than the dictionary lookup.
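For instance, a rough timeit comparison along these lines (a sketch; the buffer contents, dict size, and chunk size are made up for illustration) would show where the time actually goes:

import hashlib
import os
import timeit
import zlib

buf = os.urandom(64 * 1024)   # one 64 KiB chunk of sample data
idx = {hashlib.sha256(os.urandom(32)).hexdigest(): i for i in range(100_000)}
key = hashlib.sha256(buf).hexdigest()

print("compress:   ", timeit.timeit(lambda: zlib.compress(buf), number=1000))
print("sha256:     ", timeit.timeit(lambda: hashlib.sha256(buf).hexdigest(), number=1000))
print("dict lookup:", timeit.timeit(lambda: key in idx, number=1000))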
#3
Thanks for the tip. However, the compression and sha256 overhead is essential functionality and can't be avoided.

My concern was that hashing had already been done once, and using the dict adds two or three more hashing ops on top of that. My understanding is that Python's built-in hashing is unfortunately not fit for strong verification and deduplication, so I don't think I can reuse the dict's internal hashes for those program functions.

I will do some testing to see whether the extra hashing imposed by dict incurs a significant penalty. But consider the number of iterations involved: for a 2 TB volume, a 64 KB chunk size means about 33 million iterations (2 TB / 64 KB ≈ 33.5 million).
#4
I've also run into another issue with my indexing dict: it's using a lot of RAM and making the system swap.

The issue appears to stem from the fact that even simple integers are rather weighty objects in Python (although switching from hex strings to integers did reduce size by more than 1/3)... and I'm indexing millions of them.

The usual way to deal with this seems to be sqlite, but in-memory searches are very desirable in this case, so I will be researching other methods, perhaps involving ctypes.
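One ctypes-free approach along those lines (a sketch with sample data, not tested against the actual tool): pack the fixed-width binary digests into a single sorted bytes object and binary-search it, which avoids per-object overhead entirely:

import hashlib

# sample data: a packed, sorted blob of 32-byte digests
digests = sorted(hashlib.sha256(bytes([i])).digest() for i in range(1000))
packed = b"".join(digests)   # 32 bytes per entry, no per-object overhead

def contains(blob, digest, width=32):
  # classic binary search over fixed-width records
  lo, hi = 0, len(blob) // width
  while lo < hi:
    mid = (lo + hi) // 2
    probe = blob[mid * width:(mid + 1) * width]
    if probe == digest:
      return True
    if probe < digest:
      lo = mid + 1
    else:
      hi = mid
  return False

print(contains(packed, hashlib.sha256(bytes([5])).digest()))   # True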
#5
You could perhaps try alternative container packages such as sortedcontainers.
#6
You can learn from this project: https://borgbackup.readthedocs.io/en/stable/
They made a tool like yours, but they optimized parts of it with Cython.

sortedcontainers may help, because items are inserted in sorted order.

A simple example with lists in Python:
import bisect

class SortedList(list):
    """A list that keeps itself sorted on every append."""

    def append(self, value):
        # insert at the position that keeps the list sorted
        idx = bisect.bisect(self, value)
        super().insert(idx, value)

    def extend(self, iterable):
        for element in iterable:
            self.append(element)

    def __setitem__(self, idx, value):
        # replacing an element re-inserts the new value in sorted position
        del self[idx]
        self.append(value)

    def insert(self, idx, value):
        raise NotImplementedError   # arbitrary positions would break the ordering

    def sort(self):
        raise NotImplementedError   # already sorted by construction
To check how much memory a data structure takes, you should read this article.

Test results with 1e6 elements:
Output:
Dict-Dicts:      361.39 MiB
Dict-Namedtuple: 201.17 MiB
List-List:       192.35 MiB
List-Tuple:      177.10 MiB
Test-Code:
# use get_size from example

import os
import sys
from collections import namedtuple

 
def test_different_sizes():
    entry = {'path': 'path'*10, 'size': 1234, 'mtime': 'mtime'}
    st = namedtuple('file', entry.keys())
    length = range(int(1e6))
    print('Dict-Dicts:', end=' ')
    size = get_size({os.urandom(64): entry.copy() for _ in length}) / 1024**2
    print(f'{size:.2f} MiB')
    print('Dict-Namedtuple:', end=' ')
    size = get_size({os.urandom(64): st(*entry.values()) for _ in length}) / 1024**2
    print(f'{size:.2f} MiB')
    print('List-List:', end=' ')
    size = get_size([[os.urandom(64), *entry.values()] for _ in length]) / 1024**2
    print(f'{size:.2f} MiB')
    print('List-Tuple:', end=' ')
    size = get_size([(os.urandom(64), *entry.values()) for _ in length]) / 1024**2
    print(f'{size:.2f} MiB')
Making the structures smaller, and keeping them automatically sorted as new elements are appended, should speed everything up.
#7
@DeaD_EyE Thanks for the detailed reply.

Some background: My program is intended to address the shortcomings of borg and similar tools, as it can generate incremental deltas instantly from volume snapshot metadata. It also doesn't require a high degree of network interactivity (which creates multiple security issues), and doesn't cache + retransmit data chunks.

Although I've done some tests of borg, I haven't looked into its dedup indexing yet as I thought a simple dict would be worth trying at first:

dedup_idx[hash_i] = (session_obj, address_i)
session_obj is an object reference that holds a chunk's backup session context (like source volume ID, file path, etc.). Typically, there are only about 10-100 session_obj instances so the relationship is many:1.

hash_i and address_i are integers referencing the sha256 hash and 64bit source-volume address for the chunk, respectively. Using integers and object refs this way reduces memory footprint vs strings by almost half.
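For reference, the integer conversion described above would look something like this (variable names illustrative):

import hashlib

digest = hashlib.sha256(b"example chunk data").digest()
hash_i = int.from_bytes(digest, "big")   # one 256-bit int instead of a 64-char hex string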

Before reading your post, I had found a way to poll memory use this way:
import resource

# note: ru_maxrss is reported in kilobytes on Linux (and bytes on macOS)
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
Output:
Without index: 65 MB
With index:    611 MB (690135 elements)
Taking your suggestion about sorted collections into account, I'm trying to figure out what advantage they provide for building and searching an index of 'random' hashes. The point is to detect collisions, not to keep elements sorted. What I do get from your size profiling is that lists are more space-efficient than dicts. But I don't know how to efficiently search for a key in a list of tuples.
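One idea I'm considering (a sketch with made-up entries): keep the list sorted by its first field and binary-search it with bisect, giving O(log n) lookups:

import bisect

# hypothetical entries: (hash_int, session_obj, address_int), sorted by hash
index = sorted([
  (0x3A, "session1", 4096),
  (0x7F, "session1", 8192),
  (0xC2, "session2", 0),
])

def find(index, key):
  i = bisect.bisect_left(index, (key,))   # (key,) sorts before any (key, ...)
  if i < len(index) and index[i][0] == key:
    return index[i]
  return None

print(find(index, 0x7F))   # (127, 'session1', 8192)
print(find(index, 0x01))   # None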


