@DeaD_EyE Thanks for the detailed reply.
Some background: My program is intended to address the shortcomings of borg and similar tools, as it can generate incremental deltas instantly from volume snapshot metadata. It also doesn't require a high degree of network interactivity (which creates multiple security issues), and doesn't cache + retransmit data chunks.
Although I've done some tests of borg, I haven't looked into its dedup indexing yet as I thought a simple dict would be worth trying at first:
Before reading your post, I had found a way to poll memory use this way:
Some background: My program is intended to address the shortcomings of borg and similar tools, as it can generate incremental deltas instantly from volume snapshot metadata. It also doesn't require a high degree of network interactivity (which creates multiple security issues), and doesn't cache + retransmit data chunks.
Although I've done some tests of borg, I haven't looked into its dedup indexing yet as I thought a simple dict would be worth trying at first:
dedup_idx[hash_i] = (session_obj, address_i)
session_obj
is an object reference that holds a chunk's backup session context (like source volume ID, file path, etc.). Typically, there are only about 10-100 session_obj instances so the relationship is many:1.hash_i
and address_i
are integers referencing the sha256 hash and 64bit source-volume address for the chunk, respectively. Using integers and object refs this way reduces memory footprint vs strings by almost half.Before reading your post, I had found a way to poll memory use this way:
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * resource.getpagesize()
Output:Without index: 65MB
With index: 611MB (690135 elements)
Taking your suggestion about sorted collections into account, I'm trying to figure out what advantage they provide for building and searching an index of 'random' hashes. The point is to detect collisions, not to keep elements sorted. What I do get from your size profiling is that lists are more space-efficient than dicts. But I don't know how to efficiently search a key in a list of tuples.