Apr-13-2019, 03:35 AM
I'm implementing deduplication for a new backup tool based on LVM thin-delta where the destination archive consists only of chunks, together with a manifest file listing the chunk filenames and SHA-256 hash for each.
Since my keys for each chunk are already hashes (the value is the chunk filename), is there a more efficient way to store and find chunks in Python than dict()?
For example:
chunkidx = {}
with open("manifest", "r") as mf:
    for ln in mf:
        chash, chname = ln.strip().split()
        chunkidx[chash] = chname

The above builds a dict chunkidx from the manifest file. Note that Python runs every key through its own internal string-hash function (not SHA-256) before storing the key-value pair, so each insert re-hashes a value that is already a cryptographic hash. I assume this costs some amount of overhead.
When using the index:
while True:
    data = volume.read(bufsize)
    if not data:          # stop at end of volume
        break
    buf = zlib.compress(data)
    chash = hashlib.sha256(buf).hexdigest()
    if chash in chunkidx:
        send_link_to_existing_chunk(chunkidx[chash], chash)
    else:
        send_new_chunk(buf, chash)

In the above example, the overhead seems to be compounded by the fact that Python must compute its internal hash of the chash string a second time when searching the chunkidx dict.
(I realize there is also the hexdigest (ASCII) vs digest (binary) issue, which is a choice I'll make based on overall efficiency; I'm open to suggestions here as well.)
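To make the hexdigest-vs-digest trade-off concrete, here is a minimal sketch (the chunk name "chunk0001" is just a placeholder): digest() keys are raw 32-byte bytes objects, half the length of the 64-character hexdigest() strings, so they take less memory and are slightly cheaper to hash and compare during dict lookups, at the cost of not being human-readable in the manifest.

```python
import hashlib

data = b"example chunk data"

hex_key = hashlib.sha256(data).hexdigest()   # 64-char ASCII str
bin_key = hashlib.sha256(data).digest()      # 32-byte bytes

# Binary keys are half the size of hex keys.
assert len(hex_key) == 64
assert len(bin_key) == 32

# bytes objects work as dict keys just like strings do.
idx = {bin_key: "chunk0001"}
print(bin_key in idx)   # True
```

If the manifest stays a text file, you can convert at load time with bytes.fromhex(chash) and write back with chash.hex().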
Additional:
My only idea for greater speed (so far) involves storing the first 4 bytes of chash in some sort of bytearray and using that for a pre-search. If there is a match in the bytearray search, then do a lookup and compare of the full hash values. I may be wrong, but this seems like it could enhance overall search speed.
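The prefix idea above can be sketched with a set of 4-byte prefixes acting as a cheap pre-filter in front of the full dict (function names here are hypothetical, not from the original code). One caveat, hedged: a Python dict lookup is already a single O(1) hash-table probe, so this pre-check mainly helps if the negative case dominates and the prefix set is small enough to stay cache-resident; it is worth benchmarking before committing to it.

```python
import hashlib

# chunkidx maps full binary SHA-256 digests to chunk filenames.
chunkidx = {}
prefixes = set()               # first-4-byte prefixes seen so far

def add_chunk(digest, name):
    chunkidx[digest] = name
    prefixes.add(digest[:4])

def lookup(digest):
    # Cheap pre-check: if the 4-byte prefix is absent, the full
    # hash cannot possibly be in the index.
    if digest[:4] not in prefixes:
        return None
    # Prefix hit: still need the full-hash lookup, since distinct
    # hashes can share a prefix.
    return chunkidx.get(digest)

d = hashlib.sha256(b"some chunk").digest()
add_chunk(d, "chunk0001")
print(lookup(d))                                    # chunk0001
print(lookup(hashlib.sha256(b"other").digest()))    # None
```

A set of bytes prefixes behaves like a tiny Bloom-filter stand-in here: false positives (prefix collisions) are possible and caught by the full lookup, false negatives are not.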