Python Forum
python Checksum
#1
Hi team,

I am creating a checksum using the code below. The code is working, but my CSV is 15 GB, hence I am reading the data in chunks:

data = f.read(10240)

Is this correct, or is there a better solution available?
Thanks for the help!



import hashlib
import os
import time
import ReadTime

def chksum(fpath, fname):
    start = time.time()
    h = hashlib.sha512()
    fullpath = os.path.join(fpath, fname)
    with open(fullpath, 'rb') as f:
        # read in fixed-size chunks so the whole 15 GB file is never held in memory
        while True:
            data = f.read(10240)
            if not data:
                break
            h.update(data)
    # build the digest and the output path once, after the whole file has been read
    chksum = h.hexdigest()
    name = fname.replace('.csv', '')
    chksumpath = os.path.join(fpath, f'{name}_chksum.csv')
    with open(chksumpath, 'w') as out:
        out.write(chksum)
    tmp = time.time() - start
    print("time taken to create checksum", ReadTime.timetaken(tmp))
#2
To my mind, sha512 seems to be quite large and possibly overkill.

Would an md4 not be more practical?
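
One caveat: md4 is not one of hashlib's guaranteed algorithms; it is only exposed when the OpenSSL build behind Python provides it, so it is worth checking first. A minimal sketch:

import hashlib

# md4 is only available if the linked OpenSSL build provides it
if "md4" in hashlib.algorithms_available:
    print(hashlib.new("md4", b"example data").hexdigest())
else:
    print("md4 is not available in this build")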
Sig:
>>> import this

The UNIX philosophy: "Do one thing, and do it well."

"The danger of computers becoming like humans is not as great as the danger of humans becoming like computers." :~ Konrad Zuse

"Everything should be made as simple as possible, but not simpler." :~ Albert Einstein
#3
md5 is no longer a secure cryptographic hash: different messages can hash to the same value.
Thus, md5 is considered broken, and has been for 20 years.

see: https://www.avira.com/en/blog/md5-the-broken-algorithm
#4
You can still use md5 and the other broken algorithms to compare for equality: if they spit out the same hash, then the content is the same, with a small factor of uncertainty.

The fastest hashing speed I got was with the use of mmap.
I also compared it to the md5sum tool, which is written in C.

Quote:[deadeye@nexus ~]$ python hasher.py
Downloads/xxx.mkv: 1.23 GiB

naive took 1.69 s
f27579b6142ae71fb34526374b482433

fast took 1.46 s
f27579b6142ae71fb34526374b482433

md5sum took 1.50 s
f27579b6142ae71fb34526374b482433


The code:
import mmap
import os
import subprocess
import time
from functools import wraps
from hashlib import md5


def file_size(file):
    units = "B KiB MiB GiB TiB".split()
    size = os.stat(file).st_size
    for unit in units:
        if size < 1024:
            break
        size /= 1024
    return f"{file}: {size:.2f} {unit}"


def speed(function):
    @wraps(function)
    def inner(*args, **kwargs):
        start = time.perf_counter()
        retval = function(*args, **kwargs)
        stop = time.perf_counter()
        print(f"{function.__name__} took {stop-start:.2f} s")
        return retval

    return inner


# please do not use md5 for cryptography


@speed
def md5sum(file):
    return subprocess.check_output(["md5sum", file], encoding="utf8").split()[0]


@speed
def naive(file):
    chunk_size = 4 * 1024  # 4 KiB chunks
    hasher = md5()
    with open(file, "rb") as fd:
        while chunk := fd.read(chunk_size):
            hasher.update(chunk)

    return hasher.hexdigest()


@speed
def fast(file):
    hasher = md5()
    with open(file, "rb") as fd:
        with mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            hasher.update(mm)

    return hasher.hexdigest()


file = "Downloads/xxx.mkv"
print(file_size(file))
print()

print(naive(file))
print()
print(fast(file))
print()
print(md5sum(file))
I guess the biggest speed difference depends on the chunk size: a small chunk size means many IOPS, a big chunk size means fewer IOPS but more memory consumption.
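
If you want to see that effect on your own files, a rough sketch like this (the same naive read loop as above, just timed with different chunk sizes; the file path is only a placeholder) should show it:

import time
from hashlib import md5

def timed_md5(file, chunk_size):
    hasher = md5()
    start = time.perf_counter()
    with open(file, "rb") as fd:
        while chunk := fd.read(chunk_size):
            hasher.update(chunk)
    return time.perf_counter() - start, hasher.hexdigest()

# compare a few chunk sizes on the same file
for chunk_size in (4 * 1024, 64 * 1024, 1024 * 1024):
    seconds, digest = timed_md5("Downloads/xxx.mkv", chunk_size)
    print(f"{chunk_size // 1024:>5} KiB chunks: {seconds:.2f} s  {digest}")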
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#5
(Nov-01-2022, 06:30 PM)Larz60+ Wrote: md5 is no longer a secure cryptographic hash: different messages can hash to the same value.
Thus, md5 is considered broken, and has been for 20 years.

I fully accept what you say.

The reason I suggested this (MD4, in fact) is that I use said hash for generating a checksum for items in a CSV file before the data is accepted. I've filtered 5000+ items (and counting) in this way, and have never had a false positive.

I'll report back if (or when?) I do.
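
Roughly, the idea looks like this; just a sketch with a hypothetical items.csv, and md5 standing in here because hashlib always provides it (md4 would be a drop-in via hashlib.new() where the OpenSSL build supports it):

import csv
import hashlib

seen = set()
accepted = []

with open("items.csv", newline="") as f:
    for row in csv.reader(f):
        # checksum of the whole row; identical rows produce identical digests
        digest = hashlib.md5(",".join(row).encode()).hexdigest()
        if digest in seen:
            continue  # already seen, do not accept it again
        seen.add(digest)
        accepted.append(row)

print(f"accepted {len(accepted)} unique items")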
Sig:
>>> import this

The UNIX philosophy: "Do one thing, and do it well."

"The danger of computers becoming like humans is not as great as the danger of humans becoming like computers." :~ Konrad Zuse

"Everything should be made as simple as possible, but not simpler." :~ Albert Einstein

