Python Forum
python Checksum
#1
Hi team,

I am creating a checksum using the code below. The code is working, but my CSV is 15 GB, hence I am reading the data in chunks:

data = f.read(10240)

Is this correct, or is there a better solution available?
Thanks for the help!



import hashlib
import os
import time
import ReadTime

def chksum(fpath, fname):
    start = time.time()
    h = hashlib.sha512()
    fullpath = os.path.join(fpath, fname)
    with open(fullpath, 'rb') as f:
        # read in fixed-size chunks so the whole 15 GB file is never held in memory
        while True:
            data = f.read(10240)
            if not data:
                break
            h.update(data)
    # build the digest and the output path once, after the whole file has been read
    chksum = h.hexdigest()
    name = fname.replace('.csv', '')
    chksumpath = os.path.join(fpath, f'{name}_chksum.csv')
    with open(chksumpath, 'w') as out:
        out.write(chksum)
    tmp = time.time() - start
    print("time taken to create checksum", ReadTime.timetaken(tmp))
#2
To my mind, sha512 seems to be quite large and possibly overkill.

Would an md4 not be more practical?
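
One caveat: md4 is not one of hashlib's guaranteed algorithms; it is only exposed when the OpenSSL build behind Python provides it, so it is worth checking first. A minimal sketch:

import hashlib

# md4 is only available if the linked OpenSSL build provides it
if "md4" in hashlib.algorithms_available:
    print(hashlib.new("md4", b"example data").hexdigest())
else:
    print("md4 is not available in this build")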
Sig:
>>> import this

The UNIX philosophy: "Do one thing, and do it well."

"The danger of computers becoming like humans is not as great as the danger of humans becoming like computers." :~ Konrad Zuse

"Everything should be made as simple as possible, but not simpler." :~ Albert Einstein
#3
md5 is no longer a secure cryptographic hash: different messages can hash to the same value.
Thus, md5 is considered broken, and has been for 20 years.

see: https://www.avira.com/en/blog/md5-the-broken-algorithm
#4
You can still use md5 and the other broken algorithms to compare for equality: if they spit out the same hash, then the content is the same, with a small factor of uncertainty.

The fastest hashing speed I got was with the use of mmap.
I also compared it to the md5sum tool, which is written in C.

Quote:[deadeye@nexus ~]$ python hasher.py
Downloads/xxx.mkv: 1.23 GiB

naive took 1.69 s
f27579b6142ae71fb34526374b482433

fast took 1.46 s
f27579b6142ae71fb34526374b482433

md5sum took 1.50 s
f27579b6142ae71fb34526374b482433


The code:
import mmap
import os
import subprocess
import time
from functools import wraps
from hashlib import md5


def file_size(file):
    units = "B KiB MiB GiB TiB".split()
    size = os.stat(file).st_size
    for unit in units:
        if size < 1024:
            break
        size /= 1024
    return f"{file}: {size:.2f} {unit}"


def speed(function):
    @wraps(function)
    def inner(*args, **kwargs):
        start = time.perf_counter()
        retval = function(*args, **kwargs)
        stop = time.perf_counter()
        print(f"{function.__name__} took {stop-start:.2f} s")
        return retval

    return inner


# please do not use md5 for cryptography


@speed
def md5sum(file):
    return subprocess.check_output(["md5sum", file], encoding="utf8").split()[0]


@speed
def naive(file):
    chunk_size = 4 * 1024  # 4 KiB chunks
    hasher = md5()
    with open(file, "rb") as fd:
        while chunk := fd.read(chunk_size):
            hasher.update(chunk)

    return hasher.hexdigest()


@speed
def fast(file):
    hasher = md5()
    with open(file, "rb") as fd:
        with mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            hasher.update(mm)

    return hasher.hexdigest()


file = "Downloads/xxx.mkv"
print(file_size(file))
print()

print(naive(file))
print()
print(fast(file))
print()
print(md5sum(file))
I guess the biggest speed difference depends on the chunk size: a small chunk size means many IOPS, a big chunk size means fewer IOPS but more memory consumption.
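
If you want to see that effect on your own files, a rough sketch like this (the same naive read loop as above, just timed with different chunk sizes; the file path is only a placeholder) should show it:

import time
from hashlib import md5

def timed_md5(file, chunk_size):
    hasher = md5()
    start = time.perf_counter()
    with open(file, "rb") as fd:
        while chunk := fd.read(chunk_size):
            hasher.update(chunk)
    return time.perf_counter() - start, hasher.hexdigest()

# compare a few chunk sizes on the same file
for chunk_size in (4 * 1024, 64 * 1024, 1024 * 1024):
    seconds, digest = timed_md5("Downloads/xxx.mkv", chunk_size)
    print(f"{chunk_size // 1024:>5} KiB chunks: {seconds:.2f} s  {digest}")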
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#5
(Nov-01-2022, 06:30 PM)Larz60+ Wrote: md5 is no longer a secure cryptographic hash: different messages can hash to the same value.
Thus, md5 is considered broken, and has been for 20 years.

I fully accept what you say.

The reason I suggested this (MD4, in fact) is that I use said hash for generating a checksum for items in a CSV file before the data is accepted. I've filtered 5000+ items (and counting) in this way, and have never had a false positive.

I'll report back if (or when?) I do.
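
Roughly, the idea looks like this; just a sketch with a hypothetical items.csv, and md5 standing in here because hashlib always provides it (md4 would be a drop-in via hashlib.new() where the OpenSSL build supports it):

import csv
import hashlib

seen = set()
accepted = []

with open("items.csv", newline="") as f:
    for row in csv.reader(f):
        # checksum of the whole row; identical rows produce identical digests
        digest = hashlib.md5(",".join(row).encode()).hexdigest()
        if digest in seen:
            continue  # already seen, do not accept it again
        seen.add(digest)
        accepted.append(row)

print(f"accepted {len(accepted)} unique items")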
Sig:
>>> import this

The UNIX philosophy: "Do one thing, and do it well."

"The danger of computers becoming like humans is not as great as the danger of humans becoming like computers." :~ Konrad Zuse

"Everything should be made as simple as possible, but not simpler." :~ Albert Einstein

