Python Forum

Full Version: Hashing big files
I have to hash big files, and even with fast algorithms it takes a lot of time. The files are on regular HDDs, not SSD storage. I've tried the usual way, but it is slow.
I am thinking of opening a file, hashing the first 4 KB, skipping several megabytes, then reading another 4 KB and adding it to the hash, and so on.

I am asking how many megabytes it is safe to skip in that process. I have to be sure that there are no collisions between the hashes if the files are not the same. Is that approach going to work?
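
A minimal sketch of that sampling idea (the 4 KB chunk, the 8 MB stride, and the choice of blake2b are arbitrary assumptions for illustration, not a recommendation):

import hashlib
import os

def sampled_hash(path, chunk_size=4096, stride=8 * 1024 * 1024):
    """Hash chunk_size bytes at every stride-sized step through the file."""
    size = os.path.getsize(path)
    h = hashlib.blake2b()
    h.update(size.to_bytes(8, "little"))  # mix in the length so files of different sizes never match
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            f.seek(offset)
            h.update(f.read(chunk_size))  # may read less than chunk_size near the end of the file
            offset += stride
    return h.hexdigest()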
Could you please elaborate?

When I think of file hashes, I think of keys being hashed, with corresponding file positions, stored either in an in-memory table or as a separate file on disk.
Like an md5 or sha256 sum, or B2 for instance.
If you skip megabytes, there will be obvious collisions: one only needs to change the skipped bytes to produce two different files with the same hash.
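
For what it's worth, here is a quick demonstration of that point, reusing the sampled_hash sketch above (the file names and the 16 MB size are made up):

# Build two 16 MB files that differ only inside a region the sampler skips,
# so the sampled hashes come out identical even though the files differ.
payload = bytearray(16 * 1024 * 1024)        # all zero bytes
with open("a.bin", "wb") as f:
    f.write(payload)

payload[5 * 1024 * 1024] = 0xFF              # flip one byte inside a skipped span
with open("b.bin", "wb") as f:
    f.write(payload)

print(sampled_hash("a.bin") == sampled_hash("b.bin"))  # True -- a collision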
I am doing it for myself. There is no one who would do that.
It depends on what you're trying to do with the hashes, but one thing I've done to save time when looking for duplicate files is to only hash files that have the same size (this works great with media such as audio, video and pictures).
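
A rough sketch of that size-first filter (find_duplicates, the directory walk and the 1 MB read size are assumptions for illustration, not code from this thread):

import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Group files by size first; only files that share a size get hashed."""
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # unreadable or vanished file, skip it

    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, no need to hash
        for path in paths:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB blocks
                    h.update(block)
            by_digest[h.hexdigest()].append(path)

    return {digest: paths for digest, paths in by_digest.items() if len(paths) > 1}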
I did the same but it's still slow.
Out of curiosity, what kind of data are you working with? And what is the ultimate problem to be solved? For example, hashing is an implementation detail of "detect duplicate files", not the actual goal.
The files are mostly media files: videos of several GB each, and music.