Python Forum
Hashing big files - Printable Version




Hashing big files - wavic - Mar-25-2018

I have to hash big files, and even with fast algorithms it takes a lot of time. The files are on regular HDDs, not SSDs. I've tried the usual way, but it is slow.
I am thinking of reading the first 4 KB of a file, skipping several megabytes, then reading another 4 KB and adding it to the hash, and so on.

How many megabytes is it safe to skip in that process? I have to be sure that there are no collisions between the hashes if the files are not the same. Is that approach going to work?
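
For reference, the sampling idea described above could be sketched roughly like this. It is only an illustration: the function name `sampled_hash`, the choice of SHA-256, the 4 MiB default skip, and mixing the file size into the hash are all assumptions, not something stated in the post.

```python
import hashlib
import os


def sampled_hash(path, sample_size=4096, skip_size=4 * 1024 * 1024):
    """Hash one small sample per stride instead of the whole file.

    sample_size and skip_size are illustrative defaults, not recommendations.
    """
    digest = hashlib.sha256()
    size = os.path.getsize(path)
    # Assumption: feed the file size in first (fixed width), so two files
    # of different length do not share the same hash input.
    digest.update(size.to_bytes(8, "big"))
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            f.seek(offset)
            digest.update(f.read(sample_size))
            # Jump over the bytes that are never read.
            offset += sample_size + skip_size
    return digest.hexdigest()
```

Bytes that fall inside a skipped region never reach the hash, so this can only tell files apart if they differ in one of the sampled blocks or in length.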


RE: Hashing big files - Larz60+ - Mar-25-2018

Could you please elaborate?

When I think of file hashes, I think of keys being hashed, with corresponding file positions,
stored in either a memory table or as a separate file on disk.


RE: Hashing big files - wavic - Mar-25-2018

Like an md5 or sha256 sum. Or BLAKE2 (B2), for instance.
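
For readers following along, a whole-file checksum of this kind is usually computed in chunks with the standard `hashlib` module, so the file never has to fit in memory. This is a minimal sketch; the helper name `hash_file` and the 1 MiB chunk size are assumptions chosen for illustration.

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the whole file, read chunk by chunk."""
    # hashlib.new accepts names such as "md5", "sha256" or "blake2b".
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `hash_file(path, "blake2b")` would give the BLAKE2b digest of the same file.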


RE: Hashing big files - Gribouillis - Mar-25-2018

If you skip megabytes, there will be obvious collisions. One only needs to change the skipped bytes to create a collision.
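
The point can be shown with a small in-memory experiment. This is a hedged sketch: `sampled_digest`, the 4 KiB sample, and the 4 MiB skip are made-up illustration values, and the two "files" are just byte strings.

```python
import hashlib


def sampled_digest(data, sample_size=4096, skip_size=4 * 1024 * 1024):
    """Hash only one sample_size block per stride of the data."""
    digest = hashlib.sha256()
    step = sample_size + skip_size
    for offset in range(0, len(data), step):
        digest.update(data[offset:offset + sample_size])
    return digest.hexdigest()


original = bytes(5 * 1024 * 1024)   # 5 MiB of zero bytes
changed = bytearray(original)
changed[5000] = 0xFF                # byte 5000 lies in the first skipped region
changed = bytes(changed)

# The contents differ, yet the sampled digests are identical.
assert original != changed
assert sampled_digest(original) == sampled_digest(changed)

# A digest over every byte does tell them apart.
assert hashlib.sha256(original).hexdigest() != hashlib.sha256(changed).hexdigest()
```

The same thing happens with any change confined to the skipped bytes, whether it is deliberate or accidental.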


RE: Hashing big files - wavic - Mar-25-2018

I am doing it for myself. There is no one who would do that.


RE: Hashing big files - micseydel - Apr-06-2018

It depends on what you're trying to do with the hashes, but one thing I've done to save time with hashing when looking for duplicate files is to only bother hashing files that have the same file size (this works great with media like audio, video and pictures).
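
A rough sketch of that size-first approach, assuming the goal is duplicate detection under one directory tree. The names `find_duplicates` and `_full_digest` are hypothetical, and SHA-256 is just one possible choice of hash.

```python
import hashlib
import os
from collections import defaultdict


def _full_digest(path, chunk_size=1024 * 1024):
    """SHA-256 of the whole file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose size and digest both match."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so no hashing
        for path in paths:
            by_digest[(size, _full_digest(path))].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Only files that share a size with at least one other file are ever read, which is where the time saving comes from when most large media files have distinct sizes.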


RE: Hashing big files - wavic - Apr-06-2018

I did the same but it's still slow.


RE: Hashing big files - micseydel - Apr-06-2018

Out of curiosity, what kind of data are you working with? And what is the ultimate problem to be solved? For example, hashing is an implementation detail of "detect duplicate files", not the actual goal.


RE: Hashing big files - wavic - Apr-06-2018

The files are mostly media files. Video. Several GB each. Music.