Python Forum
Hashing big files
#1
I have to hash big files, and even with fast algorithms it takes a lot of time (regular HDD storage, not SSD). I've tried the usual way of hashing the whole file, but it is slow.
I am thinking of opening a file, hashing the first 4 KiB, skipping several megabytes, hashing another 4 KiB, and so on, feeding each sample into the running hash.

How many megabytes is it safe to skip in that process? I have to be sure that there are no collisions between the hashes if the files are not the same. Is that approach going to work?
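For reference, a minimal sketch of that sampling idea (the function name and the chunk/stride defaults are my own choices, not anything standard):

```python
import hashlib
import os

def sparse_hash(path, chunk=4096, stride=16 * 1024 * 1024):
    """Hash a 4 KiB sample every `stride` bytes, mixing in the file size.

    Note: this is only a fast pre-filter. Two files that differ only in
    the skipped regions will produce the same digest, so any match must
    be confirmed with a full-file hash afterwards.
    """
    h = hashlib.blake2b()
    size = os.path.getsize(path)
    h.update(size.to_bytes(8, "little"))  # cheap extra discriminator
    with open(path, "rb") as f:
        pos = 0
        while pos < size:
            f.seek(pos)
            h.update(f.read(chunk))
            pos += stride
    return h.hexdigest()
```

`seek()` makes the skipped bytes essentially free on any storage, which is where the speedup comes from; the cost is the collision risk discussed below.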
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#2
Could you please elaborate?

When I think of file hashes, I think of keys being hashed, with corresponding file positions, stored either in an in-memory table or as a separate file on disk.
#3
Like an md5 or sha256 sum. Or BLAKE2 (b2sum), for instance.
#4
If you skip megabytes there will be obvious collisions: one only needs to change the skipped bytes to create a collision.
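To make that concrete, here is a toy demonstration (scaled down to bytes instead of megabytes; the helper name is hypothetical):

```python
import hashlib

def sample_hash(data, chunk=4, stride=16):
    """Toy sampling hash: hash `chunk` bytes taken every `stride` bytes."""
    h = hashlib.sha256()
    for pos in range(0, len(data), stride):
        h.update(data[pos:pos + chunk])
    return h.hexdigest()

a = bytearray(b"\x00" * 64)
b = bytearray(a)
b[5] = 0xFF  # change a byte inside a skipped region
assert sample_hash(bytes(a)) == sample_hash(bytes(b))  # collision!

c = bytearray(a)
c[0] = 0xFF  # change a byte inside a sampled region
assert sample_hash(bytes(a)) != sample_hash(bytes(c))  # detected
```

Any byte outside the sampled windows can change without affecting the digest, which is why a sampling hash can only ever rule files out, never confirm equality.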
#5
I am doing it for myself. There is no one who would deliberately do that to my files.
#6
It depends on what you're trying to do with the hashes, but one thing I've done to save time when looking for duplicate files is to only hash files that have the same size (this works great with media like audio, video, and pictures).
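A sketch of that size-first approach (the function name is made up for illustration): group files by size, and only hash the groups where the size is shared.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Find duplicate files under `root`, hashing only size-matched files."""
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # unreadable entry; skip it

    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # unique size -> cannot be a duplicate, no hashing needed
        for path in paths:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
            by_digest[h.hexdigest()].append(path)

    return {d: p for d, p in by_digest.items() if len(p) > 1}
```

For a media collection, most files have a unique size, so the expensive full-file hashing only runs on a small fraction of the data.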
Feel like you're not getting the answers you want? Check out the help/rules for things like what to include/not include in a post, how to use code tags, how to ask smart questions, and more.

Pro-tip - there's an inverse correlation between the number of lines of code posted and my enthusiasm for helping with a question :)
#7
I did the same but it's still slow.
#8
Out of curiosity, what kind of data are you working with? And what is the ultimate problem to be solved? E.g., hashing is an implementation detail of "detect duplicate files", not the actual goal.
#9
The files are mostly media: video, several GB each, and music.