Search for duplicated files

wavic · Oct-12-2017, 11:54 PM

os.walk already uses os.scandir internally, at least in Python 3.6.
However, I have tried to make it faster by using concurrent.futures.ProcessPoolExecutor to run the hashing on all cores, but I don't know how much it helps. I assumed md5 hashing would be a CPU-heavy operation, but on my old laptop it turns out it isn't: the speed depends on disk performance.
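For anyone following along, here is a minimal sketch of what hashing files through a process pool can look like. It is not the actual script: the names sha1_of and hash_all, the executor_cls parameter and the 1 MiB block size are assumptions made for illustration.

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor


def sha1_of(path, block_size=2 ** 20):
    """Return (path, hex digest), reading the file in blocks.

    block_size is an arbitrary choice for this sketch.
    """
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return path, h.hexdigest()


def hash_all(paths, executor_cls=ProcessPoolExecutor, workers=None):
    """Hash every path in a pool of workers and return {path: digest}.

    executor_cls is a hypothetical knob so the same code can be tried
    with ThreadPoolExecutor as well.
    """
    with executor_cls(max_workers=workers) as pool:
        return dict(pool.map(sha1_of, paths))
```

When using ProcessPoolExecutor, call hash_all from under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module. If the disk is the bottleneck, as described above, extra processes are unlikely to help much.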
I have switched from md5 to sha1 because of the possibility of two different files producing the same hash. The chance is minimal, but still... I may switch again to blake2b. If it turns out that the CPU is not the bottleneck, I will try asyncio instead. I have tried it already, without success; that library makes my head explode. I will also try hashing only pieces of each file instead of the whole file. According to its website, blake2b can do 1 GB per second.
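One possible shape for the "pieces of the file" idea, as a rough sketch only. The function name partial_digest, the 64 KiB sample size and the choice of start/middle/end samples are all assumptions, not something from the script.

```python
import hashlib
import os

SAMPLE_SIZE = 64 * 1024  # bytes per sample; an arbitrary choice


def partial_digest(path, sample_size=SAMPLE_SIZE):
    """Hash the file size plus samples from the start, middle and end."""
    size = os.path.getsize(path)
    h = hashlib.blake2b()
    h.update(str(size).encode())  # different sizes never match
    with open(path, "rb") as f:
        if size <= 3 * sample_size:
            h.update(f.read())  # small file: hash all of it
        else:
            for offset in (0, size // 2, size - sample_size):
                f.seek(offset)
                h.update(f.read(sample_size))
    return h.hexdigest()
```

Files that differ only outside the sampled regions get the same partial digest, so a match here is only a candidate and should be confirmed with a full hash or a byte-by-byte comparison.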
I made a few more changes. If I run the script on my /home directory while, for example, the web browser is running, it exits with errors. That is normal because of the .cache directory. I am ignoring those errors for now, but it seems better to make it skip the whole directory.
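Skipping a whole directory with os.walk can be done by pruning the directory list in place. A small sketch, where iter_files and SKIP_DIRS are names made up for this example:

```python
import os

SKIP_DIRS = {".cache"}  # directory names to prune; adjust as needed


def iter_files(root, skip_dirs=SKIP_DIRS):
    """Yield file paths under root without descending into skip_dirs."""
    for dirpath, dirnames, filenames in os.walk(root):
        # In the default top-down mode, editing dirnames in place
        # tells os.walk which subdirectories not to visit.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

The slice assignment matters: rebinding dirnames to a new list would not affect the walk. Files that vanish mid-scan elsewhere can still raise OSError when opened, so catching that around the hashing step remains useful.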

hbknjr · Oct-13-2017, 07:22 AM

Tried blake2b but got better results with xxhash.

I found that block size matters a lot. Overall I got better results (with larger files) using 2**17 (131072 bytes) instead of 1 MB.
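To show where the block size comes in, here is a minimal sketch of block-wise hashing. It uses hashlib from the standard library rather than xxhash so it runs without extra packages; file_digest and its parameters are names assumed for the example.

```python
import hashlib

BLOCK_SIZE = 2 ** 17  # 131072 bytes, the value mentioned above


def file_digest(path, block_size=BLOCK_SIZE, algorithm="sha1"):
    """Return the hex digest of a file, read block_size bytes at a time."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()
```

The digest does not depend on the block size, only the speed does, so the value can be tuned freely.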

Using concurrent.futures made the execution time worse, and multiprocessing.Pool seemed to have no effect. I don't know why; maybe it has something to do with how the hashes are calculated.

My current updated script took 610 seconds (about 10 minutes) to find 14,170 duplicates among 53,491 files in 7,988 folders, 165 GB in total.

Next I'll try changing the block/buffer size according to the file size, but I don't know exactly how buffer size affects hashing speed relative to file size.
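One way to find out is simply to measure it. A rough timing sketch, with time_block_sizes and the candidate sizes chosen arbitrarily for illustration, and hashlib.sha1 standing in for whichever hash is actually used:

```python
import hashlib
import time


def time_block_sizes(path, block_sizes=(2 ** 12, 2 ** 17, 2 ** 20)):
    """Return {block_size: seconds} for hashing path with each size."""
    timings = {}
    for block_size in block_sizes:
        h = hashlib.sha1()
        start = time.perf_counter()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(block_size), b""):
                h.update(block)
        timings[block_size] = time.perf_counter() - start
    return timings
```

Keep in mind that after the first pass the file is probably in the OS cache, so later passes may look faster than they would on a cold read. Running this on files of several sizes, and repeating the measurements, should give a fairer picture.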
