Dec-26-2022, 11:15 PM
(Dec-26-2022, 08:38 PM)Pavel_47 Wrote: It's about duplicate files (sure, with the same filenames), located in different folders. The problem is to get some kind of dataset where each duplicated file (or filename) is associated with the locations where this file exists.
You should use a hash of each file's contents, rather than identifying files by name, extension, size, or some other designation, none of which is guaranteed to work. pathlib gives the full path of every file (walking the tree recursively with rglob), and hashlib works fine for the hashing. Example:
import hashlib
from pathlib import Path

def compute_hash(file_path):
    # MD5 of the whole file contents, read in one go
    with open(file_path, 'rb') as f:
        hash_obj = hashlib.md5()
        hash_obj.update(f.read())
    return hash_obj.hexdigest()

def find_duplicate(root_path):
    hashes = {}
    # rglob('*') walks root_path recursively
    for file_path in root_path.rglob('*'):
        # print(file_path)
        if file_path.is_file():
            hash_value = compute_hash(file_path)
            if hash_value in hashes:
                print(f'Duplicate file found: {file_path}')
            else:
                hashes[hash_value] = file_path

if __name__ == '__main__':
    root_path = Path(r'G:\div_code')
    find_duplicate(root_path)
Output:
Duplicate file found: G:\div_code\test_cs\foo\bar.txt
Duplicate file found: G:\div_code\test_cs\foo\some_folder\egg2.txt
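Since the goal was a dataset where each duplicated file is associated with all of its locations, here is a minimal variant sketch that collects every path per content hash into a dict instead of printing only the later hits. It assumes the same G:\div_code root as above (adjust to your tree), and it reads files in chunks so large files don't have to fit in memory.

import hashlib
from collections import defaultdict
from pathlib import Path

def compute_hash(file_path, chunk_size=65536):
    # Hash in fixed-size chunks so large files don't need to fit in memory
    hash_obj = hashlib.md5()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hash_obj.update(chunk)
    return hash_obj.hexdigest()

def find_duplicates(root_path):
    # Map each content hash to every path that has that content
    hashes = defaultdict(list)
    for file_path in root_path.rglob('*'):
        if file_path.is_file():
            hashes[compute_hash(file_path)].append(file_path)
    # Keep only hashes seen more than once, i.e. real duplicates
    return {h: paths for h, paths in hashes.items() if len(paths) > 1}

if __name__ == '__main__':
    root_path = Path(r'G:\div_code')  # placeholder root, change to your folder
    for h, paths in find_duplicates(root_path).items():
        print(f'{h}:')
        for p in paths:
            print(f'    {p}')

MD5 is fine here because it is only used to bucket identical content, not for security; swap in hashlib.sha256 if collisions are a concern.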