Python Forum

Full Version: Find duplicate files in multiple directories
Hello,

This snippet finds duplicate files in multiple directories and counts the number of duplicates.
import os

dir1 = '/...path1/'
dir2 = '/...path2/'
dir3 = '/...path3/'
dir4 = '/...path4/'
dir5 = '/...path5/'

list1 = [dir1, dir2, dir3, dir4, dir5]

files = []
for folder in list1:
    files = files + [f for f in os.listdir(folder)]

dup = {x for x in files if files.count(x) > 1}

for item in sorted(dup):
    print('{}\t{}'.format(item, files.count(item)))
What I'm looking for is to create a dataset (maybe a dictionary) where each duplicate file is associated with the locations where the duplicates of the file occur.
Any ideas?
Thanks.
Hello,

possible blueprint:

Make files a dictionary: create a key for each file with a list as the value, and append the file's corresponding path to that list.
Once the dict is complete, you can iterate over it looking for values with a length > 1.
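A minimal sketch of that blueprint, using throwaway temp folders so it runs standalone (the folder and file names are made up for demonstration):

```python
import os
import tempfile

# Build two throwaway folders containing one shared filename (demo setup only)
root = tempfile.mkdtemp()
folders = [os.path.join(root, d) for d in ('a', 'b')]
for folder in folders:
    os.mkdir(folder)
    with open(os.path.join(folder, 'shared.txt'), 'w') as f:
        f.write('x')

# Blueprint: one dict entry per filename, value is the list of folders seen
files = {}
for folder in folders:
    for name in os.listdir(folder):
        files.setdefault(name, []).append(folder)

# Keep only filenames that occur in more than one folder
duplicates = {name: paths for name, paths in files.items() if len(paths) > 1}
print(duplicates)
```

setdefault() creates the list on first sight of a filename, so no explicit "is the key already there?" check is needed.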

To iterate over a directory, you may want to use the newer / nicer pathlib module. To iterate over files in a directory non-recursively:

>>> from pathlib import Path
>>> p = Path('/path/to/dir')
>>> for child in p.iterdir():
...     print(child)
To iterate recursively (note that glob('*.*') is not recursive and would also skip files without an extension; rglob('*') covers everything):

>>> from pathlib import Path
>>> p = Path('/path/to/dir')
>>> for file in p.rglob('*'):
...     print(file)
Regards, noisefloor
Is this something like what you want to do?
from collections import defaultdict

groups = {
    'A': (1, 2, 3),
    'B': (1, 4, 5),
    'C': (3, 5, 7),
    'D': (1, 7)
}

numbers = defaultdict(list)
for group in groups:
    for number in groups[group]:
        numbers[number].append(group)
print(numbers)
Output:
defaultdict(<class 'list'>, {1: ['A', 'B', 'D'], 2: ['A'], 3: ['A', 'C'], 4: ['B'], 5: ['B', 'C'], 7: ['C', 'D']})
Except in your case the dictionary keys would be filenames and the dictionary values a list of folders that contain that filename?

You cannot sort a dictionary, but you can build a dictionary that is sorted. This sorts the dictionary items and creates a new dictionary from the sorted dictionary items.
def dict_sort(src, key=None, reverse=False):
    return dict(sorted(src.items(), key=key, reverse=reverse))

print(dict_sort(numbers, key=lambda x: len(x[1]), reverse=True))
Output:
{1: ['A', 'B', 'D'], 3: ['A', 'C'], 5: ['B', 'C'], 7: ['C', 'D'], 2: ['A'], 4: ['B']}
(Dec-25-2022, 07:38 PM)noisefloor Wrote: [ -> ]Hello,
possible blueprint:
Make files a Dictionary, create for each file a key with a list as the value and and the file's corresponding path to the list.
Once the dict is complete, you can iterate over the dict looking for values with a length > 1.
Regards, noisefloor

Hello,
I tried this ... but the 'location' values get overwritten rather than accumulated.
list1 = [dir1, dir2, dir3, dir4, dir5]

files = {}
for folder in list1:
    files.update({f:folder for f in os.listdir(folder)})
The comprehension expression has to be modified in some way.
Neither update() nor a dictionary comprehension is the right tool for this problem.

update() will not work because you are not always adding keys to the dictionary, sometimes you want to modify existing dictionary values.
Is it about duplicate filenames or duplicate files?

Files with same name aren’t guaranteed to be duplicate files and vice versa - files with different names aren’t guaranteed not to be duplicate files.
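For example, hashing the contents makes that distinction concrete; a sketch using in-memory bytes standing in for file contents:

```python
import hashlib

# Two files can share a name but differ in content...
content_a = b'hello'
content_b = b'goodbye'
print(hashlib.md5(content_a).hexdigest() == hashlib.md5(content_b).hexdigest())  # False

# ...and two differently named files can be byte-identical copies.
copy_of_a = b'hello'
print(hashlib.md5(content_a).hexdigest() == hashlib.md5(copy_of_a).hexdigest())  # True
```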
(Dec-26-2022, 08:23 PM)perfringo Wrote: [ -> ]Is it about duplicate filenames or duplicate files?

Files with same name aren’t guaranteed to be duplicate files and vice versa - files with different names aren’t guaranteed not to be duplicate files.
It's about duplicate files (sure, with the same filenames), located in different folders.
The problem is to get some kind of dataset where each duplicated file (or filename) is associated with the locations where this file exists.
(Dec-26-2022, 08:38 PM)Pavel_47 Wrote: [ -> ]It's about duplicate files (sure, with the same filenames), located in different folders.
The problem is to get some kind of dataset where each duplicated file (or filename) is associated with the locations where this file exists.
You should use a hash value for the files,
rather than identifying the file content by name, extension, size, or some other designation that has no guarantee of working.
If you use pathlib it will give the full path (recursively, using rglob) where duplicates exist, and hashlib works fine for this.
Example.
import hashlib
from pathlib import Path

def compute_hash(file_path):
    with open(file_path, 'rb') as f:
        hash_obj = hashlib.md5()
        hash_obj.update(f.read())
        return hash_obj.hexdigest()

def find_duplicate(root_path):
    hashes = {}
    for file_path in root_path.rglob('*'):
        # print(file_path)
        if file_path.is_file():
            hash_value = compute_hash(file_path)
            if hash_value in hashes:
                print(f'Duplicate file found: {file_path}')
            else:
                hashes[hash_value] = file_path

if __name__ == '__main__':
    root_path = Path(r'G:\div_code')
    find_duplicate(root_path)
Output:
Duplicate file found: G:\div_code\test_cs\foo\bar.txt
Duplicate file found: G:\div_code\test_cs\foo\some_folder\egg2.txt
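To go from printing duplicates to the dataset the original question asked for, the same idea can group every path by its content hash; a sketch (the directory you pass in is up to you):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicate_groups(root_path):
    """Map content hash -> list of all paths under root_path sharing that content."""
    groups = defaultdict(list)
    for file_path in Path(root_path).rglob('*'):
        if file_path.is_file():
            digest = hashlib.md5(file_path.read_bytes()).hexdigest()
            groups[digest].append(file_path)
    # Keep only hashes seen more than once, i.e. actual duplicates
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Each value in the returned dict is the full list of locations of one duplicated file, regardless of what the copies are named.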
(Dec-26-2022, 05:22 PM)deanhystad Wrote: [ -> ]Neither update() or a dictionary comprehension are the right tools for this problem.

update() will not work because you are not always adding keys to the dictionary, sometimes you want to modify existing dictionary values.

Indeed ... neither update() nor a dictionary comprehension is the right tool for this problem.
Here is the solution that works ...
list1 = [dir1, dir2, dir3, dir4, dir5]

files = {}
for folder in list1:
    for f in os.listdir(folder):
        if f in files:
            files[f] = files[f] + [folder]
        else:
            files[f] = [folder]

for k, v in files.items():
    if len(v) > 1:
        print(f'{k:<60}{len(v):<3}{v}')
defaultdict can clean this part up.
for f in os.listdir(folder):
    if f in files:
        files[f] = files[f] + [folder]
    else:
        files[f] = [folder]
import collections

# Using a defaultdict to make the list as needed
files = collections.defaultdict(list)
for f in os.listdir(folder):
    files[f].append(folder)
files[f].append(folder) is about 10x faster than files[f] = files[f] + [folder], probably because you aren't building and disposing of a whole new list each time you add an item.
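A rough way to check that claim with timeit (the exact ratio varies by machine and list size):

```python
import timeit

def concat():
    # Rebuilds the whole list on every iteration: O(n) per step
    items = []
    for _ in range(1000):
        items = items + ['x']
    return items

def append():
    # Appends in place: amortized O(1) per step
    items = []
    for _ in range(1000):
        items.append('x')
    return items

print('concat:', timeit.timeit(concat, number=100))
print('append:', timeit.timeit(append, number=100))
```

Both produce the same list; the concatenating version just pays to copy every existing element on each addition.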