Posts: 215
Threads: 55
Joined: Sep 2019
Hello,
This snippet finds duplicate files in multiple directories and counts the number of duplicates.
import os
dir1 = '/...path1/'
dir2 = '/...path2/'
dir3 = '/...path3/'
dir4 = '/...path4/'
dir5 = '/...path5/'
list1 = [dir1, dir2, dir3, dir4, dir5]
files = []
for folder in list1:
    files = files + [f for f in os.listdir(folder)]
dup = {x for x in files if files.count(x) > 1}
for item in sorted(dup):
    print('{}\t{}'.format(item, files.count(item)))
What I'm looking for is to create a dataset (maybe a dictionary) where each duplicate file is associated with the locations where its duplicates occur.
Any ideas?
Thanks.
Posts: 133
Threads: 0
Joined: Jun 2019
Hello,
possible blueprint:
Make files a dictionary: create a key for each file with a list as the value, and append the file's corresponding path to the list.
Once the dict is complete, you can iterate over the dict looking for values with a length > 1.
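A minimal sketch of that blueprint, assuming the directories are already collected in a list as in your first snippet (the dir* names are placeholders):
import os

list1 = [dir1, dir2, dir3, dir4, dir5]  # placeholder directory paths
files = {}
for folder in list1:
    for f in os.listdir(folder):
        # setdefault creates the empty list the first time a filename is seen
        files.setdefault(f, []).append(folder)

# keep only filenames that occur in more than one folder
duplicates = {name: folders for name, folders in files.items() if len(folders) > 1}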
To iterate over a directory, you may want to use the newer / nicer pathlib module. To iterate over files in a directory non-recursively:
>>> from pathlib import Path
>>> p = Path('/path/to/dir')
>>> for child in p.iterdir():
...     print(child)
To iterate recursively:
>>> from pathlib import Path
>>> p = Path('/path/to/dir')
>>> for file in p.rglob('*'):
...     print(file)
Regards, noisefloor
Posts: 6,779
Threads: 20
Joined: Feb 2020
Is this something like what you want to do?
from collections import defaultdict
groups = {
    'A': (1, 2, 3),
    'B': (1, 4, 5),
    'C': (3, 5, 7),
    'D': (1, 7)
}
numbers = defaultdict(list)
for group in groups:
    for number in groups[group]:
        numbers[number].append(group)
print(numbers)
Output: defaultdict(<class 'list'>, {1: ['A', 'B', 'D'], 2: ['A'], 3: ['A', 'C'], 4: ['B'], 5: ['B', 'C'], 7: ['C', 'D']})
Except in your case the dictionary keys would be filenames and the dictionary values a list of folders that contain that filename?
You cannot sort a dictionary, but you can build a dictionary that is sorted. This sorts the dictionary items and creates a new dictionary from the sorted dictionary items.
def dict_sort(src, key=None, reverse=False):
    return dict(sorted(src.items(), key=key, reverse=reverse))

print(dict_sort(numbers, key=lambda x: len(x[1]), reverse=True))
Output: {1: ['A', 'B', 'D'], 3: ['A', 'C'], 5: ['B', 'C'], 7: ['C', 'D'], 2: ['A'], 4: ['B']}
Posts: 215
Threads: 55
Joined: Sep 2019
(Dec-25-2022, 07:38 PM)noisefloor Wrote: Hello,
possible blueprint:
Make files a dictionary: create a key for each file with a list as the value, and append the file's corresponding path to the list.
Once the dict is complete, you can iterate over the dict looking for values with a length > 1.
Regards, noisefloor
Hello,
I tried this ... but the 'location' values aren't appended, they are overwritten.
list1 = [dir1, dir2, dir3, dir4, dir5]
files = {}
for folder in list1:
    files.update({f: folder for f in os.listdir(folder)})
The comprehension expression has to be modified in some way.
Posts: 6,779
Threads: 20
Joined: Feb 2020
Neither update() nor a dictionary comprehension is the right tool for this problem.
update() will not work because you are not always adding new keys to the dictionary; sometimes you need to modify the values of existing keys.
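A quick demonstration of why, using made-up folder names: update() replaces the previous value, whereas what you need is to grow a list per filename.
files = {}
files.update({'report.txt': '/folder_a'})
files.update({'report.txt': '/folder_b'})  # overwrites: '/folder_a' is lost
print(files)  # {'report.txt': '/folder_b'}

# growing a list per key instead keeps both locations
files = {}
for folder in ('/folder_a', '/folder_b'):
    files.setdefault('report.txt', []).append(folder)
print(files)  # {'report.txt': ['/folder_a', '/folder_b']}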
Posts: 1,950
Threads: 8
Joined: Jun 2018
Is it about duplicate filenames or duplicate files?
Files with the same name aren't guaranteed to be duplicate files, and vice versa: files with different names aren't guaranteed not to be duplicate files.
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy
Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Posts: 215
Threads: 55
Joined: Sep 2019
(Dec-26-2022, 08:23 PM)perfringo Wrote: Is it about duplicate filenames or duplicate files?
Files with the same name aren't guaranteed to be duplicate files, and vice versa: files with different names aren't guaranteed not to be duplicate files.
It's about duplicate files (with the same filenames, of course), located in different folders.
The problem is to get some kind of dataset where each duplicated file (or filename) is associated with the locations where this file exists.
Posts: 7,313
Threads: 123
Joined: Sep 2016
(Dec-26-2022, 08:38 PM)Pavel_47 Wrote: It's about duplicate files (sure, with the same filenames), located in different folders.
The problem is to get some kind of dataset where each duplicated file (or filename) is associated with the locations where this file exists.
You should use a hash value for the files.
Rather than identifying the file content by name, extension, size, or some other designation, which is not guaranteed to work.
If you use pathlib, it will give the full path (recursively, using rglob) where a duplicate exists, and hashlib works fine for this.
Example:
import hashlib
from pathlib import Path
def compute_hash(file_path):
    with open(file_path, 'rb') as f:
        hash_obj = hashlib.md5()
        hash_obj.update(f.read())
    return hash_obj.hexdigest()

def find_duplicate(root_path):
    hashes = {}
    for file_path in root_path.rglob('*'):
        # print(file_path)
        if file_path.is_file():
            hash_value = compute_hash(file_path)
            if hash_value in hashes:
                print(f'Duplicate file found: {file_path}')
            else:
                hashes[hash_value] = file_path

if __name__ == '__main__':
    root_path = Path(r'G:\div_code')
    find_duplicate(root_path)
Output: Duplicate file found: G:\div_code\test_cs\foo\bar.txt
Duplicate file found: G:\div_code\test_cs\foo\some_folder\egg2.txt
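If you want the grouped dataset asked about in the first post rather than printed lines, a small variation on find_duplicate (reusing the same compute_hash helper; the name find_duplicate_groups and the shape of the result are my own assumptions) collects every path per hash value:
from collections import defaultdict

def find_duplicate_groups(root_path):
    hashes = defaultdict(list)
    for file_path in root_path.rglob('*'):
        if file_path.is_file():
            hashes[compute_hash(file_path)].append(file_path)
    # keep only hash values that were seen more than once
    return {h: paths for h, paths in hashes.items() if len(paths) > 1}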
Posts: 215
Threads: 55
Joined: Sep 2019
(Dec-26-2022, 05:22 PM)deanhystad Wrote: Neither update() nor a dictionary comprehension is the right tool for this problem.
update() will not work because you are not always adding new keys to the dictionary; sometimes you need to modify the values of existing keys.
Indeed ... neither update() nor a dictionary comprehension is the right tool for this problem.
Here is the solution that works ...
list1 = [dir1, dir2, dir3, dir4, dir5]
files = {}
for folder in list1:
    for f in os.listdir(folder):
        if f in files:
            files[f] = files[f] + [folder]
        else:
            files[f] = [folder]

for k, v in files.items():
    if len(v) > 1:
        print(f'{k:<60}{len(v):<3}{v}')
Posts: 6,779
Threads: 20
Joined: Feb 2020
Dec-27-2022, 04:47 PM
(This post was last modified: Dec-27-2022, 04:49 PM by deanhystad.)
defaultdict can clean this part up.
for f in os.listdir(folder):
    if f in files:
        files[f] = files[f] + [folder]
    else:
        files[f] = [folder]
import collections

# Using a defaultdict to make the list as needed
files = collections.defaultdict(list)
for f in os.listdir(folder):
    files[f].append(folder)
files[f].append(folder) is about 10x faster than files[f] = files[f] + [folder], probably because you aren't disposing of the old list each time you append an item.
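A rough way to check that timing claim yourself (just a sketch; the exact factor will depend on list size and machine):
import timeit

def rebuild(n=1000):
    folders = []
    for _ in range(n):
        folders = folders + ['folder']  # builds a new list every iteration
    return folders

def append(n=1000):
    folders = []
    for _ in range(n):
        folders.append('folder')  # grows the same list in place
    return folders

print('rebuild:', timeit.timeit(rebuild, number=100))
print('append: ', timeit.timeit(append, number=100))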