Python Forum
Find duplicate files in multiple directories
#1
Hello,

This snippet finds duplicate files in multiple directories and counts the number of duplicates.
import os

dir1 = '/...path1/'
dir2 = '/...path2/'
dir3 = '/...path3/'
dir4 = '/...path4/'
dir5 = '/...path5/'

list1 = [dir1, dir2, dir3, dir4, dir5]

files = []
for folder in list1:
    files = files + [f for f in os.listdir(folder)]

dup = {x for x in files if files.count(x) > 1}

for item in sorted(dup):
    print('{}\t{}'.format(item, files.count(item)))
What I'm looking for is to create a dataset (maybe a dictionary) where each duplicate file is associated with the locations where the duplicates of the file occur.
Any ideas?
Thanks.
#2
Hello,

possible blueprint:

Make files a dictionary: create a key for each file with a list as the value, and append the file's corresponding path to that list.
Once the dict is complete, you can iterate over the dict looking for values with a length > 1.
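To make the blueprint concrete, here is a minimal sketch; the folder and file names are made up for illustration, using temp directories so it runs anywhere:

```python
import os
import tempfile

# Hypothetical setup: two folders that share one filename ('a.txt')
base = tempfile.mkdtemp()
dirs = []
for name, fnames in [('d1', ['a.txt', 'b.txt']), ('d2', ['a.txt', 'c.txt'])]:
    d = os.path.join(base, name)
    os.mkdir(d)
    for f in fnames:
        open(os.path.join(d, f), 'w').close()
    dirs.append(d)

# Blueprint: filename -> list of folders containing it
files = {}
for folder in dirs:
    for f in os.listdir(folder):
        files.setdefault(f, []).append(folder)

# Duplicates are entries whose folder list has length > 1
dups = {f: folders for f, folders in files.items() if len(folders) > 1}
print(sorted(dups))  # ['a.txt']
```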

To iterate over a directory, you may want to use the newer / nicer pathlib module. To iterate over files in a directory non-recursively:

>>> from pathlib import Path
>>> p = Path('/path/to/dir')
>>> for child in p.iterdir():
...     print(child)
To iterate recursively:

>>> from pathlib import Path
>>> p = Path('/path/to/dir')
>>> for file in p.rglob('*'):
...    print(file)
Regards, noisefloor
#3
Is this something like what you want to do?
from collections import defaultdict

groups = {
    'A': (1, 2, 3),
    'B': (1, 4, 5),
    'C': (3, 5, 7),
    'D': (1, 7)
}

numbers = defaultdict(list)
for group in groups:
    for number in groups[group]:
        numbers[number].append(group)
print(numbers)
Output:
defaultdict(<class 'list'>, {1: ['A', 'B', 'D'], 2: ['A'], 3: ['A', 'C'], 4: ['B'], 5: ['B', 'C'], 7: ['C', 'D']})
Except in your case the dictionary keys would be filenames and the dictionary values a list of folders that contain that filename?

You cannot sort a dictionary, but you can build a dictionary that is sorted. This sorts the dictionary items and creates a new dictionary from the sorted dictionary items.
def dict_sort(src, key=None, reverse=False):
    return dict(sorted(src.items(), key=key, reverse=reverse))

print(dict_sort(numbers, key=lambda x: len(x[1]), reverse=True))
Output:
{1: ['A', 'B', 'D'], 3: ['A', 'C'], 5: ['B', 'C'], 7: ['C', 'D'], 2: ['A'], 4: ['B']}
#4
(Dec-25-2022, 07:38 PM)noisefloor Wrote: Hello,
possible blueprint:
Make files a dictionary: create a key for each file with a list as the value, and append the file's corresponding path to that list.
Once the dict is complete, you can iterate over the dict looking for values with a length > 1.
Regards, noisefloor

Hello,
I tried this ... but the 'location' values are overwritten instead of appended.
list1 = [dir1, dir2, dir3, dir4, dir5]

files = {}
for folder in list1:
    files.update({f:folder for f in os.listdir(folder)})
The comprehension expression has to be modified in some way.
#5
Neither update() nor a dictionary comprehension is the right tool for this problem.

update() will not work because you are not always adding keys to the dictionary; sometimes you want to modify existing dictionary values.
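A quick demonstration of the overwriting behavior, using made-up paths:

```python
# update() replaces the existing value instead of extending it
files = {'report.txt': ['/dir1']}
files.update({'report.txt': ['/dir2']})
print(files)  # {'report.txt': ['/dir2']} -- '/dir1' is lost

# setdefault (or a defaultdict) appends instead
files = {'report.txt': ['/dir1']}
files.setdefault('report.txt', []).append('/dir2')
print(files)  # {'report.txt': ['/dir1', '/dir2']}
```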
#6
Is it about duplicate filenames or duplicate files?

Files with same name aren’t guaranteed to be duplicate files and vice versa - files with different names aren’t guaranteed not to be duplicate files.
#7
(Dec-26-2022, 08:23 PM)perfringo Wrote: Is it about duplicate filenames or duplicate files?

Files with same name aren’t guaranteed to be duplicate files and vice versa - files with different names aren’t guaranteed not to be duplicate files.
It's about duplicate files (with the same filenames), located in different folders.
The problem is to get some kind of dataset where each duplicated file (or filename) is associated with the locations where this file exists.
#8
(Dec-26-2022, 08:38 PM)Pavel_47 Wrote: It's about duplicate files (with the same filenames), located in different folders.
The problem is to get some kind of dataset where each duplicated file (or filename) is associated with the locations where this file exists.
You should use a hash value for the files, rather than identifying file content by name, extension, size, or other designations, which have no guarantee of working.
If you use pathlib, rglob will give the full path (recursively) where a duplicate exists, and hashlib works fine for this.
Example:
import hashlib
from pathlib import Path

def compute_hash(file_path):
    with open(file_path, 'rb') as f:
        hash_obj = hashlib.md5()
        hash_obj.update(f.read())
        return hash_obj.hexdigest()

def find_duplicate(root_path):
    hashes = {}
    for file_path in root_path.rglob('*'):
        # print(file_path)
        if file_path.is_file():
            hash_value = compute_hash(file_path)
            if hash_value in hashes:
                print(f'Duplicate file found: {file_path}')
            else:
                hashes[hash_value] = file_path

if __name__ == '__main__':
    root_path = Path(r'G:\div_code')
    find_duplicate(root_path)
Output:
Duplicate file found: G:\div_code\test_cs\foo\bar.txt
Duplicate file found: G:\div_code\test_cs\foo\some_folder\egg2.txt
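One caveat: compute_hash above reads the whole file into memory at once. For large files you may want to hash in fixed-size chunks instead; a sketch (the chunk size is an arbitrary choice):

```python
import hashlib
import tempfile

def compute_hash_chunked(file_path, chunk_size=65536):
    """Hash a file in fixed-size chunks so large files are never fully in memory."""
    hash_obj = hashlib.md5()
    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            hash_obj.update(chunk)
    return hash_obj.hexdigest()

# Quick check against hashing the whole content at once
data = b'spam and eggs' * 10000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

assert compute_hash_chunked(path) == hashlib.md5(data).hexdigest()
```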
#9
(Dec-26-2022, 05:22 PM)deanhystad Wrote: Neither update() nor a dictionary comprehension is the right tool for this problem.

update() will not work because you are not always adding keys to the dictionary, sometimes you want to modify existing dictionary values.

Indeed, neither update() nor a dictionary comprehension is the right tool for this problem.
Here is a solution that works:
list1 = [dir1, dir2, dir3, dir4, dir5]

files = {}
for folder in list1:
    for f in os.listdir(folder):
        if f in files:
            files[f] = files[f] + [folder]
        else:
            files[f] = [folder]

for k, v in files.items():
    if len(v) > 1:
        print(f'{k:<60}{len(v):<3}{v}')
#10
defaultdict can clean this part up.
for f in os.listdir(folder):
    if f in files:
        files[f] = files[f] + [folder]
    else:
        files[f] = [folder]
import collections

# Using a defaultdict to make the list as needed
files = collections.defaultdict(list)
for f in os.listdir(folder):
    files[f].append(folder)
files[f].append(folder) is about 10x faster than files[f] = files[f] + [folder], probably because you aren't building and discarding a new list each time you add an item.
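A rough way to check the speed claim yourself with timeit (the exact ratio varies with machine and list size):

```python
import timeit

# Rebuilding the list copies all existing elements on every iteration (O(n) per step)
rebuild = timeit.timeit("lst = lst + ['x']", setup="lst = []", number=10000)

# append is amortized O(1) per step
append = timeit.timeit("lst.append('x')", setup="lst = []", number=10000)

print(f'rebuild: {rebuild:.4f}s  append: {append:.4f}s')
```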