Posts: 215
Threads: 55
Joined: Sep 2019
Hello,
This snippet finds duplicate files in multiple directories and counts the number of duplicates.
import os
dir1 = '/...path1/'
dir2 = '/...path2/'
dir3 = '/...path3/'
dir4 = '/...path4/'
dir5 = '/...path5/'
list1 = [dir1, dir2, dir3, dir4, dir5]
files = []
for folder in list1:
    files = files + [f for f in os.listdir(folder)]
dup = {x for x in files if files.count(x) > 1}
for item in sorted(dup):
    print('{}\t{}'.format(item, files.count(item)))
What I'm looking for is to create a dataset (maybe a dictionary) where each duplicate file is associated with the locations where its duplicates occur.
Any ideas?
Thanks.
Posts: 133
Threads: 0
Joined: Jun 2019
Hello,
possible blueprint:
Make files a dictionary: create a key for each file with a list as the value, and append the file's corresponding path to the list.
Once the dict is complete, you can iterate over the dict looking for values with a length > 1.
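A minimal sketch of that blueprint, assuming the directories are already collected in a list as in your first snippet (the dir* names are placeholders):
import os

list1 = [dir1, dir2, dir3, dir4, dir5]  # placeholder directory paths
files = {}
for folder in list1:
    for f in os.listdir(folder):
        # setdefault creates the empty list the first time a filename is seen
        files.setdefault(f, []).append(folder)

# keep only filenames that occur in more than one folder
duplicates = {name: folders for name, folders in files.items() if len(folders) > 1}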
To iterate over a directory, you may want to use the newer / nicer pathlib module. To iterate over files in a directory non-recursively:
>>> from pathlib import Path
>>> p = Path('/path/to/dir')
>>> for child in p.iterdir():
...     print(child)
To iterate recursively:
>>> from pathlib import Path
>>> p = Path('/path/to/dir')
>>> for file in p.rglob('*'):
...     print(file)
Regards, noisefloor
Posts: 6,779
Threads: 20
Joined: Feb 2020
Is this something like what you want to do?
from collections import defaultdict
groups = {
    'A': (1, 2, 3),
    'B': (1, 4, 5),
    'C': (3, 5, 7),
    'D': (1, 7)
}
numbers = defaultdict(list)
for group in groups:
    for number in groups[group]:
        numbers[number].append(group)
print(numbers)
Output: defaultdict(<class 'list'>, {1: ['A', 'B', 'D'], 2: ['A'], 3: ['A', 'C'], 4: ['B'], 5: ['B', 'C'], 7: ['C', 'D']})
Except in your case the dictionary keys would be filenames and the dictionary values a list of folders that contain that filename?
You cannot sort a dictionary, but you can build a dictionary that is sorted. This sorts the dictionary items and creates a new dictionary from the sorted dictionary items.
def dict_sort(src, key=None, reverse=False):
    return dict(sorted(src.items(), key=key, reverse=reverse))

print(dict_sort(numbers, key=lambda x: len(x[1]), reverse=True))
Output: {1: ['A', 'B', 'D'], 3: ['A', 'C'], 5: ['B', 'C'], 7: ['C', 'D'], 2: ['A'], 4: ['B']}
Posts: 215
Threads: 55
Joined: Sep 2019
(Dec-25-2022, 07:38 PM)noisefloor Wrote: Hello,
possible blueprint:
Make files a dictionary: create a key for each file with a list as the value, and append the file's corresponding path to the list.
Once the dict is complete, you can iterate over the dict looking for values with a length > 1.
Regards, noisefloor
Hello,
I tried this ... but the 'location' values aren't appended, they are overwritten.
list1 = [dir1, dir2, dir3, dir4, dir5]
files = {}
for folder in list1:
    files.update({f: folder for f in os.listdir(folder)})
The comprehension expression has to be modified in some way.
Posts: 6,779
Threads: 20
Joined: Feb 2020
Neither update() nor a dictionary comprehension is the right tool for this problem.
update() will not work because you are not always adding new keys to the dictionary; sometimes you need to modify the values of existing keys.
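A quick demonstration of why, using made-up folder names: update() replaces the previous value, whereas what you need is to grow a list per filename.
files = {}
files.update({'report.txt': '/folder_a'})
files.update({'report.txt': '/folder_b'})  # overwrites: '/folder_a' is lost
print(files)  # {'report.txt': '/folder_b'}

# growing a list per key instead keeps both locations
files = {}
for folder in ('/folder_a', '/folder_b'):
    files.setdefault('report.txt', []).append(folder)
print(files)  # {'report.txt': ['/folder_a', '/folder_b']}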
Posts: 1,950
Threads: 8
Joined: Jun 2018
Is it about duplicate filenames or duplicate files?
Files with the same name aren't guaranteed to be duplicate files, and vice versa: files with different names aren't guaranteed not to be duplicate files.
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy
Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Posts: 215
Threads: 55
Joined: Sep 2019
(Dec-26-2022, 08:23 PM)perfringo Wrote: Is it about duplicate filenames or duplicate files?
Files with the same name aren't guaranteed to be duplicate files, and vice versa: files with different names aren't guaranteed not to be duplicate files.
It's about duplicate files (with the same filenames, of course), located in different folders.
The problem is to get some kind of dataset where each duplicated file (or filename) is associated with the locations where this file exists.
Posts: 7,313
Threads: 123
Joined: Sep 2016
(Dec-26-2022, 08:38 PM)Pavel_47 Wrote: It's about duplicate files (sure, with the same filenames), located in different folders.
The problem is to get some kind of dataset where each duplicated file (or filename) is associated with the locations where this file exists.
You should use a hash value for the files.
Rather than identifying the file content by name, extension, size, or some other designation, which is not guaranteed to work.
If you use pathlib, it will give the full path (recursively, using rglob) where a duplicate exists, and hashlib works fine for this.
Example:
import hashlib
from pathlib import Path
def compute_hash(file_path):
    with open(file_path, 'rb') as f:
        hash_obj = hashlib.md5()
        hash_obj.update(f.read())
    return hash_obj.hexdigest()

def find_duplicate(root_path):
    hashes = {}
    for file_path in root_path.rglob('*'):
        # print(file_path)
        if file_path.is_file():
            hash_value = compute_hash(file_path)
            if hash_value in hashes:
                print(f'Duplicate file found: {file_path}')
            else:
                hashes[hash_value] = file_path

if __name__ == '__main__':
    root_path = Path(r'G:\div_code')
    find_duplicate(root_path)
Output: Duplicate file found: G:\div_code\test_cs\foo\bar.txt
Duplicate file found: G:\div_code\test_cs\foo\some_folder\egg2.txt
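If you want the grouped dataset asked about in the first post rather than printed lines, a small variation on find_duplicate (reusing the same compute_hash helper; the name find_duplicate_groups and the shape of the result are my own assumptions) collects every path per hash value:
from collections import defaultdict

def find_duplicate_groups(root_path):
    hashes = defaultdict(list)
    for file_path in root_path.rglob('*'):
        if file_path.is_file():
            hashes[compute_hash(file_path)].append(file_path)
    # keep only hash values that were seen more than once
    return {h: paths for h, paths in hashes.items() if len(paths) > 1}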
Posts: 215
Threads: 55
Joined: Sep 2019
(Dec-26-2022, 05:22 PM)deanhystad Wrote: Neither update() nor a dictionary comprehension is the right tool for this problem.
update() will not work because you are not always adding new keys to the dictionary; sometimes you need to modify the values of existing keys.
Indeed ... neither update() nor a dictionary comprehension is the right tool for this problem.
Here is the solution that works ...
list1 = [dir1, dir2, dir3, dir4, dir5]
files = {}
for folder in list1:
    for f in os.listdir(folder):
        if f in files:
            files[f] = files[f] + [folder]
        else:
            files[f] = [folder]

for k, v in files.items():
    if len(v) > 1:
        print(f'{k:<60}{len(v):<3}{v}')
Posts: 6,779
Threads: 20
Joined: Feb 2020
Dec-27-2022, 04:47 PM
(This post was last modified: Dec-27-2022, 04:49 PM by deanhystad.)
defaultdict can clean this part up.
for f in os.listdir(folder):
    if f in files:
        files[f] = files[f] + [folder]
    else:
        files[f] = [folder]
import collections

# Using a defaultdict to make the list as needed
files = collections.defaultdict(list)
for f in os.listdir(folder):
    files[f].append(folder)
files[f].append(folder) is about 10x faster than files[f] = files[f] + [folder], probably because you aren't disposing of the old list each time you append an item.
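A rough way to check that timing claim yourself (just a sketch; the exact factor will depend on list size and machine):
import timeit

def rebuild(n=1000):
    folders = []
    for _ in range(n):
        folders = folders + ['folder']  # builds a new list every iteration
    return folders

def append(n=1000):
    folders = []
    for _ in range(n):
        folders.append('folder')  # grows the same list in place
    return folders

print('rebuild:', timeit.timeit(rebuild, number=100))
print('append: ', timeit.timeit(append, number=100))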