check for duplicate file paths
#1
i have a list of files that are expected to exist in the host system's name space. i want to check it for duplicate references, including hardlinks. a duplicate may have a different path, perhaps reaching the same file through different symlinks. so my first thought is to get the file system device and inode as a 2-tuple and build a set, checking as i go (or at least comparing the set size to the list length at the end). i may also cross-check between lists so i can act differently depending on whether the dup is in the same list or another one.
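a rough sketch of that 2-tuple idea (dedup_refs is just a name i made up for illustration; it assumes every path in the list actually exists):

import os

def dedup_refs(paths):
    """Return a new list keeping one path per underlying file."""
    seen = set()
    unique = []
    for path in paths:
        st = os.stat(path)                # follows symlinks
        key = (st.st_dev, st.st_ino)      # identifies the file object
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique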

i'm just wondering if anyone has any suggestions or thoughts before i dive into this.
#2
Why not use an existing tool? I have 3 of them in my bookmarks:
  • fdupes, which you can install in Ubuntu with sudo apt install fdupes
  • fslint, a program with a GUI interface.
  • dupeguru, a Python program with a GUI that I thought was unmaintained.
As the last commit is 6 days old, dupeguru actually looks alive and well.
Reply
#3
my goal is to build a list and process it to create a new list having only a single reference to each file (via a dictionary indexed by those 2-tuples, giving a path for each). nothing in the tree is to be modified. i will then do a statistical summary of the tree. that rules out fdupes, which hardlinks identical files. the other 2 are GUI tools, which i quickly rule out. i need a Python function.

nowadays, when i make a command that is more than a quick few-liner, i code the core actions as a function (tests may be separate), a main() function that calls the action function and issues user messages for user errors, and an if __name__ == '__main__': section to call main(). that way i can reuse this work by importing it and calling the function, or calling main() if another project will expand on it. or i can extract the core function and add it to my function collection module (which is getting to be too big).
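something like this skeleton is what i mean (unique_refs and the messages are just placeholders, not final code):

import os
import sys

def unique_refs(paths):
    """Map (st_dev, st_ino) -> first path seen for that file object."""
    table = {}
    for path in paths:
        st = os.stat(path)
        table.setdefault((st.st_dev, st.st_ino), path)
    return table

def main(args):
    try:
        table = unique_refs(args)
    except OSError as e:
        print(f'cannot stat {e.filename}: {e.strerror}', file=sys.stderr)
        return 1
    for path in table.values():
        print(path)
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))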

edit:

a set could be better than a list in certain cases above.
#4
this is not about finding 2 or more files with duplicate content. this is about finding 2 or more references (each a path as a str) in the list (or tuple or set or frozenset) that refer to the same file object even when the strings differ (there are many ways that can happen). the end result, returned or generated, would be a new collection of references that are all unique. there will not be more strings than in the original collection; there could be fewer (duplicate references are not included).
#5
Perhaps you just want more_itertools.unique_everseen(), used with a key function that maps the strings in the list to a pair (filesystem reference, inode).
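For example, a minimal sketch, assuming more-itertools is installed (pip install more-itertools) and every path given on the command line exists:

import os
import sys
from more_itertools import unique_everseen

def file_id(path):
    """Key function: identify the file object behind a path."""
    st = os.stat(path)                    # follows symlinks
    return (st.st_dev, st.st_ino)

# keep the first path seen for each distinct file object
unique_paths = list(unique_everseen(sys.argv[1:], key=file_id))
print(*unique_paths, sep='\n')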

