Python Forum
a future project: hardlink identical files
#1
a project i am looking at doing in Python in the near future is a command script that recurses through a set of file trees it is pointed to, looks for identical files (optionally under select constraints, such as requiring that matched files have exactly the same timestamp), and hardlinks them together to minimize used space.  i've already done this in C so i think it should not be hard in Python.
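
A minimal sketch of the idea, not the C version mentioned above: group files by size, then by content digest, then replace duplicates with hard links. The function names and the `.hardlink-tmp` suffix are made up for illustration, SHA-256 is swapped in for a byte-by-byte comparison, and all trees are assumed to sit on one filesystem (`os.link` fails across devices).

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(roots):
    """Return lists of regular files under roots that share size and digest."""
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)  # lstat: do not follow symlinks
                if stat.S_ISREG(st.st_mode):
                    by_size[st.st_size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups


def hardlink_group(paths):
    """Replace every path after the first with a hard link to the first."""
    keeper = paths[0]
    for path in paths[1:]:
        if os.path.samefile(keeper, path):
            continue  # already the same inode
        tmp = path + ".hardlink-tmp"  # hypothetical temporary name
        os.link(keeper, tmp)          # new name for the keeper's inode
        os.replace(tmp, path)         # rename over the duplicate
```

Linking to a temporary name and renaming it over the duplicate means the path never disappears, even if the script is interrupted between the two calls.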
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
one reason i wouldn't use this is that you don't know which file will be the link and which is the real file.

certainly you can check if a file is a link before you remove it. but if you don't do that every time, you could find yourself deleting the one real file and then the others are no good either.

so what you're really doing is making it so every time you delete a file, you have to check that no links are relying on it. this is always true of links, but by increasing the number of links to "any files that have duplicates" you nearly defeat the purpose of having duplicates in the first place. i suppose in a few instances this would be worthwhile.
#3
(Dec-16-2017, 08:48 PM)ezdev Wrote: one reason i wouldn't use this is that you don't know which file will be the link and which is the real file.

certainly you can check if a file is a link before you remove it. but if you don't do that every time, you could find yourself deleting the one real file and then the others are no good either.

so what you're really doing is making it so every time you delete a file, you have to check that no links are relying on it. this is always true of links, but by increasing the number of links to "any files that have duplicates" you nearly defeat the purpose of having duplicates in the first place. i suppose in a few instances this would be worthwhile.

this is about hard links, not symbolic links.  both paths being linked are real files.  if they are already linked to each other, it would be an optimization to skip them, provided that can be detected in O(1) time without any extra syscalls.  this kind of linking just takes 2 paths (names) that previously referenced (pointed to, or linked to) different inodes, whose contents have been compared and found to be identical, and makes both reference just one of those inodes.
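
One way the "already linked" check might look in Python, assuming the scan has kept the `os.stat_result` it already obtained for each path. Two paths name the same inode when both device and inode number match; `already_linked` is a made-up helper name.

```python
import os


def already_linked(st_a, st_b):
    """True if two os.stat_result objects describe the same inode.

    Compares values already in memory, so it costs no further syscalls.
    """
    return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
```

The standard library's `os.path.samefile(a, b)` makes the same comparison, but it stats both paths again each time it is called.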

usually, hardlinking two identical files won't matter.  but there are some odd cases to watch out for:

1. if checking that metadata is identical (such as the timestamp or the owner) is turned off, the hardlinking step can result in a given file path effectively changing metadata.  one piece of metadata that would be wrong to compare is the reference to the inode, the inode number.

2. similarly, and perhaps more confusing to many, the file paths being linked may already have other links outside the scope this run is looking at.  this all falls under the warning not to attempt this kind of compaction where link relations matter, such as a file designated for in-place changes that other linked paths are expected to see (or, in special cases, not see).

3. the comparison may be given two files where one of them is really a symlink.  this is a metadata check that should never be allowed to be disabled.  if both paths are already symlinks to different files, hardlinking the two symlinks (yes, you can do that in POSIX, BSD, and Linux) might seem right, but can leave a dangling file, depending on how things are referenced in actual usage.
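
A rough sketch of how those three checks could be expressed with `os.lstat`. The function name `link_hazards`, the `seen_counts` argument, and the hazard strings are all made up for illustration; which metadata fields matter is a policy choice, and the four below are just an assumption.

```python
import os
import stat


def link_hazards(path_a, path_b, seen_counts=None):
    """Return a list of reasons why hardlinking the two paths may be unsafe.

    seen_counts is a hypothetical dict mapping (st_dev, st_ino) to the
    number of paths this run has found for that inode (default: 1 each).
    """
    seen_counts = seen_counts or {}
    st_a, st_b = os.lstat(path_a), os.lstat(path_b)
    hazards = []

    # Case 3: never treat a symlink as an ordinary file.
    if stat.S_ISLNK(st_a.st_mode) or stat.S_ISLNK(st_b.st_mode):
        hazards.append("symlink")
        return hazards

    # Case 1: after linking, both paths share one set of metadata.
    for field in ("st_mtime_ns", "st_uid", "st_gid", "st_mode"):
        if getattr(st_a, field) != getattr(st_b, field):
            hazards.append("metadata differs: " + field)

    # Case 2: a link count above what this run saw means some other
    # path, outside the scanned trees, references the same inode.
    for path, st in ((path_a, st_a), (path_b, st_b)):
        seen = seen_counts.get((st.st_dev, st.st_ino), 1)
        if st.st_nlink > seen:
            hazards.append("links outside scan: " + path)
    return hazards
```

An empty list means none of the three cases was detected; a caller could then skip, warn, or link anyway depending on its options.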

so a good version of this program would allow specifying what to do in the odd cases, and detecting cases of trouble.
#4
i suppose then i've always confused the two, and always used symlinks. they suit my purposes, but they do have the drawback i already mentioned.

thanks for clarifying, it was informative and i (obviously) needed the refresher. cheers.
#5
a symlink doesn't operate exactly like the object it references.  for example, you can disable following symlinks and easily discover that it is a symlink and get the string of what the symlink references.  hardlinks are purely symmetrical.  if file "foo" and file "bar" are hardlinked, they now reference the same inode and share the same inode number.  there is no information about which link (reference) existed first.  the inode has a count of how many links to it there are.  data does not get de-allocated until that count falls to zero.  hardlinking directories has many fundamental problems and is generally disallowed.  hardlinking other things generally works OK.
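
A small demonstration of the difference, assuming a POSIX-like system where both `os.link` and `os.symlink` are permitted. The names `foo`, `bar`, and `sym` are just scratch files in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    foo = os.path.join(d, "foo")
    bar = os.path.join(d, "bar")
    sym = os.path.join(d, "sym")

    with open(foo, "w") as f:
        f.write("data\n")
    os.link(foo, bar)     # hard link: a second name for the same inode
    os.symlink(foo, sym)  # symlink: a separate object that stores a path

    same_inode = os.stat(foo).st_ino == os.stat(bar).st_ino
    link_count = os.stat(foo).st_nlink  # names referencing the inode
    is_symlink = os.path.islink(sym)    # detectable without following it
    target = os.readlink(sym)           # the stored path string

    os.remove(foo)                      # drop one name of the inode
    with open(bar) as f:
        survivor = f.read()             # data still reachable via bar
    dangling = not os.path.exists(sym)  # the symlink's target is gone
```

Removing `foo` only removes one name, so the data stays reachable through `bar`, while the symlink, which stored the path to `foo`, is left dangling.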