Python Forum
a future project for some day
#1
i did implement this in C but want to add features i didn't have time for, so i set it aside. soon, i want to re-do the whole thing in Python.

this program scans (walks) through a file tree, keeping track of regular files as it goes, particularly their inodes, sizes, and times. when it encounters a file with the same size as an earlier file but a different inode (and a different time, unless opted not to consider this), it will read the files and calculate a checksum (probably a strong one; the algorithm can be specified). if the checksum for a file has previously been calculated, then the saved copy will be used. if 2 files (different inodes) are found to have the same checksum (and maybe also the same date/time, definitely also the same size and same filesystem), they will be assumed to be identical and will be hard linked together. the goal is to avoid wasting space on duplicates in subtrees where hardlinking doesn't matter.
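A rough Python sketch of the scan described above, not the C version's actual design. The names `file_digest` and `hardlink_duplicates`, the `same_mtime` switch, the SHA-256 default, and the `.hardlink-tmp` suffix are all assumptions made for illustration. Files are grouped by device and size first, so only same-sized files on one filesystem ever get read.

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, algo="sha256", chunk=1 << 20):
    """Checksum a whole file, reading it in chunks."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def hardlink_duplicates(root, algo="sha256", same_mtime=False):
    """Hard link identical regular files under root; return the link count."""
    # (device, size[, mtime]) -> {inode: first path seen for that inode}
    by_key = defaultdict(dict)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, devices, sockets, ...
            key = (st.st_dev, st.st_size)
            if same_mtime:
                key += (st.st_mtime_ns,)
            by_key[key].setdefault(st.st_ino, path)

    linked = 0
    for inodes in by_key.values():
        if len(inodes) < 2:
            continue  # a unique size never needs a checksum
        keeper_by_digest = {}
        for path in inodes.values():  # each inode is read at most once
            digest = file_digest(path, algo)
            keeper = keeper_by_digest.get(digest)
            if keeper is None:
                keeper_by_digest[digest] = path
                continue
            # Link under a temporary name, then rename over the duplicate,
            # so the path never disappears if something fails midway.
            tmp = path + ".hardlink-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            linked += 1
    return linked
```

Only one path per inode is tracked here, so other names already linked to a replaced inode keep the old copy; a fuller version would remember every path for each inode. Empty files also all end up linked together, which may or may not be wanted.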
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
Your problem has probably been solved many times in the past by various different methods by members of the UNIX community. If you don't have time for a Python rewrite you might want to look at fdupes:
https://en.wikipedia.org/wiki/Fdupes
https://github.com/adrianlopezroche/fdupes

Lewis

Additional possibilities:
fslint: http://www.pixelbeat.org/fslint/
dupeguru (written in Python 3): https://dupeguru.voltaicideas.net/
To paraphrase: 'Throw out your dead' code. https://www.youtube.com/watch?v=grbSQ6O6kbs Forward to 1:00
#3
(Apr-22-2018, 01:49 PM)ljmetzger Wrote: you might want to look at fdupes:
Note that fdupes is an official package in the Ubuntu distribution of Linux.
#4
yeah, i know this has been solved many times before ... maybe even before i first did it in C. i will have a look at fdupes. is fdupes written in Python? or C? if neither of those then i don't want to use it (especially not if any part is in Perl).
#5
fdupes comes as a binary, so probably C or C++. it has no option to hardlink duplicates though it does have an option to delete duplicates. not what i would want it to do.

so my project is back on.
#6
(Apr-23-2018, 06:14 AM)Skaperen Wrote: it has no option to hardlink duplicates though it does have an option to delete duplicates
You could use it as an external command to locate the duplicates, it is quite fast.
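One way this suggestion might look from Python, as a sketch only. It assumes `fdupes` is on the PATH, that `-r` makes it recurse, and that its default output is one path per line with a blank line after each group. The helper names `parse_fdupes_output` and `find_duplicates` are made up for this example.

```python
import subprocess


def parse_fdupes_output(text):
    """Split fdupes output into groups: a list of lists of paths."""
    groups, current = [], []
    for line in text.split("\n"):
        if line:
            current.append(line)
        elif current:
            # a blank line closes the current group
            groups.append(current)
            current = []
    if current:  # output that did not end with a blank line
        groups.append(current)
    return groups


def find_duplicates(root):
    """Run fdupes recursively on root and return its duplicate groups."""
    result = subprocess.run(
        ["fdupes", "-r", root],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_fdupes_output(result.stdout)
```

Because the command is passed as an argument list, no shell is involved and directory names with spaces or parentheses need no quoting. File names that themselves contain a newline cannot be told apart in this line-based output.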
#7
(Apr-23-2018, 06:39 AM)Gribouillis Wrote:
(Apr-23-2018, 06:14 AM)Skaperen Wrote: it has no option to hardlink duplicates though it does have an option to delete duplicates
You could use it as an external command to locate the duplicates, it is quite fast.

i still am inclined to make my own. i can boost the speed by doing only a partial checksum to begin with. the vast majority of files that are not identical have some difference very early on. also, i have had cases where i did not want to consider files to be identical (for hardlinking) unless they also have identical timestamps (i had this option on my C version).

if the first N bytes are not the same, no further checksumming is needed unless another file of the same size has the first N bytes the same (in which case i increase N and retry).
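The grow-N-and-retry idea could be sketched like this. The names `prefix_digest` and `identical_groups`, the starting N of 4096, the doubling step, and the use of SHA-256 are all assumptions for illustration. The function takes files already known to have the same size, and a group only counts as identical once N has grown to cover the whole file.

```python
import hashlib
from collections import defaultdict


def prefix_digest(path, n):
    """Checksum of the first n bytes of a file (reads them into memory)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(n)).digest()


def identical_groups(paths, size, n=4096):
    """Given same-sized files, return groups whose full contents match."""
    pending = [list(paths)]
    done = []
    while pending:
        next_round = []
        for group in pending:
            buckets = defaultdict(list)
            for p in group:
                buckets[prefix_digest(p, n)].append(p)
            for bucket in buckets.values():
                if len(bucket) < 2:
                    continue  # differs from everything else: stop here
                if n >= size:
                    done.append(bucket)  # the prefix was the whole file
                else:
                    next_round.append(bucket)  # still tied: retry, larger N
        pending = next_round
        n *= 2
    return done
```

Each round re-reads the prefix from the start, so a file that stays tied to the end is read roughly twice in total, while a file that differs early is dropped after only a few kilobytes. A real version would hash large prefixes in chunks instead of loading them whole.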

have you ever used fdupes? i'm playing around with it right now on a cloud instance that has a backup copy of my music collection, which has a lot of duplicate paths.
#8
this is not as simple and straightforward as i had hoped. i am going to have to write some Python code for fdupes. many of my music file names cannot be passed to commands by shell without quoting (because of characters like parentheses in the name) and fdupes offers no means to separately quote each name.

so i'll try this:
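# -*- coding: utf-8 -*-
"""Hard link each group of duplicates listed by fdupes (read from stdin)."""
from __future__ import print_function
from subprocess import call
from sys import stderr, stdin

# do not use the -1 option on fdupes: this expects one name per line,
# with a blank line ending each group.


def link_group(flist):
    """Hard link every name in flist to the first one, showing ls output."""
    if len(flist) < 2:
        print('flist =', repr(flist), file=stderr)
        raise Exception('fewer than 2 names in a group')
    a = flist[0]
    # argument lists bypass the shell, so names need no quoting;
    # '--' keeps names that start with '-' from being taken as options.
    call(['ls', '-dil', '--', a])
    for n in flist[1:]:
        call(['ln', '-fv', '--', a, n])
        call(['ls', '-dil', '--', n])


flist = []
for line in stdin:
    line = line.rstrip('\r\n')  # strip only the line ending
    if line:
        flist.append(line)
    else:
        link_group(flist)
        flist = []
if flist:  # last group, if the input did not end with a blank line
    link_group(flist)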
piping the output of fdupes to it.

nice ... this way worked.