SAMEFILE(1) IOCCC SAMEFILE(1)
User Manuals User Manuals
OCTOBER 1997
NAME
samefile - find identical files
SYNOPSIS
samefile
DESCRIPTION
samefile reads a list of file names (one file name per line) from
stdin. For each pair of files with identical contents, it outputs a
line consisting of six tab-separated fields: the size in bytes, the
two file names, the character ``='' if the two files are on the same
device, ``X'' otherwise, and the link counts of the two files. The
output is sorted in reverse order by size.
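As an illustration of the record layout described above, the following
Python sketch splits one output line into its six fields. The function
name parse_line and the dictionary keys are hypothetical and are not
part of samefile; the sketch also assumes that file names contain no
tab characters.

```python
def parse_line(line):
    """Split one samefile output line into its six tab-separated fields."""
    size, name1, name2, dev, links1, links2 = line.rstrip("\n").split("\t")
    return {
        "size": int(size),                    # size in bytes
        "names": (name1, name2),              # the two identical files
        "same_device": dev == "=",            # "X" means different devices
        "links": (int(links1), int(links2)),  # link counts of both files
    }


record = parse_line("100\tfile1\tfile4\t=\t3\t2\n")
```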
samefile uses two stages to give optimum performance.
In the first stage, all non-plain files (directories, devices, FIFOs,
sockets, symbolic links) are silently ignored, as are files for which
stat(2) or open(2) fails. The result of the first stage (the file
names) is written into a binary tree with one node for every file
size. Each node is in turn a linked list of file names along with
inode and device information. Hard links are also checked for at this
stage: for any inode, only one file name is added to the binary tree.
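The first stage can be sketched in Python roughly as follows. This is
an illustration under stated assumptions, not samefile's actual code:
a dictionary keyed by size stands in for the binary tree, a list stands
in for each linked list, and the name first_stage is hypothetical.

```python
import os
import stat
from collections import defaultdict


def first_stage(names):
    """Group plain files by size, keeping one name per (device, inode)."""
    by_size = defaultdict(list)  # stands in for the binary tree of sizes
    seen = set()                 # (device, inode) pairs already recorded
    for name in names:
        try:
            st = os.lstat(name)  # lstat, so symbolic links are not followed
            if not stat.S_ISREG(st.st_mode):
                continue         # directory, device, FIFO, socket, symlink
            with open(name, "rb"):
                pass             # files that cannot be opened are skipped
        except OSError:
            continue             # stat or open failed: silently ignore
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue             # hard link to a file already recorded
        seen.add(key)
        by_size[st.st_size].append((name, st.st_dev, st.st_ino, st.st_nlink))
    return by_size
```

Only sizes that end up with two or more entries need any further work,
since files of different sizes cannot have identical contents.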
In the second stage, all files having the same size are compared
against each other. Equality is transitive, and samefile exploits this
to reduce work and output noise: if files A, B, and C have the same
size and samefile finds that A = B and A = C, then it will not compare
B against C (and will not output a line for B and C); it reports only
A = B and A = C. The algorithm detects equality across arbitrarily long
chains. Note, however, that because only the first file name per inode
survives the first stage, the output for a group of identical files
spread over several inodes is also minimized. Suppose you have six
identical files of size 100 in an inode group consisting of the three
inodes with numbers 10, 20 and 30:
$ ls -i # output edited for readability:
10 file1 20 file4 30 file6
10 file2 20 file5
10 file3
$ ls | samefile
100 file1 file4 = 3 2
100 file1 file6 = 3 1
The sum of the sizes in the first column is the amount of disk space
you could gain by making all six files links to a single file, or by
removing all but one of the files. To be precise, disk space is
allocated in blocks - you will probably gain two blocks here, rather
than 200
bytes. Note that it is not enough to just remove file4 and file6: you
would gain only 100 bytes, because file5 still exists.
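The second-stage rule (report A = B and A = C, but never compare B
against C) can be sketched in Python as follows. This is one way to
obtain the behaviour described, not necessarily how samefile itself is
written. The name second_stage and its equal parameter are
hypothetical; by default the sketch compares contents with
filecmp.cmp from the Python standard library.

```python
import filecmp


def second_stage(names, equal=None):
    """Yield (a, b) pairs of identical files among names of one size."""
    if equal is None:
        # Compare file contents byte by byte, not just stat information.
        equal = lambda a, b: filecmp.cmp(a, b, shallow=False)
    remaining = list(names)
    while remaining:
        a = remaining.pop(0)      # representative of a potential group
        unmatched = []
        for b in remaining:
            if equal(a, b):
                yield (a, b)      # b joins a's group and is never compared again
            else:
                unmatched.append(b)
        remaining = unmatched     # only files that differ from a go on
```

Applied to the example above, the names that survive the first stage
are file1, file4 and file6, and the sketch yields (file1, file4) and
(file1, file6) without ever comparing file4 against file6.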
LIMITATIONS
Samefile was written with no limits in mind. The number of input lines
is unlimited. The size of the files is limited only by the virtual
memory available for comparing one pair of files. The only hard limit
is that there should not be more than about 8192 files having the same
size. Experience has shown that there are rarely more than a couple of
dozen files of the same size.
EXAMPLES
For everybody:
What are the duplicates under my home directory?
$ find "$HOME" | samefile
For the sysadmin folks:
Report all duplicate files under /usr larger than 16k:
$ find /usr -size +16384c -a -type f | samefile
For the ftp and WWW admins:
How much space is wasted below our site's /pub directory?
$ find /pub -type f | samefile | awk '{sum += $1} END {print sum}'
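A Python counterpart of the awk pipeline above might look like the
sketch below. Besides the byte total it also counts file system
blocks, following the remark in DESCRIPTION that disk space is
allocated in blocks. The function name wasted and the default block
size of 4096 bytes are assumptions made for illustration; the real
block size depends on the file system.

```python
def wasted(lines, block_size=4096):
    """Return (bytes, blocks) summed over the first column of samefile output."""
    total_bytes = 0
    total_blocks = 0
    for line in lines:
        if not line.strip():
            continue                            # skip empty lines
        size = int(line.split("\t", 1)[0])      # first field: size in bytes
        total_bytes += size
        total_blocks += -(-size // block_size)  # ceiling division
    return total_bytes, total_blocks


sample = [
    "100\tfile1\tfile4\t=\t3\t2\n",
    "100\tfile1\tfile6\t=\t3\t1\n",
]
print(wasted(sample))  # prints (200, 2)
```

The sample lines are the two output lines from the DESCRIPTION example:
200 bytes in total, but two blocks when counted by allocation unit.
The same function could be fed sys.stdin to process real samefile
output.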
EXIT STATUS
If samefile runs out of memory, the exit status is 1; otherwise it is
zero.
SEE ALSO
find(1), ln(1), rm(1)
- 2 - Formatted: November 6, 2008