Checkem
=======

Find duplicate files efficiently, using Perl on Unix-like operating systems,
and maybe other ones too (untested). Requires only modules that have been in
Perl core since 5.7.3 at the latest. On earlier Perls, you will need to install
`Digest`.

Requires at least one directory argument:

    $ checkem .
    $ checkem ~tom ~chantelle
    $ checkem /usr /usr/local

You can install it in `/usr/local/bin` with:

    # make install

You can define a `PREFIX` to install it elsewhere:

    $ make install PREFIX="$HOME"/.local

There's a (presently) very basic test suite:

    $ make test

Q&A
---

### Can I compare sets of files rather than sets of directories?

Sure. This uses [`File::Find`][1] under the hood, which like POSIX
[`find(1)`][2] will still apply tests and actions to its initial arguments even
if they're not directories. This means you could do something like this to just
look for duplicate `.iso` files, provided the expanded argument list doesn't
exceed `ARG_MAX`:

    $ checkem ~/media/*.iso

Or even this, for a `find(1)` that supports the `+` terminator (POSIX):

    $ find ~/media -type f -name \*.iso -exec checkem {} +

### Why is this faster than just hashing every file?

It checks the size of each file first, and only ends up hashing them if they're
the same size but have different devices and/or inode numbers (i.e. they're not
hard links). Hashing is an expensive last resort, and in many situations this
won't end up running a single hash comparison. A simplified sketch of this
approach is included at the end of this README.

### I keep getting `.git` metadata files listed as duplicates.

They're accurate, but you probably don't care. Filter them out by paragraph
block. If you have a POSIX-fearing `awk`, you could do something like this:

    $ checkem /dir | awk -v RS= -v ORS='\n\n' '!index($0,"/.git")'

Or, if you were born after the Unix epoch:

    $ checkem /dir | perl -00 -ne 'print if 0>index $_,"/.git"'

### How could I make it even quicker?

Run it on a fast disk, mostly. For large directories or large files, it will
probably be I/O bound in most circumstances.

If you end up hashing a lot of files because their sizes are the same, and
you're not worried about [SHA-1 technically being broken in practice][3],
switching to SHA-1 is a tiny bit faster:

    $ CHECKEM_ALG=SHA-1 checkem /dir

Realistically, though, this is almost certainly splitting hairs.

Theoretically, you could read only the first *n* bytes of each hash-needing
file and hash those with some suitable inexpensive function *f*, and only
resort to checking the entire file with a safe hash function *g* when those
prefixes match. You'd need to decide on suitable values for *n*, *f*, and *g*
in such a case; it might be useful for very large sets of files that will
almost certainly differ in the first *n* bytes. If there's interest in this at
all, I'll write it in as optional behaviour.

Contributors
------------

* Timothy Goddard (pruby) fixed two bugs.

License
-------

Copyright (c) [Tom Ryder][4]. Distributed under an [MIT License][5].

[1]: https://metacpan.org/pod/File::Find
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/find.html
[3]: https://shattered.io/
[4]: https://sanctum.geek.nz/
[5]: https://www.opensource.org/licenses/MIT
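As promised above, here's a minimal sketch of the size-first strategy from
"Why is this faster than just hashing every file?". It's an illustration only,
not the actual checkem implementation: it takes file paths as arguments rather
than walking directories with `File::Find`, assumes `Digest::SHA` (in Perl
core since 5.9.3) with SHA-256, and skips the error handling a real tool
needs:

    #!/usr/bin/env perl
    # Sketch only: group candidate files by size, drop hard links, and hash
    # the survivors.  Not the real checkem; no recursion, no error handling.
    use strict;
    use warnings;
    use Digest::SHA;

    my %by_size;
    for my $path (@ARGV) {
        next unless -f $path;
        my ($dev, $ino, undef, undef, undef, undef, undef, $size) = stat $path;
        push @{ $by_size{$size} }, { path => $path, id => "$dev:$ino" };
    }

    for my $files (values %by_size) {

        # Hard links share a device and inode number, so only one path per
        # device:inode pair needs considering.
        my %seen;
        my @unique = grep { !$seen{ $_->{id} }++ } @{$files};
        next if @unique < 2;

        # Expensive last resort: hash the remaining same-sized files, then
        # print each set of matching paths as a paragraph block.
        my %by_digest;
        for my $file (@unique) {
            my $digest = Digest::SHA->new('SHA-256')
                ->addfile( $file->{path} )->hexdigest;
            push @{ $by_digest{$digest} }, $file->{path};
        }
        for my $paths (values %by_digest) {
            print join( "\n", @{$paths} ), "\n\n" if @{$paths} > 1;
        }
    }

Saved as a hypothetical `dupes-sketch.pl`, you could run it over a glob of
files much like the `.iso` example above:

    $ perl dupes-sketch.pl ~/media/*.iso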