takedouble: duplicate file finder

[ bsd3, library, program, utilities ]

takedouble is a fast duplicate file finder. It filters candidates by file size, then by their first and last 4k chunks, and only compares the full contents of files that pass both filters.

Modules

Takedouble

Downloads

takedouble-0.0.2.0.tar.gz [browse] (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

ShaeErisson

Versions [RSS]	0.0.1.1, 0.0.2.0
Change log	CHANGELOG.md
Dependencies	base (>=4.11 && <5), bytestring, directory, extra, filepath, filepattern, takedouble, unix [details]
License	BSD-3-Clause
Copyright	Shae Erisson
Author	Shae Erisson
Maintainer	Shae Erisson
Category	Utilities
Home page	https://github.com/shapr/takedouble
Source repo	head: git clone https://github.com/shapr/takedouble.git
Uploaded	by ShaeErisson at 2022-06-26T17:43:49Z
Distributions	NixOS:0.0.2.0
Executables	takedouble
Downloads	134 total (5 in the last 30 days)
Status	Docs available [build log] Last success reported on 2022-06-26 [all 1 reports]

Readme for takedouble-0.0.2.0

takedouble

TakeDouble is a duplicate file finder. It first compares file size, then the first 4k and last 4k of each file, and only reads the full contents of files that still match, to confirm that they are duplicates.
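
The staged filtering described above can be sketched in Haskell roughly as follows. This is a minimal illustration and not takedouble's actual source: the names Fingerprint, fingerprint, groupsOf and findDuplicates are hypothetical, the 4096-byte chunk size is an assumption based on the "4k" in the description, and the sketch reads candidate files fully into memory for the final comparison, which a real tool may well avoid.

```haskell
import Control.Monad (forM)
import qualified Data.ByteString as BS
import Data.List (sort)
import qualified Data.Map.Strict as Map
import System.IO (IOMode (ReadMode), SeekMode (SeekFromEnd), hFileSize, hSeek, withBinaryFile)

-- | Cheap fingerprint (hypothetical type): size plus first and last chunk.
data Fingerprint = Fingerprint
  { fpSize :: Integer
  , fpHead :: BS.ByteString
  , fpTail :: BS.ByteString
  } deriving (Eq, Ord, Show)

-- | Assumed chunk size of 4096 bytes ("4k").
chunkSize :: Int
chunkSize = 4096

-- | Read the size, the first chunk and the last chunk of a file.
fingerprint :: FilePath -> IO Fingerprint
fingerprint path = withBinaryFile path ReadMode $ \h -> do
  size <- hFileSize h
  firstChunk <- BS.hGet h chunkSize
  -- Seek to the start of the last chunk, or to the start of a shorter file.
  hSeek h SeekFromEnd (negate (min size (fromIntegral chunkSize)))
  lastChunk <- BS.hGet h chunkSize
  pure (Fingerprint size firstChunk lastChunk)

-- | Group paths by key and keep only groups with more than one member.
groupsOf :: Ord k => [(k, FilePath)] -> [[FilePath]]
groupsOf pairs =
  map sort
    . filter ((> 1) . length)
    . Map.elems
    $ Map.fromListWith (++) [(k, [p]) | (k, p) <- pairs]

-- | Stage 1: group by fingerprint. Stage 2: confirm by full contents.
findDuplicates :: [FilePath] -> IO [[FilePath]]
findDuplicates paths = do
  prints <- forM paths $ \p -> do
    fp <- fingerprint p
    pure (fp, p)
  confirmed <- forM (groupsOf prints) $ \candidates -> do
    -- Simplification: whole files are loaded strictly for comparison.
    contents <- forM candidates $ \p -> do
      c <- BS.readFile p
      pure (c, p)
    pure (groupsOf contents)
  pure (concat confirmed)
```

Loaded in GHCi, findDuplicates takes a list of file paths and returns the groups of paths whose contents are byte-for-byte identical. The point of the cheap first stage is that most non-duplicates are ruled out after reading at most 8k per file.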

How do I make it go?

You can use nix or cabal to build this.

Running cabal build should produce a binary. Use ghcup to install cabal and the latest GHC version.

After that, run takedouble <dirname>. For example, takedouble ~/ checks your home directory.

If there are common files you'd like to exclude (such as .git directories), you can pass a glob as a second argument to exclude any matching paths from the output.

For example:

takedouble <dirname> "**/.git/**"
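
To give a feel for how such a glob behaves, here is a small Haskell sketch using the filepattern library, which appears in the package's dependency list. That takedouble applies the glob in exactly this way is an assumption, and excludeMatching is a hypothetical helper, not part of takedouble.

```haskell
import System.FilePattern (FilePattern, (?==))

-- | Hypothetical helper: drop every path that matches the exclusion glob.
-- In filepattern, a "**" component matches any number of path components.
excludeMatching :: FilePattern -> [FilePath] -> [FilePath]
excludeMatching pat = filter (not . (pat ?==))
```

With the pattern "**/.git/**", a path such as proj/.git/config is dropped while proj/src/Main.hs is kept.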

Is it Fast?

On my ThinkPad with six Xeon cores, 128GB RAM, and a 1TB Samsung 970 Pro NVMe (via PCIe 3.0), I can check 34393 uncached files in 6.4 seconds. A second run on the same directory takes 2.8 seconds because the file metadata is then cached in memory.