cuckoo-filter: Pure and impure Cuckoo Filter

Please see the README on Github at https://github.com/ChrisCoffey/cuckoo-filter#readme

Modules

Data
- Data.CuckooFilter

Downloads

cuckoo-filter-0.2.0.1.tar.gz [browse] (Cabal source package)
Package description (as included in the package)

Versions [RSS]	0.1.0.0, 0.1.0.1, 0.1.0.2, 0.2.0.1, 0.2.0.2
Change log	ChangeLog.md
Dependencies	aeson, array, base (>=4.7 && <5), bytestring, cereal, containers, criterion, cuckoo-filter, hashable, random, time [details]
License	MIT
Copyright	2018 Chris Coffey
Author	Chris Coffey
Maintainer	chris@foldl.io
Category	Data
Home page	https://github.com/ChrisCoffey/cuckoo-filter#readme
Bug tracker	https://github.com/ChrisCoffey/cuckoo-filter/issues
Source repo	head: git clone https://github.com/ChrisCoffey/cuckoo-filter
Uploaded	by ChrisCoffey at 2018-11-09T03:47:21Z
Distributions
Executables	benchmarks
Downloads	2888 total (13 in the last 30 days)
Status	Docs available [build log] Last success reported on 2018-11-09 [all 1 reports]

Readme for cuckoo-filter-0.2.0.1

cuckoo-filter

Cuckoo filters are a probabilistic data structure used to answer questions like "Have I already seen this user?" or "Is this word in the English language?". They're probabilistic because each membership query has some probability of returning a false positive. A cuckoo filter guarantees that there will never be a false negative, but it has a small chance of reporting a false positive.

Bloom filters are the canonical probabilistic filter structure, and cuckoo filters are a similar but different tool. As a Bloom filter's load factor increases, the chance of a false positive trends towards 100%, but inserts never fail. A cuckoo filter, on the other hand, retains a relatively stable false positive probability under load, but as load approaches 95% inserts begin to fail. In either case you probably want to resize your filter...

This implementation has the following properties:

- Buckets of 4 elements
- 8-bit fingerprints
- Cycle termination during item kicking occurs after (0.1 * size) buckets have been checked.
- Size may be any non-zero natural number (not limited to powers of 2)
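
As a rough worked example of what these parameters imply, the analysis in the Fan et al. paper bounds the false positive probability of a lookup that probes two buckets of b entries holding f-bit fingerprints. Assuming uniformly distributed fingerprints and two full candidate buckets (the paper's worst-case assumptions, not a measurement of this library), b = 4 and f = 8 give:

```latex
\epsilon \;\le\; 1 - \left(1 - \frac{1}{2^{f}}\right)^{2b}
         \;\approx\; \frac{2b}{2^{f}}
         \;=\; \frac{2 \cdot 4}{2^{8}}
         \;=\; \frac{8}{256}
         \;\approx\; 3.1\%
```

Under those assumptions, roughly 3 in 100 lookups of absent items may be reported as present once the filter is heavily loaded; a sparsely loaded filter should do better.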

For more details about how Cuckoo filters work, I recommend you read Fan et al.'s 2014 paper https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf.

Usage

Cuckoo filters support three operations: insert, member, and delete. See the haddocks for details.
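
The haddocks remain the reference for the real API. To illustrate what the three operations do, here is a minimal, self-contained model that needs only base and containers. Everything in it is hypothetical (the `Toy` type, the `fnv1a` stand-in hash, `locate`, `altIndex`, `place`, and the evict-the-oldest policy) and it does not mirror the signatures or internals of `Data.CuckooFilter`. Only the bucket size of 4, the 8-bit fingerprints, and the 0.1 * size kick budget are taken from the properties listed above.

```haskell
import           Data.Bits       (shiftR, xor)
import           Data.Char       (ord)
import qualified Data.List       as L
import qualified Data.Map.Strict as M
import           Data.Word       (Word32, Word8)

-- Hypothetical toy filter; these are NOT the Data.CuckooFilter types.
data Toy = Toy
  { toySize    :: Int                -- number of buckets, any positive value
  , toyBuckets :: M.Map Int [Word8]  -- sparse: empty buckets are not stored
  }

emptyToy :: Int -> Toy
emptyToy n = Toy n M.empty

bucketSize :: Int
bucketSize = 4

-- FNV-1a-style hash over code points (an assumed stand-in hash).
fnv1a :: String -> Word32
fnv1a = L.foldl' step 2166136261
  where step h c = (h `xor` fromIntegral (ord c)) * 16777619

-- 8-bit fingerprint (top byte of the hash) and primary bucket index.
locate :: Toy -> String -> (Word8, Int)
locate t x = (fromIntegral (h `shiftR` 24), fromIntegral h `mod` toySize t)
  where h = fnv1a x

-- Alternate bucket of a fingerprint currently assigned to bucket i.
-- Applying it twice returns i for any table size n > 0, so no power of 2
-- is required. This is an assumed scheme, not necessarily the library's.
altIndex :: Int -> Word8 -> Int -> Int
altIndex n fp i = (fromIntegral (fnv1a (show fp)) - i) `mod` n

bucketAt :: Int -> Toy -> [Word8]
bucketAt i = M.findWithDefault [] i . toyBuckets

setBucket :: Int -> [Word8] -> Toy -> Toy
setBucket i [] t = t { toyBuckets = M.delete i (toyBuckets t) }
setBucket i b  t = t { toyBuckets = M.insert i b (toyBuckets t) }

-- Put fp in bucket i. If the bucket is full, evict its oldest entry to that
-- entry's alternate bucket, for as long as the kick budget lasts.
place :: Int -> Int -> Word8 -> Toy -> Maybe Toy
place kicks i fp t = case bucketAt i t of
    b | length b < bucketSize -> Just (setBucket i (b ++ [fp]) t)
    (victim : rest) | kicks > 0 ->
      place (kicks - 1) (altIndex (toySize t) victim i) victim
            (setBucket i (rest ++ [fp]) t)
    _ -> Nothing

-- Try the primary bucket first, then the alternate bucket with kicking.
insert :: String -> Toy -> Maybe Toy
insert x t
  | length (bucketAt i1 t) < bucketSize = place 0 i1 fp t
  | otherwise = place (toySize t `div` 10) (altIndex (toySize t) fp i1) fp t
  where (fp, i1) = locate t x

-- An item may be present if its fingerprint is in either candidate bucket.
member :: String -> Toy -> Bool
member x t = any (elem fp . (`bucketAt` t)) [i1, altIndex (toySize t) fp i1]
  where (fp, i1) = locate t x

-- Remove one copy of the fingerprint from the first bucket that holds it.
delete :: String -> Toy -> Toy
delete x t =
    case filter (elem fp . (`bucketAt` t)) [i1, altIndex (toySize t) fp i1] of
      (i : _) -> setBucket i (L.delete fp (bucketAt i t)) t
      []      -> t
  where (fp, i1) = locate t x
```

For instance, `insert "apple" (emptyToy 100)` yields `Just` a filter for which `member "apple"` is `True`, and deleting `"apple"` from it leaves an empty map. In this sketch a failed insert returns `Nothing` and the caller still holds the original, unmodified filter value, which is one convenience of a pure formulation.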

Performance

As you'll find in the criterion results, the pure version of the filter can handle ~1.6 million insertions/s. From memory profiles, the vast majority of the memory is taken up by the underlying implementation of Filter, so this is an obvious area for improvement.

The current implementation avoids pre-allocating memory for the filter, so the heap usage will increase linearly with insert calls. This helps keep heap usage low for sparse filters, but also means inserts are slower than they would be in a mutable implementation.

Loading a SpellChecker test

The following test was run on a laptop, so the absolute numbers are going to vary a ton. The important thing is the relationship between the pure & mutable filter implementations.

The test consists of:

- Load the /usr/share/dict/words file into memory
- Create a filter containing all of the words
- Lookup each word in the filter
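
A harness along these lines could be sketched as follows. This is not the benchmark code that produced the numbers below: `load`, `countHits`, `timed`, and `runOn` are hypothetical names, the insert and member operations are passed in as plain functions so that any filter can be plugged in, and CPU time from `System.CPUTime` stands in for whatever clock the original test used.

```haskell
import Control.Exception (evaluate)
import Data.List         (foldl')
import System.CPUTime    (getCPUTime)

-- Insert every word, returning the final filter and the number of failed
-- inserts. `ins` is a placeholder for a real insert; Nothing means failure.
load :: (String -> s -> Maybe s) -> s -> [String] -> (s, Int)
load ins empty = foldl' step (empty, 0)
  where
    step (f, n) w = case ins w f of
      Just f' -> (f', n)
      Nothing -> let n' = n + 1 in n' `seq` (f, n')

-- Count how many words the filter reports as present.
countHits :: (String -> s -> Bool) -> s -> [String] -> Int
countHits mem f = length . filter (`mem` f)

-- Run an action and print the CPU seconds it took (getCPUTime reports
-- picoseconds). Results are only forced as far as the action forces them.
timed :: String -> IO a -> IO a
timed label act = do
  start <- getCPUTime
  x     <- act
  end   <- getCPUTime
  putStrLn (show (fromIntegral (end - start) / 1e12 :: Double) ++ "s " ++ label)
  pure x

-- The three steps of the test, for any filter-like structure.
-- Returns (insert failures, successful lookups).
runOn :: (String -> s -> Maybe s) -> (String -> s -> Bool) -> s
      -> FilePath -> IO (Int, Int)
runOn ins mem empty path = do
  ws <- lines <$> readFile path
  n  <- timed "to count words" (evaluate (length ws))
  putStrLn (show n ++ " words")
  (f, failed) <- timed "to construct filter" (evaluate (load ins empty ws))
  putStrLn (show failed ++ " insert failures")
  found <- timed "to query every element" (evaluate (countHits mem f ws))
  pure (failed, found)
```

With a plain list standing in for a filter, `runOn (\w xs -> Just (w : xs)) elem [] "/usr/share/dict/words"` exercises the same three steps; swapping in a real filter only requires supplying its insert and member functions in this shape.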

Pure

500000 cells
235886 words
0.078749s to count words
0.933969s to construct filter
745 insert failures
0.80465s to query every element

Mutable

500000 cells
235886 words
0.082926s to count words
0.29735s to construct filter
582 insert failures
0.52605s to query every element

Incredibly unscientific comparison to bloom-filter using a vanilla filter

235886 words
0.087499s to count words
Bloom { 4194304 bits }
0.464982s to construct filter
0.506902s to query every element

*** Cuckoo Filters report the number of failures, while the Bloom Filter reports how many bits it contains. I'll start capturing size for the mutable Cuckoo Filter soon.