# cuckoo-filter

[![Hackage](https://img.shields.io/badge/Hackage-0.2.0.0-blue.svg)](https://hackage.haskell.org/package/cuckoo-filter)[![Build Status](https://travis-ci.org/ChrisCoffey/cuckoo-filter.svg?branch=master)](https://travis-ci.org/ChrisCoffey/cuckoo-filter)

Cuckoo filters are a probabilistic data structure used to answer questions like "Have I already seen this user" or "Is this word in the English language?". They're _probabilistic_ because each membership operation has a false positive probability. It guarnatees that there will never be a false negative, but may have a low chance of false positives.

Bloom filters are the cannonical probabilistic filter structure, and cuckoo filters are a simlar but different tool. As a bloom filter's load factor increases, the chance of false positive trends towards 100%, but the inserts will never fail. On the other hand, a Cuckoo filter retains a relatively stable false positive probability under load, but as load approahes 95% inserts will begin to fail. In either case you probably want to resize your filter...

This implementation has the following properties:
- Buckets of 4 elements
- 8 bit fingerprints
- Cycle termination during item kicking occurs after (0.1 * size) buckets have been checked.
- Size may be any non-zero natural number (not limited to powers of 2)

For more details about how Cuckoo filters work, I recommend you read Fan et. al.'s 2016 paper https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf.

### Usage
Cuckoo filters support three operations: `insert`, `member`, and `delete`. See the [haddocks](https://hackage.haskell.org/package/cuckoo-filter) for details.

### Performance
As you'll find in the criterion results, the pure version of the filter can handle ~1.6 million insertions/s. From memory profiles, the vast majority of the memory is taken up by the underlying implementation of `Filter`, so this is an obvious area for improvement.

The current implementation avoids pre-allocating memory for the filter, so the heap usage will incrase linearly with `insert` calls. This obviously helps keep heap usage low for sparse filters, but also means inserts are slower than they would be in a mutable implementation.


#### Loading a SpellChecker test
The following test was run on a laptop, so the absolute numbers are going to vary a ton. The important thing is the relationship between the pure & immutable filter implementations.

The test consists of:
1. Load the `/usr/share/dict/words` file into memory
2. Create a filter containing all of the words
3. Lookup each word in the filter


Pure
```
500000 cells
235886 words
0.078749ss to count words
0.933969ss to construct filter
745 insert failures
0.80465ss to query every element
```

Mutable
```
500000 cells
235886 words
0.082926ss to count words
0.29735ss to construct filter
582 insert failures
0.52605ss to query every element
```

Incredibly unscientific comparison to `bloom-filter` using a vanilla filter
```
235886 words
0.087499ss to count words
Bloom { 4194304 bits }
0.464982ss to construct filter
0.506902ss to query every element
```

*** Cuckoo Filters report the number of failures, while the Bloom Filter reports how many bits it contains. I'll start capturing size for the mutable Cuckoo Filter soon.