# cuckoo-filter [![Hackage](https://img.shields.io/badge/Hackage-0.2.0.0-blue.svg)](https://hackage.haskell.org/package/cuckoo-filter)[![Build Status](https://travis-ci.org/ChrisCoffey/cuckoo-filter.svg?branch=master)](https://travis-ci.org/ChrisCoffey/cuckoo-filter) Cuckoo filters are a probabilistic data structure used to answer questions like "Have I already seen this user" or "Is this word in the English language?". They're _probabilistic_ because each membership operation has a false positive probability. It guarnatees that there will never be a false negative, but may have a low chance of false positives. Bloom filters are the cannonical probabilistic filter structure, and cuckoo filters are a simlar but different tool. As a bloom filter's load factor increases, the chance of false positive trends towards 100%, but the inserts will never fail. On the other hand, a Cuckoo filter retains a relatively stable false positive probability under load, but as load approahes 95% inserts will begin to fail. In either case you probably want to resize your filter... This implementation has the following properties: - Buckets of 4 elements - 8 bit fingerprints - Cycle termination during item kicking occurs after (0.1 * size) buckets have been checked. - Size may be any non-zero natural number (not limited to powers of 2) For more details about how Cuckoo filters work, I recommend you read Fan et. al.'s 2016 paper https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf. ### Usage Cuckoo filters support three operations: `insert`, `member`, and `delete`. See the [haddocks](https://hackage.haskell.org/package/cuckoo-filter) for details. ### Performance As you'll find in the criterion results, the pure version of the filter can handle ~1.6 million insertions/s. From memory profiles, the vast majority of the memory is taken up by the underlying implementation of `Filter`, so this is an obvious area for improvement. The current implementation avoids pre-allocating memory for the filter, so the heap usage will incrase linearly with `insert` calls. This obviously helps keep heap usage low for sparse filters, but also means inserts are slower than they would be in a mutable implementation. #### Loading a SpellChecker test The following test was run on a laptop, so the absolute numbers are going to vary a ton. The important thing is the relationship between the pure & immutable filter implementations. The test consists of: 1. Load the `/usr/share/dict/words` file into memory 2. Create a filter containing all of the words 3. Lookup each word in the filter Pure ``` 500000 cells 235886 words 0.078749ss to count words 0.933969ss to construct filter 745 insert failures 0.80465ss to query every element ``` Mutable ``` 500000 cells 235886 words 0.082926ss to count words 0.29735ss to construct filter 582 insert failures 0.52605ss to query every element ``` Incredibly unscientific comparison to `bloom-filter` using a vanilla filter ``` 235886 words 0.087499ss to count words Bloom { 4194304 bits } 0.464982ss to construct filter 0.506902ss to query every element ``` *** Cuckoo Filters report the number of failures, while the Bloom Filter reports how many bits it contains. I'll start capturing size for the mutable Cuckoo Filter soon.