An implementation of linear hash tables. (See http://en.wikipedia.org/wiki/Linear_hashing). Use this hash table if you...
- care a lot about fitting your data set into memory; of the hash tables included in this collection, this one has the lowest space overhead
- don't care that inserts and lookups are slower than the other hash table
implementations in this collection (this one is slightly faster than
Data.HashTablefrom the base library in most cases)
- have a soft real-time or interactive application for which the risk of introducing a long pause on insert while all of the keys are rehashed is unacceptable.
Linear hashing allows for the expansion of the hash table one slot at a time, by moving a "split" pointer across an array of pointers to buckets. The number of buckets is always a power of two, and the bucket to look in is defined as:
bucket(level,key) = hash(key) mod (2^level)
The "split pointer" controls the expansion of the hash table. If the hash
table is at level
2^k buckets have been allocated), we first
b < splitptr, the destination bucket is
b'=bucket(level,key), otherwise the original value
b is used.
The split pointer is incremented once an insert causes some bucket to become fuller than some predetermined threshold; the bucket at the split pointer (*not* the bucket which triggered the split!) is then rehashed, and half of its keys can be expected to be rehashed into the upper half of the table.
When the split pointer reaches the middle of the bucket array, the size of the bucket array is doubled, the level increases, and the split pointer is reset to zero.
Linear hashing, although not quite as fast for inserts or lookups as the implementation of linear probing included in this package, is well suited for interactive applications because it has much better worst case behaviour on inserts. Other hash table implementations can suffer from long pauses, because it is occasionally necessary to rehash all of the keys when the table grows. Linear hashing, on the other hand, only ever rehashes a bounded (effectively constant) number of keys when an insert forces a bucket split.
Space overhead: experimental results
In randomized testing (see
test/compute-overhead/ComputeOverhead.hs in the
source distribution), mean overhead is approximately 1.51 machine words per
key-value mapping with a very low standard deviation of about 0.06 words, 1.60
words per mapping at the 95th percentile.
unsafe-tricks flag is on when this package is built (and it is on by
default), we use some unsafe tricks (namely
reallyUnsafePtrEquality#) to save indirections in this table. These
techniques rely on assumptions about the behaviour of the GHC runtime system
and, although they've been tested and should be safe under normal conditions,
are slightly dangerous. Caveat emptor. In particular, these techniques are
incompatible with HPC code coverage reports.
- W. Litwin. Linear hashing: a new tool for file and table addressing. In Proc. 6th International Conference on Very Large Data Bases, Volume 6, pp. 212-223, 1980.
- P-A. Larson. Dynamic hash tables. Communications of the ACM 31: 446-457, 1988.
- data HashTable s k v
- new :: ST s (HashTable s k v)
- newSized :: Int -> ST s (HashTable s k v)
- delete :: (Hashable k, Eq k) => HashTable s k v -> k -> ST s ()
- lookup :: (Eq k, Hashable k) => HashTable s k v -> k -> ST s (Maybe v)
- insert :: (Eq k, Hashable k) => HashTable s k v -> k -> v -> ST s ()
- mapM_ :: ((k, v) -> ST s b) -> HashTable s k v -> ST s ()
- foldM :: (a -> (k, v) -> ST s a) -> a -> HashTable s k v -> ST s a
- computeOverhead :: HashTable s k v -> ST s Double