Data.HashTable.ST.Linear

Description

An implementation of linear hash tables. (See http://en.wikipedia.org/wiki/Linear_hashing). Use this hash table if you...

care a lot about fitting your data set into memory; of the hash tables included in this collection, this one has the lowest space overhead
don't care that inserts and lookups are slower than the other hash table implementations in this collection (this one is slightly faster than Data.HashTable from the base library in most cases)
have a soft real-time or interactive application for which the risk of introducing a long pause on insert while all of the keys are rehashed is unacceptable.

Details:

Linear hashing allows for the expansion of the hash table one slot at a time, by moving a "split" pointer across an array of pointers to buckets. The number of buckets is always a power of two, and the bucket to look in is defined as:

bucket(level,key) = hash(key) mod (2^level)

The "split pointer" controls the expansion of the hash table. If the hash table is at level k (i.e. 2^k buckets have been allocated), we first calculate b=bucket(level-1,key). If b < splitptr, the destination bucket is calculated as b'=bucket(level,key), otherwise the original value b is used.

The split pointer is incremented once an insert causes some bucket to become fuller than some predetermined threshold; the bucket at the split pointer (*not* the bucket which triggered the split!) is then rehashed, and half of its keys can be expected to be rehashed into the upper half of the table.

When the split pointer reaches the middle of the bucket array, the size of the bucket array is doubled, the level increases, and the split pointer is reset to zero.

Linear hashing, although not quite as fast for inserts or lookups as the implementation of linear probing included in this package, is well suited for interactive applications because it has much better worst case behaviour on inserts. Other hash table implementations can suffer from long pauses, because it is occasionally necessary to rehash all of the keys when the table grows. Linear hashing, on the other hand, only ever rehashes a bounded (effectively constant) number of keys when an insert forces a bucket split.

Space overhead: experimental results

In randomized testing (see test/compute-overhead/ComputeOverhead.hs in the source distribution), mean overhead is approximately 1.51 machine words per key-value mapping with a very low standard deviation of about 0.06 words, 1.60 words per mapping at the 95th percentile.

Unsafe tricks

Then the unsafe-tricks flag is on when this package is built (and it is on by default), we use some unsafe tricks (namely unsafeCoerce# and reallyUnsafePtrEquality#) to save indirections in this table. These techniques rely on assumptions about the behaviour of the GHC runtime system and, although they've been tested and should be safe under normal conditions, are slightly dangerous. Caveat emptor. In particular, these techniques are incompatible with HPC code coverage reports.

References:

W. Litwin. Linear hashing: a new tool for file and table addressing. In Proc. 6th International Conference on Very Large Data Bases, Volume 6, pp. 212-223, 1980.
P-A. Larson. Dynamic hash tables. Communications of the ACM 31: 446-457, 1988.

Synopsis