Portability | portable |
---|---|
Stability | experimental |
Maintainer | nominolo@gmail.com |
Symbols without a central symbol table.
Symbols provide the following efficient operations:
- O(1) equality comparison (in practise)
- O(1) ordering comparison (in practise)
- O(n) creation where N is the size of the symbol descriptor.
Many implementations often have the additional property that each symbol descriptor only exists once in memory. This implementation slightly relaxes this property:
- A symbol descriptor is guaranteed to exists only once in memory if it has been created using the same symbol table. Furthermore, if two symbols created from different symbol tables are compared and their descriptors turn out to be equal, the symbols will share the descriptor after the comparison.
This allows the following additional properties not present in conventional implementations:
- No space leak. The symbol table can be discarded at any time.
- Symbols created using different symbol tables can be compared reliably.
- No global lock. (TODO: Well we might need one in the case of hash-collisions, but a lock-free implementation might be possible.)
Inspired by Richard OKeefe
s message to Erlang's eeps mailing list
http://www.erlang.org/cgi-bin/ezmlm-cgi/5/057, which in turn was
inspired by the Logix implementation of Flat Concurrent Prolog.
- data Symbol a
- intern :: (a -> Word64) -> a -> Symbol a
- internInto :: SymTab s => (a -> Word64) -> s a -> a -> (s a, Symbol a)
- class SymTab s where
- lookupSymbol :: s a -> a -> Maybe (Symbol a)
- insertSymbol :: a -> Symbol a -> s a -> s a
- symbolHash :: Symbol a -> Word64
Symbols
A symbol.
Note that the ordering on a
is not preserved on Symbol a
.
Symbols are ordered by their hashes, and only if the hashes are
equal will the ordering on a
be used. We have:
x == y ==> intern x == intern y let sx = intern x sy = intern y in (sx < sy) == ((symbolHash sy < symbolHash sx) || symbolHash sy == symbolHash sx && x < y)
intern :: (a -> Word64) -> a -> Symbol aSource
Create a new local symbol. For best performance use
internInto
together with a symbol table / map.
internInto :: SymTab s => (a -> Word64) -> s a -> a -> (s a, Symbol a)Source
Insert a symbol into an existing table.
lookupSymbol :: s a -> a -> Maybe (Symbol a)Source
insertSymbol :: a -> Symbol a -> s a -> s aSource
symbolHash :: Symbol a -> Word64Source
Returns the hash of the symbol.
Implementation
Each symbol is represented a mutable pointer to the symbol info and a hash. The symbol info might itself be a pointer to another (equal) symbol info.
When creating a new symbol (without looking it up in a symbol table), we compute its hash and create a new symbol info.
+----+---+ +-------+ A: | 42 | *-----> | "foo" | +----+---+ +-------+
We now know the following:
- If two symbols have the same reference, they are equal. (The
Eq
instance onIORef
s implements observable sharing.) - If two symbols have a different hash, they are different.
If neither of the above is true we either have a hash collision or the two objects are equal but were created using different symbol tables.
Let's consider the latter case:
+----+---+ +-------+ A: | 42 | *-----> | "foo" | +----+---+ +-------+ +----+---+ +-------+ B: | 42 | *-----> | "foo" | +----+---+ +-------+
We follow the symbol pointers and realise that the symbol descriptors are equal. We thus decide for one of them to be the canonical symbol descriptor and update the pointers:
+----+---+ +-------+ A: | 42 | *-----> | "foo" | +----+---+ .-> +-------+ | ^ +----+---+ | +---|---+ B: | 42 | *---' | * | +----+---+ +-------+
We change the other symbol descriptor to be a pointer to the canonical
descriptor, because there may be other pointers to this symbol
descriptor. Otherwise, the old symbol descriptor becomes garbage. We
now have only one "foo"
object left.
We can add a third rule for equality:
- If, following all pointers, two symbol descriptors are the same, then the two symbols are equal.
If this is not the case (e.g., in the case of a hash collision) we
call the compare
function of the symbol descriptor. A good hash
function is therefore important since in the case of a hash collision
we will always have to call the compare
function of the symbol
descriptor.
- * Hash Function
Assuming a good hash function (i.e., the hash is indistinguishable from a randomly generated number) we can use the birthday paradox to calculate the probability of a hash collision:
collision_prob :: Integer -> Integer -> Double collision_prob key_bits items = 1 - exp (fromIntegral (-items * (items - 1)) / fromIntegral (2 * key_space)) where key_space = 2 ^ key_bits :: Integer
E.g., collision_prob 32 50000 == 0.2525...
means that with
32 bit hashes and 50000 symbols, there is a 25 percent chance of a
hash collision.
- * Path Shortening
If symbols from several symbol tables are joined repeatedly, its symbol infos may develop into long chains. For this reason we update all pointers while following them.
That is, given we have the following state:
X+-sym+---+ A+-nfo-+ B+-------+ | 42 | *------->| *------> | "foo" | +----+---' +-----' '-------' Y+-sym+---+ C+-nfo-+ D+-------+ | 42 | *------->| *------> | "foo" | +----+---' +-----' '-------'
after x `compare` y
we have.
X+-sym+---+ B+-------+ A+-nfo-+ | 42 | *-------> | "foo" |<-------* | `----+---' +--> '-------' `-----' | ^ ^------+ Y+-sym+---+ | C+-nfo-+| D+---|---+ | 42 | *----' | *---' | * | `----+---' `-----' '-------'
These references can be updated concurrently and without a lock since their information content does not change. That is, the state
X.-sym+---. A.-nfo-. B.-------. | 42 | *------->| *------> | "foo" | `----+---' `-----' '-------'
Is semantically equivalent to the state:
X.-sym+---. A.-nfo-. B.-------. | 42 | *----. | *------> | "foo" | +----+---+ | +-----+ .-> +-------+ '-----------'
- TODO
- verify thread-safety
- make sure the pointer update code is correct and has no bad cases
- implement IntMap variant/wrapper that respects that two different objects may have the same hash (however unlikely).