simple-atom-0.1.0.1: Atom (or symbol) datatype for fast comparision and sorting.

Portabilityportable
Stabilityexperimental
Maintainernominolo@gmail.com
Safe HaskellNone

Data.Atom.UF

Description

Symbols without a central symbol table.

Symbols provide the following efficient operations:

  • O(1) equality comparison (in practise) - O(1) ordering comparison (in practise) - O(n) creation

This can be implemented by using a global variable mapping strings to symbols and a counter assigning ids to symbols. However, this has two problems:

  1. It has a space leak. No symbols can ever be removed from this table. For example, if we add the symbol "foo" the first time it might get assigned id 1, if we then delete it and insert it again it might get assigned id 42. However, there may still be symbols in memory which got assigned id 1. Instead, symbols should be garbage collected like other data. Using weak pointers has bad effects on performance due to garbage collector overhead.
  2. It is not reliable to compare symbols created using different symbol tables. They would most likely get assigned different ids.

This implementation of symbols allows *optional* use of a symbol table. If a symbol table is used, this implementation will tend to use less memory and its operations will be a little bit faster at the beginning. For longer runs, it won't make a big difference though, since the representation is self-optimising.

Inspired by Richard OKeefes message to Erlang's eeps mailing list http://www.erlang.org/cgi-bin/ezmlm-cgi/5/057, which in turn was inspired by the Logix implementation of Flat Concurrent Prolog.

  • Implementation

Each symbol is represented a pointer to the symbol info, which consists of:

  • a String * a Hash * a null-able parent pointer to an equivalent symbol info

Creating the same symbol twice will at first be represented as two different entities.

            .----+-------+-----.
   A -----> | 42 | foo | nil |
            ----+-------+-----
   B --.
       '--> .----+-------+-----.
   C -----> | 42 | foo | nil |
            ----+-------+-----

(Note that A, B and C are IORefs.)

When comparing A and B we use the following properties:

  1. If A and B are identical then they must be equal.
  2. If they point to the same object, they must equal.
  3. If they have different hashes, they are different.

Unless there is a hash collision, we can decide equality and ordering for all symbols that have been built with the same hash table.

If the two objects have no parent, have the same hash, and the same string, we now make one the first the parent of the other and update the pointer of B accordingly. If there are no references to the second object left it can now be garbage collected.

If an object already has a parent pointer we follow each object's parents to the roots and compare the roots. This process might again result in updates to A or B and various parent pointers.

In the example above, after A == B we have:

            .----+-------+-----.
   A -----> | 42 | foo | nil |
       .--> ----+-------+-----
   B --'                    ^
            .----+-------+--|--.
   C -----> | 42 | foo |  *  |
            ----+-------+-----

After C == A or C == B we have.

   A -----> .----+-------+-----.
       .--> | 42 | foo | nil |
   B --'.-> ----+-------+-----
        |                   ^
        |   .----+-------+--|--.
   C ---'   | 42 | foo |  *  |
            ----+-------+-----

The second object will now be garbage collected.

In fact, after the first A == B, the remaining updates could use some help from the garbage collector. This could be done by somehow forcibly (and unsafely) replacing the second object by an update frame and then rely on the GC's indirection shortening feature. This is very unsafe, since some code may rely "know" that the object is already evaluated. E.g., C's pointer could be tagged (c.f. "Faster Laziness Using Dynamic Pointer Tagging"). It might work if we can match the physical layout of both structures, but it's equally likely that hell freezes over, so I'll leave that as an exercise for more braver hackers.

  • TODO
  • generalise to arbitrary hashable objects. need not be restricted to String.
  • make thread-safe. (we only need a lock for the uncommon cases)
  • make sure the pointer update code is correct and has no bad cases
  • implement IntMap variant/wrapper that respects that two different objects may have the same key (however unlikely).

Synopsis

Documentation

data Symbol Source

A symbol.

Instances

intern :: String -> SymbolSource

Create a new local symbol. For best performance use internInto together with a symbol table / map.

internInto :: SymTab s => s -> String -> (s, Symbol)Source

Insert a symbol into an existing table.

class SymTab s whereSource