symbolize-0.1.0.3: Efficient global Symbol table, with Garbage Collection.
Safe HaskellSafe-Inferred
LanguageHaskell2010

Symbolize

Description

Implementation of a global Symbol Table, with garbage collection.

Symbols, also known as Atoms or Interned Strings, are a common technique to reduce memory usage and improve performance when using many small strings.

By storing a single copy of each encountered string in a global table and giving out indexes to that table, it is possible to compare strings for equality in constant time, instead of linear (in string size) time.

The main advantages of Symbolize over existing symbol table implementations are:

  • Garbage collection: Symbols which are no longer used are automatically cleaned up.
  • Symbols have a memory footprint of exactly 1 Word and are nicely unpacked by GHC.
  • Support for any Textual type, including String, (strict and lazy) Text, (strict and lazy) ByteString etc.
  • Thread-safe.
  • Efficient: Calls to lookup and unintern are free of atomic memory barriers (and never have to wait on a concurrent thread running intern)
  • Support for a maximum of 2^64 symbols at the same time (you'll probably run out of memory before that point).

Basic usage

This module is intended to be imported qualified, e.g.

import Symbolize (Symbol)
import qualified Symbolize

To intern a string, use intern:

>>> hello = Symbolize.intern "hello"
>>> world = Symbolize.intern "world"
>>> (hello, world)
(Symbolize.intern "hello",Symbolize.intern "world")

Interning supports any Textual type, so you can also use Text or ByteString etc.:

>>> import Data.Text (Text)
>>> niceCheeses = fmap Symbolize.intern (["Roquefort", "Camembert", "Brie"] :: [Text])
>>> niceCheeses
[Symbolize.intern "Roquefort",Symbolize.intern "Camembert",Symbolize.intern "Brie"]

And if you are using OverloadedStrings, you can use the IsString instance to intern constants:

>>> hello2 = ("hello" :: Symbol)
>>> hello2
Symbolize.intern "hello"
>>> Symbolize.intern ("world" :: Text)
Symbolize.intern "world"

Comparisons between symbols run in O(1) time:

>>> hello == hello2
True
>>> hello == world
False

To get back the textual value of a symbol, use unintern:

>>> Symbolize.unintern hello
"hello"

If you want to check whether a string is currently interned, use lookup:

>>> Symbolize.lookup "hello"
Just (Symbolize.intern "hello")

Symbols make great keys for HashMap and HashSet. Hashing them is a no-op and they are guaranteed to be unique:

>>> Data.Hashable.hash hello
0
>>> fmap Data.Hashable.hash niceCheeses
[2,3,4]

For introspection, you can look at how many symbols currently exist:

>>> Symbolize.globalSymbolTableSize
5
>>> [unintern (intern (show x)) | x <- [1..5]]
["1","2","3","4","5"]
>>> Symbolize.globalSymbolTableSize
10

Unused symbols will be garbage-collected, so you don't have to worry about memory leaks:

>>> System.Mem.performGC
>>> Symbolize.globalSymbolTableSize
5

For deeper introspection, you can look at the Show instance of the global symbol table: (Note that the exact format is subject to change.)

>>> Symbolize.globalSymbolTable
GlobalSymbolTable { count = 5, next = 10, contents = [(0,"hello"),(1,"world"),(2,"Roquefort"),(3,"Camembert"),(4,"Brie")] }
Synopsis

Symbol

data Symbol Source #

A string-like type with O(1) equality and comparison.

A Symbol represents a string (any Textual, so String, Text, ByteString etc.) However, it only stores an (unpacked) Word, used as index into a global table in which the actual string value is stored. Thus equality checks are constant-time, and its memory footprint is very low.

This is very useful if you're frequently comparing strings and the same strings might come up many times. It also makes Symbol a great candidate for a key in a HashMap or HashSet. (Hashing them is a no-op!)

The symbol table is implemented using weak pointers, which means that unused symbols will be garbage collected. As such, you do not need to be concerned about memory leaks.

Symbols are considered 'the same' regardless of whether they originate from a String, (lazy or strict, normal or short) Text, (lazy or strict, normal or short) ByteString etc.

Symbolize supports up to 2^64 symbols existing at the same type. Your system will probably run out of memory before you reach that point.

Instances

Instances details
IsString Symbol Source # 
Instance details

Defined in Symbolize

Methods

fromString :: String -> Symbol #

Read Symbol Source #

To be a good citizen w.r.t both Show and IsString, reading is supported two ways:

>>> read @Symbol "Symbolize.intern \"Haskell\""
Symbolize.intern "Haskell"
>>> read @Symbol "\"Curry\""
Symbolize.intern "Curry"
Instance details

Defined in Symbolize

Show Symbol Source # 
Instance details

Defined in Symbolize

NFData Symbol Source #

Symbol contains only a strict Word, so it is already fully evaluated.

Instance details

Defined in Symbolize

Methods

rnf :: Symbol -> () #

Eq Symbol Source #

Takes only O(1) time.

Instance details

Defined in Symbolize

Methods

(==) :: Symbol -> Symbol -> Bool #

(/=) :: Symbol -> Symbol -> Bool #

Ord Symbol Source #

Symbols are ordered by their ShortText representation.

Comparison takes O(n) time, as they are compared byte-by-byte.

Instance details

Defined in Symbolize

Hashable Symbol Source #

Hashing a Symbol is very fast:

hash is a no-op and results in zero collissions, as Symbol's index is unique and can be interpreted as a hash as-is.

hashWithSalt takes O(1) time; just as long as hashWithSalt-ing any other Word.

Instance details

Defined in Symbolize

Methods

hashWithSalt :: Int -> Symbol -> Int #

hash :: Symbol -> Int #

Display Symbol Source #
>>> Data.Text.Display.display (Symbolize.intern "Pizza")
"Pizza"
Instance details

Defined in Symbolize

intern :: Textual s => s -> Symbol Source #

Intern a string-like value.

First converts s to a ShortText (if it isn't already one). See Textual for the type-specific time complexity of this. Then, takes O(log2(n)) time to look up the matching symbol and insert it if it did not exist yet (where n is the number of symbols currently in the table).

Any concurrent calls to (the critical section in) intern are synchronized.

unintern :: Textual s => Symbol -> s Source #

Unintern a symbol, returning its textual value. Takes O(log16(n)) time to look up the matching textual value, where n is the number of symbols currently in the table.

Afterwards, the textual value is converted to the desired type s. See Textual for the type-specific time complexity.

Runs concurrently with any other operation on the symbol table, without any atomic memory barriers.

lookup :: Textual s => s -> IO (Maybe Symbol) Source #

Looks up a symbol in the global symbol table.

Returns Nothing if no such symbol currently exists.

Takes O(log2(n)) time, where n is the number of symbols currently in the table.

Runs concurrently with any other operation on the symbol table, without any atomic memory barriers.

Because the result can vary depending on the current state of the symbol table, this function is not pure.

class Textual a where Source #

Implemented by any String-like types. The symbol table uses ShortText for its internal storage, so any type which can be converted to it can be turned to/from a Symbol.

Instance should handle potential invalid UTF-8 by using the Unicode replacement character, c.f. lenientDecode.

Instances

Instances details
Textual ByteString Source #
  • toShortText: O(n). Turns invalid UTF-8 into the Unicode replacement character.
  • fromShortText: O(n).
Instance details

Defined in Symbolize.Textual

Textual ShortByteString Source #
  • toShortText: O(n). Turns invalid UTF-8 into the Unicode replacement character.
  • fromShortText: O(0) no-op
Instance details

Defined in Symbolize.Textual

Textual Text Source #
  • O(1) conversion
Instance details

Defined in Symbolize.Textual

Textual Builder Source #
  • toShortText: O(n). Evaluates the entire builder.
  • fromShortText: O(1)
Instance details

Defined in Symbolize.Textual

Textual Text Source #
  • O(1) conversion
Instance details

Defined in Symbolize.Textual

Textual ShortText Source #
  • O(0) conversion (a no-op)
Instance details

Defined in Symbolize.Textual

Textual String Source #
  • O(n) conversion
Instance details

Defined in Symbolize.Textual

Introspection & Metrics

data GlobalSymbolTable Source #

The global Symbol Table, containing a bidirectional mapping between each symbol's textual representation and its Word index.

You cannot manipulate the table itself directly, but you can use globalSymbolTable to get a handle to it and use its Show instance for introspection.

globalSymbolTableSize can similarly be used to get the current size of the table.

Current implementation details (these might change even between PVP-compatible versions):

  • A (containers) Map is used for mapping text -> symbol. This has O(log2(n)) lookup time, but is resistent to HashDoS attacks.
  • A (unordered-containers) HashMap is used for mapping symbol -> text. This has O(log16(n)) lookup time. Because symbols are unique and their values are not user-generated, there is no danger of HashDoS here.

Instances

Instances details
Show GlobalSymbolTable Source # 
Instance details

Defined in Symbolize

globalSymbolTable :: IO GlobalSymbolTable Source #

Returns a handle to the global symbol table. (Only) useful for introspection or debugging.

globalSymbolTableSize :: IO Word Source #

Returns the current size of the global symbol table. Useful for introspection or metrics.