| Safe Haskell | Safe-Inferred |
|---|---|
| Language | Haskell2010 |
Symbolize
Contents
Description
Implementation of a global Symbol Table, with garbage collection.
Symbols, also known as Atoms or Interned Strings, are a common technique to reduce memory usage and improve performance when using many small strings.
By storing a single copy of each encountered string in a global table and giving out indexes to that table, it is possible to compare strings for equality in constant time, instead of linear (in string size) time.
The main advantages of Symbolize over existing symbol table implementations are:
- Garbage collection: Symbols which are no longer used are automatically cleaned up.
Symbols have a memory footprint of exactly 1Wordand are nicely unpacked by GHC.- Support for any
Textualtype, includingString, (strict and lazy)Text, (strict and lazy)ByteStringetc. - Thread-safe.
- Efficient: Calls to
lookupanduninternare free of atomic memory barriers (and never have to wait on a concurrent thread runningintern) - Support for a maximum of 2^64 symbols at the same time (you'll probably run out of memory before that point).
Basic usage
This module is intended to be imported qualified, e.g.
import Symbolize (Symbol) import qualified Symbolize
To intern a string, use intern:
>>>hello = Symbolize.intern "hello">>>world = Symbolize.intern "world">>>(hello, world)(Symbolize.intern "hello",Symbolize.intern "world")
Interning supports any Textual type, so you can also use Text or ByteString etc.:
>>>import Data.Text (Text)>>>niceCheeses = fmap Symbolize.intern (["Roquefort", "Camembert", "Brie"] :: [Text])>>>niceCheeses[Symbolize.intern "Roquefort",Symbolize.intern "Camembert",Symbolize.intern "Brie"]
And if you are using OverloadedStrings, you can use the IsString instance to intern constants:
>>>hello2 = ("hello" :: Symbol)>>>hello2Symbolize.intern "hello">>>Symbolize.intern ("world" :: Text)Symbolize.intern "world"
Comparisons between symbols run in O(1) time:
>>>hello == hello2True>>>hello == worldFalse
To get back the textual value of a symbol, use unintern:
>>>Symbolize.unintern hello"hello"
If you want to check whether a string is currently interned, use lookup:
>>>Symbolize.lookup "hello"Just (Symbolize.intern "hello")
Symbols make great keys for HashMap and HashSet.
Hashing them is a no-op and they are guaranteed to be unique:
>>>Data.Hashable.hash hello0>>>fmap Data.Hashable.hash niceCheeses[2,3,4]
For introspection, you can look at how many symbols currently exist:
>>>Symbolize.globalSymbolTableSize5>>>[unintern (intern (show x)) | x <- [1..5]]["1","2","3","4","5"]>>>Symbolize.globalSymbolTableSize10
Unused symbols will be garbage-collected, so you don't have to worry about memory leaks:
>>>System.Mem.performGC>>>Symbolize.globalSymbolTableSize5
For deeper introspection, you can look at the Show instance of the global symbol table: (Note that the exact format is subject to change.)
>>>Symbolize.globalSymbolTableGlobalSymbolTable { count = 5, next = 10, contents = [(0,"hello"),(1,"world"),(2,"Roquefort"),(3,"Camembert"),(4,"Brie")] }
Synopsis
- data Symbol
- intern :: Textual s => s -> Symbol
- unintern :: Textual s => Symbol -> s
- lookup :: Textual s => s -> IO (Maybe Symbol)
- class Textual a where
- toShortText :: a -> ShortText
- fromShortText :: ShortText -> a
- data GlobalSymbolTable
- globalSymbolTable :: IO GlobalSymbolTable
- globalSymbolTableSize :: IO Word
Symbol
A string-like type with O(1) equality and comparison.
A Symbol represents a string (any Textual, so String, Text, ByteString etc.)
However, it only stores an (unpacked) Word, used as index into a global table in which the actual string value is stored.
Thus equality checks are constant-time, and its memory footprint is very low.
This is very useful if you're frequently comparing strings
and the same strings might come up many times.
It also makes Symbol a great candidate for a key in a HashMap or HashSet. (Hashing them is a no-op!)
The symbol table is implemented using weak pointers, which means that unused symbols will be garbage collected. As such, you do not need to be concerned about memory leaks.
Symbols are considered 'the same' regardless of whether they originate
from a String, (lazy or strict, normal or short) Text, (lazy or strict, normal or short) ByteString etc.
Symbolize supports up to 2^64 symbols existing at the same type. Your system will probably run out of memory before you reach that point.
Instances
| IsString Symbol Source # | |
Defined in Symbolize Methods fromString :: String -> Symbol # | |
| Read Symbol Source # | To be a good citizen w.r.t both
|
| Show Symbol Source # | |
| NFData Symbol Source # | Symbol contains only a strict |
| Eq Symbol Source # | Takes only O(1) time. |
| Ord Symbol Source # | Symbols are ordered by their Comparison takes O(n) time, as they are compared byte-by-byte. |
| Hashable Symbol Source # | Hashing a
|
| Display Symbol Source # |
|
Defined in Symbolize Methods displayBuilder :: Symbol -> Builder # displayList :: [Symbol] -> Builder # displayPrec :: Int -> Symbol -> Builder # | |
intern :: Textual s => s -> Symbol Source #
Intern a string-like value.
First converts s to a ShortText (if it isn't already one). See Textual for the type-specific time complexity of this.
Then, takes O(log16 n) time to look up the matching symbol and insert it if it did not exist yet (where n is the number of symbols currently in the table).
Any concurrent calls to (the critical section in) intern are synchronized.
unintern :: Textual s => Symbol -> s Source #
Unintern a symbol, returning its textual value. Takes O(log16 n) time to look up the matching textual value, where n is the number of symbols currently in the table.
Afterwards, the textual value is converted to the desired type s. See Textual for the type-specific time complexity.
Runs concurrently with any other operation on the symbol table, without any atomic memory barriers.
lookup :: Textual s => s -> IO (Maybe Symbol) Source #
Looks up a symbol in the global symbol table.
Returns Nothing if no such symbol currently exists.
Takes O(log16 n) time, where n is the number of symbols currently in the table.
Runs concurrently with any other operation on the symbol table, without any atomic memory barriers.
Because the result can vary depending on the current state of the symbol table, this function is not pure.
class Textual a where Source #
Implemented by any String-like types.
The symbol table uses ShortText for its internal storage, so any type which can be converted to it
can be turned to/from a Symbol.
Instance should handle potential invalid UTF-8 by using the Unicode replacement character,
c.f. lenientDecode.
Instances
| Textual ByteString Source # |
|
Defined in Symbolize.Textual Methods toShortText :: ByteString -> ShortText Source # fromShortText :: ShortText -> ByteString Source # | |
| Textual ShortByteString Source # |
|
Defined in Symbolize.Textual Methods | |
| Textual Text Source # |
|
Defined in Symbolize.Textual | |
| Textual Builder Source # |
|
Defined in Symbolize.Textual | |
| Textual Text Source # |
|
Defined in Symbolize.Textual | |
| Textual ShortText Source # |
|
Defined in Symbolize.Textual Methods toShortText :: ShortText -> ShortText Source # fromShortText :: ShortText -> ShortText Source # | |
| Textual String Source # |
|
Defined in Symbolize.Textual | |
Introspection & Metrics
data GlobalSymbolTable Source #
The global Symbol Table, containing a bidirectional mapping between each symbol's textual representation and its Word index.
You cannot manipulate the table itself directly,
but you can use globalSymbolTable to get a handle to it and use its Show instance for introspection.
globalSymbolTableSize can similarly be used to get the current size of the table.
Instances
| Show GlobalSymbolTable Source # | |
Defined in Symbolize Methods showsPrec :: Int -> GlobalSymbolTable -> ShowS # show :: GlobalSymbolTable -> String # showList :: [GlobalSymbolTable] -> ShowS # | |
globalSymbolTable :: IO GlobalSymbolTable Source #
Returns a handle to the global symbol table. (Only) useful for introspection or debugging.
globalSymbolTableSize :: IO Word Source #
Returns the current size of the global symbol table. Useful for introspection or metrics.