Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
Symbolize
Contents
Description
Implementation of a global Symbol Table, with garbage collection.
Symbols, also known as Atoms or Interned Strings, are a common technique to reduce memory usage and improve performance when using many small strings.
By storing a single copy of each encountered string in a global table and giving out indexes to that table, it is possible to compare strings for equality in constant time, instead of linear (in string size) time.
The main advantages of Symbolize over existing symbol table implementations are:
- Garbage collection: Symbols which are no longer used are automatically cleaned up.
Symbol
s have a memory footprint of exactly 1Word
and are nicely unpacked by GHC.- Support for any
Textual
type, includingString
, (strict and lazy)Text
, (strict and lazy)ByteString
etc. - Thread-safe.
- Efficient: Calls to
lookup
andunintern
are free of atomic memory barriers (and never have to wait on a concurrent thread runningintern
) - Support for a maximum of 2^64 symbols at the same time (you'll probably run out of memory before that point).
Basic usage
This module is intended to be imported qualified, e.g.
import Symbolize (Symbol) import qualified Symbolize
To intern a string, use intern
:
>>>
hello = Symbolize.intern "hello"
>>>
world = Symbolize.intern "world"
>>>
(hello, world)
(Symbolize.intern "hello",Symbolize.intern "world")
Interning supports any Textual
type, so you can also use Text
or ByteString
etc.:
>>>
import Data.Text (Text)
>>>
niceCheeses = fmap Symbolize.intern (["Roquefort", "Camembert", "Brie"] :: [Text])
>>>
niceCheeses
[Symbolize.intern "Roquefort",Symbolize.intern "Camembert",Symbolize.intern "Brie"]
And if you are using OverloadedStrings, you can use the IsString
instance to intern constants:
>>>
hello2 = ("hello" :: Symbol)
>>>
hello2
Symbolize.intern "hello">>>
Symbolize.intern ("world" :: Text)
Symbolize.intern "world"
Comparisons between symbols run in O(1) time:
>>>
hello == hello2
True>>>
hello == world
False
To get back the textual value of a symbol, use unintern
:
>>>
Symbolize.unintern hello
"hello"
If you want to check whether a string is currently interned, use lookup
:
>>>
Symbolize.lookup "hello"
Just (Symbolize.intern "hello")
Symbols make great keys for HashMap
and HashSet
.
Hashing them is a no-op and they are guaranteed to be unique:
>>>
Data.Hashable.hash hello
0>>>
fmap Data.Hashable.hash niceCheeses
[2,3,4]
For introspection, you can look at how many symbols currently exist:
>>>
Symbolize.globalSymbolTableSize
5>>>
[unintern (intern (show x)) | x <- [1..5]]
["1","2","3","4","5"]>>>
Symbolize.globalSymbolTableSize
10
Unused symbols will be garbage-collected, so you don't have to worry about memory leaks:
>>>
System.Mem.performGC
>>>
Symbolize.globalSymbolTableSize
5
For deeper introspection, you can look at the Show instance of the global symbol table: (Note that the exact format is subject to change.)
>>>
Symbolize.globalSymbolTable
GlobalSymbolTable { count = 5, next = 10, contents = [(0,"hello"),(1,"world"),(2,"Roquefort"),(3,"Camembert"),(4,"Brie")] }
Synopsis
- data Symbol
- intern :: Textual s => s -> Symbol
- unintern :: Textual s => Symbol -> s
- lookup :: Textual s => s -> IO (Maybe Symbol)
- class Textual a where
- toShortText :: a -> ShortText
- fromShortText :: ShortText -> a
- data GlobalSymbolTable
- globalSymbolTable :: IO GlobalSymbolTable
- globalSymbolTableSize :: IO Word
Symbol
A string-like type with O(1) equality and comparison.
A Symbol represents a string (any Textual
, so String, Text, ByteString etc.)
However, it only stores an (unpacked) Word
, used as index into a global table in which the actual string value is stored.
Thus equality checks are constant-time, and its memory footprint is very low.
This is very useful if you're frequently comparing strings
and the same strings might come up many times.
It also makes Symbol a great candidate for a key in a HashMap
or HashSet
. (Hashing them is a no-op!)
The symbol table is implemented using weak pointers, which means that unused symbols will be garbage collected. As such, you do not need to be concerned about memory leaks.
Symbols are considered 'the same' regardless of whether they originate
from a String
, (lazy or strict, normal or short) Text
, (lazy or strict, normal or short) ByteString
etc.
Symbolize supports up to 2^64 symbols existing at the same type. Your system will probably run out of memory before you reach that point.
Instances
IsString Symbol Source # | |
Defined in Symbolize Methods fromString :: String -> Symbol # | |
Read Symbol Source # | To be a good citizen w.r.t both
|
Show Symbol Source # | |
NFData Symbol Source # | Symbol contains only a strict |
Eq Symbol Source # | Takes only O(1) time. |
Ord Symbol Source # | Symbols are ordered by their Comparison takes O(n) time, as they are compared byte-by-byte. |
Hashable Symbol Source # | Hashing a
|
Display Symbol Source # |
|
Defined in Symbolize Methods displayBuilder :: Symbol -> Builder # displayList :: [Symbol] -> Builder # displayPrec :: Int -> Symbol -> Builder # |
intern :: Textual s => s -> Symbol Source #
Intern a string-like value.
First converts s to a ShortText
(if it isn't already one). See Textual
for the type-specific time complexity of this.
Then, takes O(log2(n)) time to look up the matching symbol and insert it if it did not exist yet (where n is the number of symbols currently in the table).
Any concurrent calls to (the critical section in) intern
are synchronized.
unintern :: Textual s => Symbol -> s Source #
Unintern a symbol, returning its textual value. Takes O(log16(n)) time to look up the matching textual value, where n is the number of symbols currently in the table.
Afterwards, the textual value is converted to the desired type s. See Textual
for the type-specific time complexity.
Runs concurrently with any other operation on the symbol table, without any atomic memory barriers.
lookup :: Textual s => s -> IO (Maybe Symbol) Source #
Looks up a symbol in the global symbol table.
Returns Nothing
if no such symbol currently exists.
Takes O(log2(n)) time, where n is the number of symbols currently in the table.
Runs concurrently with any other operation on the symbol table, without any atomic memory barriers.
Because the result can vary depending on the current state of the symbol table, this function is not pure.
class Textual a where Source #
Implemented by any String-like types.
The symbol table uses ShortText
for its internal storage, so any type which can be converted to it
can be turned to/from a Symbol
.
Instance should handle potential invalid UTF-8 by using the Unicode replacement character,
c.f. lenientDecode
.
Instances
Textual ByteString Source # |
|
Defined in Symbolize.Textual Methods toShortText :: ByteString -> ShortText Source # fromShortText :: ShortText -> ByteString Source # | |
Textual ShortByteString Source # |
|
Defined in Symbolize.Textual Methods | |
Textual Text Source # |
|
Defined in Symbolize.Textual | |
Textual Builder Source # |
|
Defined in Symbolize.Textual | |
Textual Text Source # |
|
Defined in Symbolize.Textual | |
Textual ShortText Source # |
|
Defined in Symbolize.Textual Methods toShortText :: ShortText -> ShortText Source # fromShortText :: ShortText -> ShortText Source # | |
Textual String Source # |
|
Defined in Symbolize.Textual |
Introspection & Metrics
data GlobalSymbolTable Source #
The global Symbol Table, containing a bidirectional mapping between each symbol's textual representation and its Word index.
You cannot manipulate the table itself directly,
but you can use globalSymbolTable
to get a handle to it and use its Show
instance for introspection.
globalSymbolTableSize
can similarly be used to get the current size of the table.
Current implementation details (these might change even between PVP-compatible versions):
- A (containers)
Map
is used for mapping text -> symbol. This has O(log2(n)) lookup time, but is resistent to HashDoS attacks. - A (unordered-containers)
HashMap
is used for mapping symbol -> text. This has O(log16(n)) lookup time. Because symbols are unique and their values are not user-generated, there is no danger of HashDoS here.
Instances
Show GlobalSymbolTable Source # | |
Defined in Symbolize Methods showsPrec :: Int -> GlobalSymbolTable -> ShowS # show :: GlobalSymbolTable -> String # showList :: [GlobalSymbolTable] -> ShowS # |
globalSymbolTable :: IO GlobalSymbolTable Source #
Returns a handle to the global symbol table. (Only) useful for introspection or debugging.
globalSymbolTableSize :: IO Word Source #
Returns the current size of the global symbol table. Useful for introspection or metrics.