Safe Haskell | None |
---|
Directed acyclic word graphs (DAWGs) are tries (see http://en.wikipedia.org/wiki/Trie) with merged identical nodes.
This implementation mainly focuses on compactness rather than genericity or dynamic usage. There are no insertion or deletion operations and the stored data must be mapped to 8-bit characters. On the flip side of the trade-off we can represent 150000+ word English dictionaries in less than 500 Kb.
This implementation stores a DAWG node in four bytes, using 22 bits for indexing and 8 bits for data storage. This implies that
- The number of nodes shouldn't exceed 2^22, or 4194304.
- Input characters should be mapped to the 0-255 range.
- data Node
- fromList :: [String] -> Node
- fromAscList :: [String] -> Node
- fromFile :: FilePath -> IO Node
- value :: Node -> Char
- endOfWord :: Node -> Bool
- getRoot :: Node -> Node
- getChildren :: Node -> [Node]
- lookupPrefixBy :: (Char -> Char -> Bool) -> String -> Node -> Maybe Node
- lookupPrefix :: String -> Node -> Maybe Node
- elemBy :: (Char -> Char -> Bool) -> String -> Node -> Bool
- elem :: String -> Node -> Bool
- toList :: Node -> [String]
- toFile :: FilePath -> Node -> IO ()
- pack :: Char -> Bool -> Bool -> Int -> Word32
- unpack :: Word32 -> NodeVector -> Node
- type NodeVector = Vector Word32
- nodeVector :: Node -> NodeVector
- endOfList :: Node -> Bool
- childIndex :: Node -> Word32
- getNodeAt :: NodeVector -> Word32 -> Node
Types
This data type points to a prefix in the DAWG. When a node is the root node it represents the whole DAWG. When it is non-root, it can be used to access the suffixes of the prefix pointed to by the node.
Construction
fromAscList :: [String] -> NodeSource
Allows for faster DAWG generation than fromList. The ordering assumption is unchecked and a violation leads to invalid output.
Accessors
Get the character value of a node. The root nodes have the null character as value.
getChildren :: Node -> [Node]Source
Generate a list of the direct children of a node.
lookupPrefixBy :: (Char -> Char -> Bool) -> String -> Node -> Maybe NodeSource
Lookup a prefix by elementwise applying a comparison function. It is useful for
setting case sensitivity, e.g. insensitiveLookup = lookupPrefixBy (comparing toLower)
elemBy :: (Char -> Char -> Bool) -> String -> Node -> BoolSource
Test for membership with a comparison function.
Conversions
toList :: Node -> [String]Source
Get the list of all suffixes that end on a valid word ending. When used on the root node this function enlists the original words. The resulting list is unsorted.
Internal
unpack :: Word32 -> NodeVector -> NodeSource
Create a node from a Word32 and a NodeVector. It is assumed that the Word32 is actually contained in the NodeVector.
type NodeVector = Vector Word32Source
The underlying container of the DAWG data. Modifying it will most likely result in an invalid DAWG.
Each Word32 represents a node. The format of a node is the following:
- 22 bits: the index of the first child.
- 8 bits: character data.
- 1 bit: end-of-word flag.
- 1 bit: end-of-childlist flag.
The children of a node are laid out next to each other, so they can be iterated over by starting from the first child and incrementing the index until a node with the end-of-childlist flag is found.
nodeVector :: Node -> NodeVectorSource
Get the underlying vector from a node.
childIndex :: Node -> Word32Source
Get the index of a node's first child node.
getNodeAt :: NodeVector -> Word32 -> NodeSource
Create a node from some element of a NodeVector.