Safe Haskell | None |
---|---|
Language | Haskell98 |
Fully minimized and bit-packed directed acyclic word graphs.
This implementation mainly focuses on compactness (<500 Kb space for ~150000 word dictionaries) rather than genericity or dynamic usage. There are no insertion or deletion operations.
A DAWG node is stored in four bytes, using 22 bits for indexing and 8 bits for data storage. This implies that
The number of nodes shouldn't exceed 2^22, or 4194304.
- Input characters should be mapped to the 0-255 range.
- data Node
- fromList :: [String] -> Node
- fromAscList :: [String] -> Node
- fromFile :: FilePath -> IO Node
- char :: Node -> Char
- endOfWord :: Node -> Bool
- root :: Node -> Node
- children :: Node -> [Node]
- lookupPrefixBy :: (Char -> Char -> Ordering) -> String -> Node -> Maybe Node
- lookupPrefix :: String -> Node -> Maybe Node
- memberBy :: (Char -> Char -> Ordering) -> String -> Node -> Bool
- member :: String -> Node -> Bool
- toList :: Node -> [String]
- toFile :: FilePath -> Node -> IO ()
- pack :: Char -> Bool -> Bool -> Int -> Word32
- unpack :: Word32 -> NodeVector -> Node
- type NodeVector = Vector Word32
- nodeVector :: Node -> NodeVector
- endOfList :: Node -> Bool
- childIndex :: Node -> Word32
- getNodeAt :: NodeVector -> Word32 -> Node
Types
This data type points to a prefix in the DAWG. When a node is the root node it represents the whole DAWG. When it is non-root, it can be used to access the suffixes of the prefix pointed to by the node.
Construction
fromAscList :: [String] -> Node Source
Allows for faster DAWG generation than "fromList". The ordering assumption is unchecked.
Accessors
Get the character char of a node. The root nodes have the null character as char.
lookupPrefixBy :: (Char -> Char -> Ordering) -> String -> Node -> Maybe Node Source
Lookup a prefix by memberwise applying a comparison function. It is useful for
setting case sensitivity, e.g. insensitiveLookup = lookupPrefixBy (comparing toLower)
memberBy :: (Char -> Char -> Ordering) -> String -> Node -> Bool Source
Test for membership with a memberwise comparison function.
Conversions
toList :: Node -> [String] Source
Get the list of all suffixes that end on a valid word ending. When used on the root node this function enlists the original words. The resulting list is unsorted.
Internal
unpack :: Word32 -> NodeVector -> Node Source
Create a node from a Word32 and a NodeVector.
type NodeVector = Vector Word32 Source
The underlying container of the DAWG data. Modifying it will most likely result in an invalid DAWG.
Each Word32 represents a node. The format of a node is the following:
22 bits: the index of the first child.
- 8 bits: character data.
- 1 bit: end-of-word flag.
- 1 bit: end-of-childlist flag.
The children of a node are laid out next to each other, so they can be iterated over by starting from the first child and incrementing the index until a node with the end-of-childlist flag is found.
nodeVector :: Node -> NodeVector Source
Get the underlying vector from a node.
childIndex :: Node -> Word32 Source
Get the index of a node's first child node.
getNodeAt :: NodeVector -> Word32 -> Node Source
Create a node from some member of a NodeVector.