Safe Haskell | None |
---|---|

Language | Haskell98 |

Fully minimized and bit-packed directed acyclic word graphs.

This implementation mainly focuses on compactness (<500 Kb space for ~150000 word dictionaries) rather than genericity or dynamic usage. There are no insertion or deletion operations.

A DAWG node is stored in four bytes, using 22 bits for indexing and 8 bits for data storage. This implies that

- The number of nodes shouldn't exceed 2^22, or 4194304.
- Input characters should be mapped to the 0-255 range.

- data Node
- fromList :: [String] -> Node
- fromAscList :: [String] -> Node
- fromFile :: FilePath -> IO Node
- char :: Node -> Char
- endOfWord :: Node -> Bool
- root :: Node -> Node
- children :: Node -> [Node]
- lookupPrefixBy :: (Char -> Char -> Ordering) -> String -> Node -> Maybe Node
- lookupPrefix :: String -> Node -> Maybe Node
- memberBy :: (Char -> Char -> Ordering) -> String -> Node -> Bool
- member :: String -> Node -> Bool
- toList :: Node -> [String]
- toFile :: FilePath -> Node -> IO ()
- pack :: Char -> Bool -> Bool -> Int -> Word32
- unpack :: Word32 -> NodeVector -> Node
- type NodeVector = Vector Word32
- nodeVector :: Node -> NodeVector
- endOfList :: Node -> Bool
- childIndex :: Node -> Word32
- getNodeAt :: NodeVector -> Word32 -> Node

# Types

This data type points to a prefix in the DAWG. When a node is the root node it represents the whole DAWG. When it is non-root, it can be used to access the suffixes of the prefix pointed to by the node.

# Construction

fromAscList :: [String] -> Node Source

Allows for faster DAWG generation than `fromList`

. The input list must be in ascending order, but this is not checked.

# Accessors

lookupPrefixBy :: (Char -> Char -> Ordering) -> String -> Node -> Maybe Node Source

Lookup a prefix by memberwise applying a comparison function. It is useful for
setting case sensitivity, e.g. `insensitiveLookup = lookupPrefixBy (comparing toLower)`

memberBy :: (Char -> Char -> Ordering) -> String -> Node -> Bool Source

Test for membership with a memberwise comparison function.

# Conversions

toList :: Node -> [String] Source

Get the list of all suffixes that end on a valid word ending. When used on the root node this function enlists the original words. The resulting list is unsorted.

# Internal

unpack :: Word32 -> NodeVector -> Node Source

Create a node from a `Word32`

and a `NodeVector`

.

type NodeVector = Vector Word32 Source

The underlying container of the DAWG data. Modifying it will most likely result in an invalid DAWG.

Each `Word32`

represents a node. The format of a node is the following:

- 22 bits: the index of the first child.
- 8 bits: character data.
- 1 bit: end-of-word flag.
- 1 bit: end-of-childlist flag.

The children of a node are laid out next to each other, so they can be iterated over by starting from the first child and incrementing the index until a node with the end-of-childlist flag is found.

nodeVector :: Node -> NodeVector Source

Get the underlying vector from a node.

childIndex :: Node -> Word32 Source

Get the index of a node's first child node.

getNodeAt :: NodeVector -> Word32 -> Node Source

Create a node from some member of a NodeVector.