packed-dawg-0.1.0.0: Generation and traversal of highly compressed directed acyclic word graphs.

Safe HaskellNone

Data.DAWG.Packed

Contents

Description

Directed acyclic word graphs (DAWGs) are tries (see http://en.wikipedia.org/wiki/Trie) with merged identical nodes.

This implementation mainly focuses on compactness rather than genericity or dynamic usage. There are no insertion or deletion operations and the stored data must be mapped to 8-bit characters. On the flip side of the trade-off we can represent 150000+ word English dictionaries in less than 500 Kb.

This implementation stores a DAWG node in four bytes, using 22 bits for indexing and 8 bits for data storage. This implies that

  • The number of nodes shouldn't exceed 2^22, or 4194304.
  • Input characters should be mapped to the 0-255 range.

Synopsis

Types

data Node Source

This data type points to a prefix in the DAWG. When a node is the root node it represents the whole DAWG. When it is non-root, it can be used to access the suffixes of the prefix pointed to by the node.

Construction

fromAscList :: [String] -> NodeSource

Allows for faster DAWG generation than fromList. The ordering assumption is unchecked and a violation leads to invalid output.

fromFile :: FilePath -> IO NodeSource

Read a DAWG previously serialized with toFile from a file.

Accessors

value :: Node -> CharSource

Get the character value of a node. The root nodes have the null character as value.

endOfWord :: Node -> BoolSource

Indicates whether a prefix pointed to by the node is a valid word.

getRoot :: Node -> NodeSource

Get the root node from a node.

getChildren :: Node -> [Node]Source

Generate a list of the direct children of a node.

lookupPrefixBy :: (Char -> Char -> Bool) -> String -> Node -> Maybe NodeSource

Lookup a prefix by elementwise applying a comparison function. It is useful for setting case sensitivity, e.g. insensitiveLookup = lookupPrefixBy (comparing toLower)

lookupPrefix :: String -> Node -> Maybe NodeSource

lookupPrefix = lookupPrefixBy (==)

elemBy :: (Char -> Char -> Bool) -> String -> Node -> BoolSource

Test for membership with a comparison function.

elem :: String -> Node -> BoolSource

elem = elemBy (==)

Conversions

toList :: Node -> [String]Source

Get the list of all suffixes that end on a valid word ending. When used on the root node this function enlists the original words. The resulting list is unsorted.

toFile :: FilePath -> Node -> IO ()Source

Serialize a DAWG.

Internal

pack :: Char -> Bool -> Bool -> Int -> Word32Source

Create a bit-packed Word32.

unpack :: Word32 -> NodeVector -> NodeSource

Create a node from a Word32 and a NodeVector. It is assumed that the Word32 is actually contained in the NodeVector.

type NodeVector = Vector Word32Source

The underlying container of the DAWG data. Modifying it will most likely result in an invalid DAWG.

Each Word32 represents a node. The format of a node is the following:

  • 22 bits: the index of the first child.
  • 8 bits: character data.
  • 1 bit: end-of-word flag.
  • 1 bit: end-of-childlist flag.

The children of a node are laid out next to each other, so they can be iterated over by starting from the first child and incrementing the index until a node with the end-of-childlist flag is found.

nodeVector :: Node -> NodeVectorSource

Get the underlying vector from a node.

endOfList :: Node -> BoolSource

Indicates whether a node is the last in a list of children nodes.

childIndex :: Node -> Word32Source

Get the index of a node's first child node.

getNodeAt :: NodeVector -> Word32 -> NodeSource

Create a node from some element of a NodeVector.