packed-dawg-0.2.0.3: Generation and traversal of highly compressed directed acyclic word graphs.

Safe HaskellNone

Data.DAWG.Packed

Contents

Description

Fully minimized and bit-packed directed acyclic word graphs.

This implementation mainly focuses on compactness (<500 Kb space for ~150000 word dictionaries) rather than genericity or dynamic usage. There are no insertion or deletion operations.

A DAWG node is stored in four bytes, using 22 bits for indexing and 8 bits for data storage. This implies that

  • The number of nodes shouldn't exceed 2^22, or 4194304.
  • Input characters should be mapped to the 0-255 range.

Synopsis

Types

data Node Source

This data type points to a prefix in the DAWG. When a node is the root node it represents the whole DAWG. When it is non-root, it can be used to access the suffixes of the prefix pointed to by the node.

Construction

fromAscList :: [String] -> NodeSource

Allows for faster DAWG generation than fromList. The ordering assumption is unchecked.

fromFile :: FilePath -> IO NodeSource

Read a DAWG previously serialized with toFile from a file.

Accessors

char :: Node -> CharSource

Get the character char of a node. The root nodes have the null character as char.

endOfWord :: Node -> BoolSource

Indicates whether a prefix pointed to by the node is a valid word.

root :: Node -> NodeSource

Get the root node from a node.

children :: Node -> [Node]Source

Generate a list of the direct children of a node.

lookupPrefixBy :: (Char -> Char -> Ordering) -> String -> Node -> Maybe NodeSource

Lookup a prefix by memberwise applying a comparison function. It is useful for setting case sensitivity, e.g. insensitiveLookup = lookupPrefixBy (comparing toLower)

lookupPrefix :: String -> Node -> Maybe NodeSource

lookupPrefix = lookupPrefixBy (==)

memberBy :: (Char -> Char -> Ordering) -> String -> Node -> BoolSource

Test for membership with a memberwise comparison function.

member :: String -> Node -> BoolSource

member = memberBy (==)

Conversions

toList :: Node -> [String]Source

Get the list of all suffixes that end on a valid word ending. When used on the root node this function enlists the original words. The resulting list is unsorted.

toFile :: FilePath -> Node -> IO ()Source

Serialize a DAWG.

Internal

pack :: Char -> Bool -> Bool -> Int -> Word32Source

Create a bit-packed Word32.

unpack :: Word32 -> NodeVector -> NodeSource

Create a node from a Word32 and a NodeVector.

type NodeVector = Vector Word32Source

The underlying container of the DAWG data. Modifying it will most likely result in an invalid DAWG.

Each Word32 represents a node. The format of a node is the following:

  • 22 bits: the index of the first child.
  • 8 bits: character data.
  • 1 bit: end-of-word flag.
  • 1 bit: end-of-childlist flag.

The children of a node are laid out next to each other, so they can be iterated over by starting from the first child and incrementing the index until a node with the end-of-childlist flag is found.

nodeVector :: Node -> NodeVectorSource

Get the underlying vector from a node.

endOfList :: Node -> BoolSource

Indicates whether a node is the last in a list of children nodes.

childIndex :: Node -> Word32Source

Get the index of a node's first child node.

getNodeAt :: NodeVector -> Word32 -> NodeSource

Create a node from some member of a NodeVector.