packed-dawg-0.2.0.7: Generation and traversal of highly compressed directed acyclic word graphs.

Safe HaskellNone
LanguageHaskell98

Data.DAWG.Packed

Contents

Description

Fully minimized and bit-packed directed acyclic word graphs.

This implementation mainly focuses on compactness (<500 Kb space for ~150000 word dictionaries) rather than genericity or dynamic usage. There are no insertion or deletion operations.

A DAWG node is stored in four bytes, using 22 bits for indexing and 8 bits for data storage. This implies that

  • The number of nodes shouldn't exceed 2^22, or 4194304.
  • Input characters should be mapped to the 0-255 range.

Synopsis

Types

data Node Source

This data type points to a prefix in the DAWG. When a node is the root node it represents the whole DAWG. When it is non-root, it can be used to access the suffixes of the prefix pointed to by the node.

Construction

fromList :: [String] -> Node Source

Create a DAWG from a list of words.

fromAscList :: [String] -> Node Source

Allows for faster DAWG generation than fromList. The input list must be in ascending order, but this is not checked.

fromFile :: FilePath -> IO Node Source

Read a DAWG previously serialized with toFile from a file.

Accessors

char :: Node -> Char Source

Get the character of a node. The root nodes have the null character.

endOfWord :: Node -> Bool Source

Indicates whether a prefix pointed to by the node is a valid word.

root :: Node -> Node Source

Get the root node from a node.

children :: Node -> [Node] Source

Generate a list of the direct children of a node.

lookupPrefixBy :: (Char -> Char -> Ordering) -> String -> Node -> Maybe Node Source

Lookup a prefix by memberwise applying a comparison function. It is useful for setting case sensitivity, e.g. insensitiveLookup = lookupPrefixBy (comparing toLower)

lookupPrefix :: String -> Node -> Maybe Node Source

lookupPrefix = lookupPrefixBy (==)

memberBy :: (Char -> Char -> Ordering) -> String -> Node -> Bool Source

Test for membership with a memberwise comparison function.

member :: String -> Node -> Bool Source

member = memberBy (==)

Conversions

toList :: Node -> [String] Source

Get the list of all suffixes that end on a valid word ending. When used on the root node this function enlists the original words. The resulting list is unsorted.

toFile :: FilePath -> Node -> IO () Source

Serialize a DAWG.

Internal

pack :: Char -> Bool -> Bool -> Int -> Word32 Source

Create a bit-packed Word32.

unpack :: Word32 -> NodeVector -> Node Source

Create a node from a Word32 and a NodeVector.

type NodeVector = Vector Word32 Source

The underlying container of the DAWG data. Modifying it will most likely result in an invalid DAWG.

Each Word32 represents a node. The format of a node is the following:

  • 22 bits: the index of the first child.
  • 8 bits: character data.
  • 1 bit: end-of-word flag.
  • 1 bit: end-of-childlist flag.

The children of a node are laid out next to each other, so they can be iterated over by starting from the first child and incrementing the index until a node with the end-of-childlist flag is found.

nodeVector :: Node -> NodeVector Source

Get the underlying vector from a node.

endOfList :: Node -> Bool Source

Indicates whether a node is the last in a list of children nodes.

childIndex :: Node -> Word32 Source

Get the index of a node's first child node.

getNodeAt :: NodeVector -> Word32 -> Node Source

Create a node from some member of a NodeVector.