adict-0.3.0: Approximate dictionary searching

Safe HaskellNone

NLP.Adict

Contents

Description

This module re-exports main data types and functions from the adict library.

Synopsis

Dictionary representation

The library provides two basic data structures used for dictionary representation. The first one is a Trie, which can be constructed from a list of dictionary entries by using the fromList function.

The trie can be translated into a directed acyclic word graph (DAWG) using the fromTrie function (for the moment it is done in an inefficient manner, though).

There is also a possibility of constructing an implicit DAWG, i.e. a DAWG which is algebraically represented by a trie with sharing of common subtries, by using the implicitDAWG function (which is also inefficient right now; in fact, the fromTrie function uses this one underneath).

Finally, the DAWG can be transformed back to a trie (implicit DAWG) using the fromDAWG function.

Trie

data Trie a b Source

A trie of words with character type a and entry type b. It represents a Map from [a] keys to b values.

Constructors

Trie 

Fields

rootValue :: b

Value in the root node.

edgeMap :: Map a (Trie a b)

Edges to subtries annotated with characters.

Instances

Functor (Trie a) 
(Eq a, Eq b) => Eq (Trie a b) 
(Eq (Trie a b), Ord a, Ord b) => Ord (Trie a b) 
(Show a, Show b) => Show (Trie a b) 
(Ord a, Binary a, Binary b) => Binary (Trie a b) 

type TrieM a b = Trie a (Maybe b)Source

A Trie with Maybe values in nodes.

fromList :: Ord a => [([a], b)] -> TrieM a bSource

Construct the trie from the list of (word, value) pairs.

implicitDAWG :: (Ord a, Ord b) => Trie a b -> Trie a bSource

Elminate common subtries. The result is algebraically a trie but is represented as a DAWG in memory.

Directed acyclic word graph

data DAWG a b Source

A directed acyclic word graph with character type a and dictionary entry type b. Each node is represented by a unique integer number which is also an index of the node in the vector of DAWG nodes.

Constructors

DAWG 

Fields

root :: Int

Root (index) of the DAWG

nodes :: Vector (Node a b)

Vector of DAWG nodes

Instances

(Ord a, Binary a, Binary b) => Binary (DAWG a b) 

data Node a b Source

A node in the DAWG.

Constructors

Node 

Fields

valueIn :: b

Value in the node.

subNodes :: Vector (a, Int)

Edges to subnodes (represented by DAWG node indices) annotated with characters.

type DAWGM a b = DAWG a (Maybe b)Source

A DAWGM is a DAWG with Maybe values in nodes.

fromTrie :: (Ord a, Ord b) => Trie a b -> DAWG a bSource

Find and eliminate all common subtries in the input trie and return the trie represented as a DAWG.

fromDAWG :: Ord a => DAWG a b -> Trie a bSource

Transform the DAWG to implicit DAWG in a form of a trie.

Approximate searching

There are three approximate searching methods implemented in the library. The first one, findAll, can be used to find all matches within the given distance from the query word. The findNearest function, on the other hand, searches only for the nearest to the query word match. The third one, bruteSearch, is provided only for reference and testing purposes.

The findAll function is evaluated against the Trie while the findNearest one is evaluated against the DAWG. The reason to make this distinction is that the findNearest function needs to distinguish between DAG nodes and to know when the particular node is visited for the second time.

Both methods perform the search with respect to the cost function specified by the library user, which can be used to customize weights of edit operations. The Cost structure provides the general representation of the cost and it can be used with the findAll method. The shortest-path algorithm used in the background of the findNearest function is optimized to use the more informative, CostDiv cost representation, which divides edit operations between separate classes with respect to their weight.

Cost function

type Word a = Vector aSource

A word parametrized with character type a.

type Pos = IntSource

Position in a sentence.

type Weight = DoubleSource

Cost of edit operation. It has to be a non-negative value!

data Cost a Source

Cost represents a cost (or weight) of a symbol insertion, deletion or substitution. It can depend on edit operation position and on symbol values.

Constructors

Cost 

Fields

insert :: Pos -> a -> Weight
 
delete :: Pos -> a -> Weight
 
subst :: Pos -> a -> a -> Weight
 

costDefault :: Eq a => Cost aSource

Simple cost function: all edit operations cost 1 unit.

Searching methods

bruteSearch :: Cost a -> Double -> Word a -> [(Word a, b)] -> [(Word a, b, Double)]Source

Find all words within a list with restricted generalized edit distance from x lower or equall to k.

findAll :: Cost a -> Double -> Word a -> TrieM a b -> [([a], b, Double)]Source

Find all words within a trie with restricted generalized edit distance lower or equall to k.

findNearest :: CostDiv a -> Double -> Word a -> DAWGM a b -> Maybe ([a], b, Double)Source

We can check, if CostDiv satisfies basic properties. On the other hand, we do not do this for plain Cost function.