Safe Haskell | None |
---|
NLP.Adict
Contents
Description
This module re-exports main data types and functions from the adict library.
- data Trie a b = Trie {}
- type TrieM a b = Trie a (Maybe b)
- fromList :: Ord a => [([a], b)] -> TrieM a b
- implicitDAWG :: (Ord a, Ord b) => Trie a b -> Trie a b
- data DAWG a b = DAWG {}
- data Node a b = Node {}
- type DAWGM a b = DAWG a (Maybe b)
- fromTrie :: (Ord a, Ord b) => Trie a b -> DAWG a b
- fromDAWG :: Ord a => DAWG a b -> Trie a b
- type Word a = Vector a
- type Pos = Int
- type Weight = Double
- data Cost a = Cost {}
- costDefault :: Eq a => Cost a
- bruteSearch :: Cost a -> Double -> Word a -> [(Word a, b)] -> [(Word a, b, Double)]
- findAll :: Cost a -> Double -> Word a -> TrieM a b -> [([a], b, Double)]
- findNearest :: CostDiv a -> Double -> Word a -> DAWGM a b -> Maybe ([a], b, Double)
Dictionary representation
The library provides two basic data structures used for dictionary
representation. The first one is a Trie
, which can be constructed
from a list of dictionary entries by using the fromList
function.
The trie can be translated into a directed acyclic word graph (DAWG
)
using the fromTrie
function (for the moment it is done in an
inefficient manner, though).
There is also a possibility of constructing an implicit DAWG, i.e. a DAWG
which is algebraically represented by a trie with sharing of common subtries,
by using the implicitDAWG
function (which is also inefficient right now;
in fact, the fromTrie
function uses this one underneath).
Finally, the DAWG can be transformed back to a trie (implicit DAWG) using
the fromDAWG
function.
Trie
A trie of words with character type a
and entry type b
.
It represents a Map
from [a]
keys to b
values.
Constructors
Trie | |
fromList :: Ord a => [([a], b)] -> TrieM a bSource
Construct the trie from the list of (word, value) pairs.
implicitDAWG :: (Ord a, Ord b) => Trie a b -> Trie a bSource
Elminate common subtries. The result is algebraically a trie but is represented as a DAWG in memory.
Directed acyclic word graph
A directed acyclic word graph with character type a
and dictionary
entry type b
. Each node is represented by a unique integer number
which is also an index of the node in the vector of DAWG nodes.
Constructors
DAWG | |
A node in the DAWG.
fromTrie :: (Ord a, Ord b) => Trie a b -> DAWG a bSource
Find and eliminate all common subtries in the input trie and return the trie represented as a DAWG.
fromDAWG :: Ord a => DAWG a b -> Trie a bSource
Transform the DAWG to implicit DAWG in a form of a trie.
Approximate searching
There are three approximate searching methods implemented in
the library. The first one, findAll
, can be used to find
all matches within the given distance from the query word.
The findNearest
function, on the other hand, searches only
for the nearest to the query word match.
The third one, bruteSearch
, is provided only for reference
and testing purposes.
The findAll
function is evaluated against the Trie
while the
findNearest
one is evaluated against the DAWG
.
The reason to make this distinction is that the findNearest
function needs to distinguish between DAG nodes and to know
when the particular node is visited for the second time.
Both methods perform the search with respect to the cost function
specified by the library user, which can be used to customize
weights of edit operations. The Cost
structure provides the
general representation of the cost and it can be used with
the findAll
method. The shortest-path algorithm used in
the background of the findNearest
function is optimized to
use the more informative, CostDiv
cost representation,
which divides edit operations between separate classes with
respect to their weight.
Cost function
Cost represents a cost (or weight) of a symbol insertion, deletion or substitution. It can depend on edit operation position and on symbol values.
costDefault :: Eq a => Cost aSource
Simple cost function: all edit operations cost 1 unit.
Searching methods
bruteSearch :: Cost a -> Double -> Word a -> [(Word a, b)] -> [(Word a, b, Double)]Source
Find all words within a list with restricted generalized edit distance from x lower or equall to k.