Portability | portable |
---|---|

Stability | experimental |

Maintainer | bos@serpentine.com |

A lazy, efficient suffix tree implementation.

Since many function names (but not the type name) clash with
Prelude names, this module is usually imported `qualified`

, e.g.

import Data.SuffixTree (STree) import qualified Data.SuffixTree as T

The implementation is based on the first of those described in the following paper:

- Robert Giegerich and Stefan Kurtz, "/A comparison of imperative and purely functional suffix tree constructions/", Science of Computer Programming 25(2-3):187-218, 1995, http://citeseer.ist.psu.edu/giegerich95comparison.html

This implementation constructs the suffix tree lazily, so subtrees
are not created until they are traversed. Two construction
functions are provided, `constructWith`

for sequences composed of
small alphabets, and `construct`

for larger alphabets.

Estimates are given for performance. The value *k* is a constant;
*n* is the length of a query string; and *t* is the number of
elements (nodes and leaves) in a suffix tree.

- type Alphabet a = [a]
- type Edge a = (Prefix a, STree a)
- data Prefix a
- data STree a
- constructWith :: Eq a => Alphabet a -> [a] -> STree a
- construct :: Ord a => [a] -> STree a
- elem :: Eq a => [a] -> STree a -> Bool
- findEdge :: Eq a => [a] -> STree a -> Maybe (Edge a, Int)
- findTree :: Eq a => [a] -> STree a -> Maybe (STree a)
- findPath :: Eq a => [a] -> STree a -> [Edge a]
- countLeaves :: STree a -> Int
- countRepeats :: Eq a => [a] -> STree a -> Int
- foldr :: (Prefix a -> b -> b) -> b -> STree a -> b
- foldl :: (a -> Prefix b -> a) -> a -> STree b -> a
- fold :: (a -> a) -> (a -> a) -> (Prefix b -> a -> a -> a) -> (a -> a) -> a -> STree b -> a
- mkPrefix :: [a] -> Prefix a
- prefix :: Prefix a -> [a]
- suffixes :: [a] -> [[a]]

# Types

The list of symbols that `constructWith`

can possibly see in its
input.

# Construction

constructWith :: Eq a => Alphabet a -> [a] -> STree aSource

*O(k n log n)*. Constructs a suffix tree using the given
alphabet. The performance of this function is linear in the size
*k* of the alphabet. That makes this function suitable for small
alphabets, such as DNA nucleotides. For an alphabet containing
more than a few symbols, `construct`

is usually several orders of
magnitude faster.

# Querying

elem :: Eq a => [a] -> STree a -> BoolSource

*O(n)*. Indicates whether the suffix tree contains the given
sublist. Performance is linear in the length *n* of the
sublist.

findEdge :: Eq a => [a] -> STree a -> Maybe (Edge a, Int)Source

*O(n)*. Finds the given subsequence in the suffix tree. On
failure, returns `Nothing`

. On success, returns the `Edge`

in the
suffix tree at which the subsequence ends, along with the number of
elements to drop from the prefix of the `Edge`

to get the "real"
remaining prefix.

Here is an example:

> find "ssip" (construct "mississippi") Just ((mkPrefix "ppi",Leaf),1)

This indicates that the edge `(`

matches,
and that we must strip 1 character from the string `mkPrefix`

"ppi",`Leaf`

)`"ppi"`

to get
the remaining prefix string `"pi"`

.

Performance is linear in the length *n* of the query list.

findTree :: Eq a => [a] -> STree a -> Maybe (STree a)Source

*O(n)*. Finds the subtree rooted at the end of the given query
sequence. On failure, returns `Nothing`

.

Performance is linear in the length *n* of the query list.

findPath :: Eq a => [a] -> STree a -> [Edge a]Source

*O(n)*. Returns the path from the `Edge`

in the suffix tree at
which the given subsequence ends, up to the root of the tree. If
the subsequence is not found, returns the empty list.

Performance is linear in the length of the query list.

countLeaves :: STree a -> IntSource

*O(t)*. Count the number of leaves in a tree.

Performance is linear in the number *t* of elements in the tree.

countRepeats :: Eq a => [a] -> STree a -> IntSource

*O(n + r)*. Count the number of times a sequence is repeated
in the input sequence that was used to construct the suffix tree.

Performance is linear in the length *n* of the input sequence, plus
the number of times *r* the sequence is repeated.

# Traversal

foldr :: (Prefix a -> b -> b) -> b -> STree a -> bSource

*O(t)*. Folds the edges in a tree, using post-order traversal.
Suitable for lazy use.

*O(t)*. Folds the edges in a tree, using pre-order traversal. The
step function is evaluated strictly.

:: (a -> a) | downwards state transformer |

-> (a -> a) | upwards state transformer |

-> (Prefix b -> a -> a -> a) | edge state transformer |

-> (a -> a) | leaf state transformer |

-> a | initial state |

-> STree b | tree |

-> a |

*O(t)*. Generic fold over a tree.

A few simple examples.

countLeaves == fold id id (const const) (1+) 0

countEdges = fold id id (\_ a _ -> a+1) id 0

This more complicated example generates a tree of the same shape, but new type, with annotated leaves.

data GenTree a b = GenNode [(`Prefix`

a, GenTree a b)] | GenLeaf b deriving (`Show`

)

gentree ::`STree`

a -> GenTree a Int gentree =`fold`

reset id fprefix reset leaf where leaf = GenLeaf 1 reset =`const`

leaf fprefix p t (GenLeaf _) = GenNode [(p, t)] fprefix p t (GenNode es) = GenNode ((p, t):es)