Safe Haskell	Safe-Inferred
Language	Haskell2010

Data.Suffix

Contents

Suffix array
Longest common prefix array
LLCP and RLCP arrays
Suffix tree
Pull
Intn

Description

Implementation

Suffix array construction uses the SAIS algorithm described by

Ge Nong, Sen Zhang, and Wai Hong Chan, "Linear Suffix Array Construction by Almost Pure Induced-Sorting", 2009 Data Compression Conference, https://doi.org/10.1109/DCC.2009.42

LCP array construction uses the $\varPhi$-algorithm described by

Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi "Permuted Longest-Common-Prefix Array", Annual Symposium on Combinatorial Pattern Matching, 2009, https://doi.org/10.1007/978-3-642-02441-2_17

The search algorithms used are described by

Udi Manber and Gene Myers, "Suffix arrays: a new method for on-line string searches", First annual ACM-SIAM symposium on Discrete algorithms, 1990, pp. 319-327, https://dl.acm.org/doi/10.5555/320176.320218

Synopsis

buildSuffixArray :: Intn i => Int -> Pull Int -> SuffixArray i
newtype SuffixArray i = SuffixArray (PrimArray i)
search :: (Ord a, Intn i) => Pull a -> SuffixArray i -> Pull a -> (Int, Int)
buildLCPArray :: (Eq a, Intn i) => Pull a -> SuffixArray i -> LCPArray i
newtype LCPArray i = LCPArray (PrimArray i)
buildLRLCPArray :: Intn i => LCPArray i -> LRLCPArrays i
data LRLCPArrays i = LRLCPArrays !(PrimArray i) !(PrimArray i)
searchLRLCP :: (Ord a, Intn i) => Pull a -> SuffixArray i -> LRLCPArrays i -> Pull a -> (Int, Int)
foldSuffixTree :: (Intn i, Monad m) => (Int -> m a) -> (Int -> m b) -> (Int -> b -> a -> m b) -> (Int -> b -> m a) -> SuffixArray i -> LCPArray i -> m a
data Pull a = Pull !Int (Int -> a)
pullFromByteString :: ByteString -> Pull Word8
pullFromPrimArray :: Prim a => PrimArray a -> Pull a
pullFromArray :: Array a -> Pull a
pullFromArrayLike :: (arr -> Int) -> (arr -> Int -> a) -> arr -> Pull a
class (Prim i, Integral i) => Intn i

Suffix array

buildSuffixArray Source #

Arguments

:: Intn i
=> Int	The alphabet size $k$.
-> Pull Int	Input sequence of length $n$. Indexing is assumed to be $O(1)$. Elements must be in `[0..k-1]`.
-> SuffixArray i	Output type `i` can be `Int` or `Int32`. If `i` is `Int32`, $n$ must be `<= (maxBound :: Int32)`.

$O(n + k)$. Build a suffix array from a sequence.

On 64-bit systems, a SuffixArray Int32 requires half the memory as a SuffixArray Int.

newtype SuffixArray i Source #

Suffix array.

Constructors

SuffixArray (PrimArray i)

Instances

Instances details

(Show i, Prim i) => Show (SuffixArray i) Source #
Instance details Defined in Data.Suffix Methods showsPrec :: Int -> SuffixArray i -> ShowS # show :: SuffixArray i -> String # showList :: [SuffixArray i] -> ShowS #
NFData (SuffixArray i) Source #
Instance details Defined in Data.Suffix Methods rnf :: SuffixArray i -> () #
(Eq i, Prim i) => Eq (SuffixArray i) Source #
Instance details Defined in Data.Suffix Methods (==) :: SuffixArray i -> SuffixArray i -> Bool # (/=) :: SuffixArray i -> SuffixArray i -> Bool #
(Ord i, Prim i) => Ord (SuffixArray i) Source #
Instance details Defined in Data.Suffix Methods compare :: SuffixArray i -> SuffixArray i -> Ordering # (<) :: SuffixArray i -> SuffixArray i -> Bool # (<=) :: SuffixArray i -> SuffixArray i -> Bool # (>) :: SuffixArray i -> SuffixArray i -> Bool # (>=) :: SuffixArray i -> SuffixArray i -> Bool # max :: SuffixArray i -> SuffixArray i -> SuffixArray i # min :: SuffixArray i -> SuffixArray i -> SuffixArray i #

search Source #

Arguments

:: (Ord a, Intn i)
=> Pull a	Sequence of length $n$. Indexing is assumed to be $O(1)$. `compare` for `a` is assumed to be $O(1)$.
-> SuffixArray i	Suffix array for the above sequence
-> Pull a	Pattern of length $m$
-> (Int, Int)	An `(offset, length)` pair, denoting a slice of the suffix array. Beginning at `offset`, `length` suffixes start with the pattern. `length` is 0 if the pattern does not occur in the sequence.

$O(m \log n)$. Search for a pattern in a sequence using its suffix array.

Note: For typical inputs, the worst case is unlikely and the running time is close to $O(m + \log n)$. To get guaranteed $O(m + \log n)$ running time consider using searchLRLCP instead.

Longest common prefix array

buildLCPArray Source #

Arguments

:: (Eq a, Intn i)
=> Pull a	Sequence of length `n`. Indexing is assumed to be $O(1)$. `compare` for `a` is assumed to be $O(1)$.
-> SuffixArray i	Suffix array for the above sequence
-> LCPArray i

$O(n)$. Build a longest common prefix array from a sequence and its suffix array.

The LCP array has the same length as the sequence, $n$. The $0$-th element of the LCP array is $0$. The $i$-th element of the LCP array for $0 < i < n$ is the longest common prefix of the $i$-th and $(i-1)$-th suffix in the suffix array.

newtype LCPArray i Source #

Longest common prefix array.

Constructors

LCPArray (PrimArray i)

Instances

Instances details

(Show i, Prim i) => Show (LCPArray i) Source #
Instance details Defined in Data.Suffix Methods showsPrec :: Int -> LCPArray i -> ShowS # show :: LCPArray i -> String # showList :: [LCPArray i] -> ShowS #
NFData (LCPArray i) Source #
Instance details Defined in Data.Suffix Methods rnf :: LCPArray i -> () #
(Eq i, Prim i) => Eq (LCPArray i) Source #
Instance details Defined in Data.Suffix Methods (==) :: LCPArray i -> LCPArray i -> Bool # (/=) :: LCPArray i -> LCPArray i -> Bool #
(Ord i, Prim i) => Ord (LCPArray i) Source #
Instance details Defined in Data.Suffix Methods compare :: LCPArray i -> LCPArray i -> Ordering # (<) :: LCPArray i -> LCPArray i -> Bool # (<=) :: LCPArray i -> LCPArray i -> Bool # (>) :: LCPArray i -> LCPArray i -> Bool # (>=) :: LCPArray i -> LCPArray i -> Bool # max :: LCPArray i -> LCPArray i -> LCPArray i # min :: LCPArray i -> LCPArray i -> LCPArray i #

LLCP and RLCP arrays

buildLRLCPArray :: Intn i => LCPArray i -> LRLCPArrays i Source #

$O(n)$. Build LLCP and RLCP arrays from an LCP array, for use in searchLRLCP.

data LRLCPArrays i Source #

LLCP and RLCP arrays.

Constructors

LRLCPArrays !(PrimArray i) !(PrimArray i)

Instances

Instances details

(Show i, Prim i) => Show (LRLCPArrays i) Source #
Instance details Defined in Data.Suffix Methods showsPrec :: Int -> LRLCPArrays i -> ShowS # show :: LRLCPArrays i -> String # showList :: [LRLCPArrays i] -> ShowS #
NFData (LRLCPArrays i) Source #
Instance details Defined in Data.Suffix Methods rnf :: LRLCPArrays i -> () #
(Eq i, Prim i) => Eq (LRLCPArrays i) Source #
Instance details Defined in Data.Suffix Methods (==) :: LRLCPArrays i -> LRLCPArrays i -> Bool # (/=) :: LRLCPArrays i -> LRLCPArrays i -> Bool #
(Ord i, Prim i) => Ord (LRLCPArrays i) Source #
Instance details Defined in Data.Suffix Methods compare :: LRLCPArrays i -> LRLCPArrays i -> Ordering # (<) :: LRLCPArrays i -> LRLCPArrays i -> Bool # (<=) :: LRLCPArrays i -> LRLCPArrays i -> Bool # (>) :: LRLCPArrays i -> LRLCPArrays i -> Bool # (>=) :: LRLCPArrays i -> LRLCPArrays i -> Bool # max :: LRLCPArrays i -> LRLCPArrays i -> LRLCPArrays i # min :: LRLCPArrays i -> LRLCPArrays i -> LRLCPArrays i #

searchLRLCP Source #

Arguments

:: (Ord a, Intn i)
=> Pull a	Sequence of length $n$. Indexing is assumed to be $O(1)$. `compare` for `a` is assumed to be $O(1)$.
-> SuffixArray i	Suffix array for the above sequence
-> LRLCPArrays i	LLCP and RLCP arrays for the above sequence
-> Pull a	Pattern sequence of length $m$
-> (Int, Int)	An `(offset, length)` pair, denoting a slice of the suffix array. Beginning at `offset`, `length` suffixes start with the pattern. `length` is 0 if the pattern does not occur in the sequence.

$O(m + \log n)$. Search for a pattern in a sequence using its suffix array and LLCP and RLCP arrays.

Suffix tree

foldSuffixTree Source #

Arguments

:: (Intn i, Monad m)
=> (Int -> m a)	Leaf. The `Int` is the suffix index.
-> (Int -> m b)	Internal node, initialize. The `Int` is the node's string depth.
-> (Int -> b -> a -> m b)	Internal node, combine with the value from a child. The `Int` is the node's string depth.
-> (Int -> b -> m a)	Internal node, finalize. The `Int` is the node's string depth.
-> SuffixArray i	A suffix array derived from a sequence
-> LCPArray i	The LCP array for the same sequence and suffix array
-> m a

$O(n)$. Bottom-up strict monadic fold of the suffix tree formed by the given suffix array and LCP array.

The tree folded over can be considered to be equivalent to the following definition.

data SuffixTree
  = Leaf
      Int          -- ^ The suffix index
  | Branch
      Int          -- ^ The string depth of this node
      [SuffixTree] -- ^ The children of this node

This tree has exactly $n+1$ leaves and at most $n$ internal nodes. The suffix with index $n$, i.e. the empty suffix, is a leaf in this tree, though it is not present in the suffix array.

The calls to the leaf function and the internal node finalizer taken together form a post-order traversal of the tree. For an internal node, the values from its children are combined left-to-right.

The $O(n)$ bound assumes that monadic bind and all given functions are $O(1)$.

Pull

data Pull a Source #

A pull array. Serves as a simple interface for array-like structures.

Note: For good performance, create Pulls right before supplying them to functions in this module.

Constructors

Pull
Fields !Int Length $n$. $n$ must be $\geq 0$. (Int -> a) Indexing function. Must be valid for inputs in `[0 .. n-1]`

Instances

Instances details

Functor Pull Source #
Instance details Defined in Data.Suffix Methods fmap :: (a -> b) -> Pull a -> Pull b # (<$) :: a -> Pull b -> Pull a #

pullFromByteString :: ByteString -> Pull Word8 Source #

Pull from a ByteString (from the bytestring package).

pullFromPrimArray :: Prim a => PrimArray a -> Pull a Source #

Pull from a PrimArray (from the primitive package).

pullFromArray :: Array a -> Pull a Source #

Pull from an Array (from the primitive package).

pullFromArrayLike Source #

Arguments

:: (arr -> Int)	Size function
-> (arr -> Int -> a)	Indexing function
-> arr	The structure
-> Pull a

Pull elements from any array-like structure.

Intn

class (Prim i, Integral i) => Intn i Source #

Allows for a choice between Int and Int32.

Minimal complete definition

toInt, frInt

Instances

Instances details

Intn Int32 Source #
Instance details Defined in Data.Suffix Methods toInt :: Int32 -> Int frInt :: Int -> Int32
Intn Int Source #
Instance details Defined in Data.Suffix Methods toInt :: Int -> Int frInt :: Int -> Int

:: Intn i
=> Int	The alphabet size \(k\).
-> Pull Int	Input sequence of length \(n\). Indexing is assumed to be \(O(1)\). Elements must be in `[0..k-1]`.
-> SuffixArray i	Output type `i` can be `Int` or `Int32`. If `i` is `Int32`, \(n\) must be `<= (maxBound :: Int32)`.

:: (Ord a, Intn i)
=> Pull a	Sequence of length \(n\). Indexing is assumed to be \(O(1)\). `compare` for `a` is assumed to be \(O(1)\).
-> SuffixArray i	Suffix array for the above sequence
-> Pull a	Pattern of length \(m\)
-> (Int, Int)	An `(offset, length)` pair, denoting a slice of the suffix array. Beginning at `offset`, `length` suffixes start with the pattern. `length` is 0 if the pattern does not occur in the sequence.

Pull
Fields !Int Length \(n\). \(n\) must be \(\geq 0\). (Int -> a) Indexing function. Must be valid for inputs in `[0 .. n-1]`