Copyright | (c) Dong Han 2017-2018 |
---|---|

License | BSD |

Maintainer | winterland1989@gmail.com |

Stability | experimental |

Portability | non-portable |

Safe Haskell | None |

Language | Haskell2010 |

## Synopsis

- findIndices :: Vec v a => (a -> Bool) -> v a -> [Int]
- elemIndices :: (Vec v a, Eq a) => a -> v a -> [Int]
- find :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a)
- findR :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a)
- findIndex :: Vec v a => (a -> Bool) -> v a -> Int
- findIndexR :: Vec v a => (a -> Bool) -> v a -> Int
- filter :: forall v a. Vec v a => (a -> Bool) -> v a -> v a
- partition :: forall v a. Vec v a => (a -> Bool) -> v a -> (v a, v a)
- indicesOverlapping :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int]
- indices :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int]
- elemIndicesBytes :: Word8 -> Bytes -> [Int]
- findByte :: Word8 -> Bytes -> (Int, Maybe Word8)
- findByteR :: Word8 -> Bytes -> (Int, Maybe Word8)
- indicesOverlappingBytes :: Bytes -> Bytes -> Bool -> [Int]
- indicesBytes :: Bytes -> Bytes -> Bool -> [Int]
- kmpNextTable :: (Vec v a, Eq a) => v a -> PrimArray Int
- sundayBloom :: Bytes -> Word64
- elemSundayBloom :: Word64 -> Word8 -> Bool

# Element-wise search

findIndices :: Vec v a => (a -> Bool) -> v a -> [Int] Source #

The `findIndex`

function takes a predicate and a vector and
returns the index of the first element in the vector
satisfying the predicate.

elemIndices :: (Vec v a, Eq a) => a -> v a -> [Int] Source #

*O(n)* The `elemIndices`

function extends `elemIndex`

, by returning
the indices of all elements equal to the query element, in ascending order.

find :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a) Source #

*O(n)* find the first index and element matching the predicate in a vector
from left to right, if there isn't one, return (length of the vector, Nothing).

findR :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a) Source #

*O(n)* Find the first index and element matching the predicate
in a vector from right to left, if there isn't one, return '(-1, Nothing)'.

filter :: forall v a. Vec v a => (a -> Bool) -> v a -> v a Source #

*O(n)* `filter`

, applied to a predicate and a vector,
returns a vector containing those elements that satisfy the
predicate.

partition :: forall v a. Vec v a => (a -> Bool) -> v a -> (v a, v a) Source #

*O(n)* The `partition`

function takes a predicate, a vector, returns
a pair of vector with elements which do and do not satisfy the
predicate, respectively; i.e.,

partition p vs == (filter p vs, filter (not . p) vs)

# Sub-vector search

:: (Vec v a, Eq a) | |

=> v a | vector to search for ( |

-> v a | vector to search in ( |

-> Bool | report partial match at the end of haystack |

-> [Int] |

*O(n+m)* Find the offsets of all indices (possibly overlapping) of `needle`

within `haystack`

using KMP algorithm.

The KMP algorithm need pre-calculate a shift table in *O(m)* time and space,
the worst case time complexity is *O(n+m)*. Partial apply this function to
reuse pre-calculated table between same needles.

Chunked input are support via partial match argument, if set we will return an extra negative index in case of partial match at the end of input chunk, e.g.

indicesOverlapping [ascii|ada|] [ascii|adadad|] True == [0,2,-2]

Where `-2`

is the length of the partial match part `ad`

's negation.

If an empty pattern is supplied, we will return every possible index of haystack, e.g.

indicesOverlapping "" "abc" = [0,1,2]

References:

- Knuth, Donald; Morris, James H.; Pratt, Vaughan: "Fast pattern matching in strings" (1977)
- http://www-igm.univ-mlv.fr/~lecroq/string/node8.html#SECTION0080

indices :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int] Source #

*O(n+m)* Find the offsets of all non-overlapping indices of `needle`

within `haystack`

using KMP algorithm.

If an empty pattern is supplied, we will return every possible index of haystack, e.g.

indicesOverlapping "" "abc" = [0,1,2]

`Bytes`

specialized combinators

elemIndicesBytes :: Word8 -> Bytes -> [Int] Source #

*O(n)* Special `elemIndices`

for `Bytes`

using `memchr(3)`

indicesOverlappingBytes Source #

:: Bytes | bytes to search for ( |

-> Bytes | bytes to search in ( |

-> Bool | report partial match at the end of haystack |

-> [Int] |

*O(n/m)* Find the offsets of all indices (possibly overlapping) of `needle`

within `haystack`

using KMP algorithm, combined with simplified sunday's
rule to obtain *O(n/m)* complexity in average use case.

The hybrid algorithm need pre-calculate a shift table in *O(m)* time and space,
and a bad character bloom filter in *O(m)* time and *O(1)* space, the worst case
time complexity is *O(n+m)*.

References:

- Frantisek FranekChristopher G. JenningsWilliam F. Smyth A Simple Fast Hybrid Pattern-Matching Algorithm (2005)
- D. M. Sunday: A Very Fast Substring Search Algorithm. Communications of the ACM, 33, 8, 132-142 (1990)
- F. Lundh: The Fast Search Algorithm. http://effbot.org/zone/stringlib.htm (2006)

:: Bytes | bytes to search for ( |

-> Bytes | bytes to search in ( |

-> Bool | report partial match at the end of haystack |

-> [Int] |

*O(n/m)* Find the offsets of all non-overlapping indices of `needle`

within `haystack`

using KMP algorithm, combined with simplified sunday's
rule to obtain *O(m/n)* complexity in average use case.

# Helpers

kmpNextTable :: (Vec v a, Eq a) => v a -> PrimArray Int Source #

*O(m)* Calculate the KMP next shift table.

The shifting rules is: when a mismatch between `needle[j]`

and `haystack[i]`

is found, check if `next[j] == -1`

, if so next search continue with `needle[0]`

and `haystack[i+1]`

, otherwise continue with `needle[next[j]]`

and `haystack[i]`

.

sundayBloom :: Bytes -> Word64 Source #

*O(m)* Calculate a simple bloom filter for simplified sunday's rule.

The shifting rules is: when a mismatch between `needle[j]`

and `haystack[i]`

is found, check if `elemSundayBloom bloom haystack[i+n-j]`

, where n is the
length of needle, if not then next search can be safely continued with
`haystack[i+n-j+1]`

and `needle[0]`

, otherwise next searh should continue with
`haystack[i]`

and `needle[0]`

, or fallback to other shifting rules such as KMP.

The algorithm is very simple: for a given `Word8`

`w`

, we set the bloom's bit
at `unsafeShiftL 0x01 (w .&. 0x3f)`

, so there're three false positives per bit.
This's particularly suitable for search UTF-8 bytes since the significant bits
of a beginning byte is usually the same.