Copyright | (c) Dong Han 2017-2018 |
---|---|
License | BSD |
Maintainer | winterland1989@gmail.com |
Stability | experimental |
Portability | non-portable |
Safe Haskell | Safe-Inferred |
Language | Haskell2010 |
Synopsis
- findIndices :: Vec v a => (a -> Bool) -> v a -> [Int]
- elemIndices :: (Vec v a, Eq a) => a -> v a -> [Int]
- find :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a)
- findR :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a)
- findIndex :: Vec v a => (a -> Bool) -> v a -> Int
- findIndexR :: Vec v a => (a -> Bool) -> v a -> Int
- filter :: forall v a. Vec v a => (a -> Bool) -> v a -> v a
- partition :: forall v a. Vec v a => (a -> Bool) -> v a -> (v a, v a)
- indicesOverlapping :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int]
- indices :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int]
- elemIndicesBytes :: Word8 -> Bytes -> [Int]
- findByte :: Word8 -> Bytes -> (Int, Maybe Word8)
- findByteR :: Word8 -> Bytes -> (Int, Maybe Word8)
- indicesOverlappingBytes :: Bytes -> Bytes -> Bool -> [Int]
- indicesBytes :: Bytes -> Bytes -> Bool -> [Int]
- kmpNextTable :: (Vec v a, Eq a) => v a -> PrimArray Int
- sundayBloom :: Bytes -> Word64
- elemSundayBloom :: Word64 -> Word8 -> Bool
Element-wise search
findIndices :: Vec v a => (a -> Bool) -> v a -> [Int] Source #
The findIndex
function takes a predicate and a vector and
returns the index of the first element in the vector
satisfying the predicate.
elemIndices :: (Vec v a, Eq a) => a -> v a -> [Int] Source #
O(n) The elemIndices
function extends elemIndex
, by returning
the indices of all elements equal to the query element, in ascending order.
find :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a) Source #
O(n) find the first index and element matching the predicate in a vector from left to right, if there isn't one, return (length of the vector, Nothing).
findR :: Vec v a => (a -> Bool) -> v a -> (Int, Maybe a) Source #
O(n) Find the first index and element matching the predicate in a vector from right to left, if there isn't one, return '(-1, Nothing)'.
filter :: forall v a. Vec v a => (a -> Bool) -> v a -> v a Source #
O(n) filter
, applied to a predicate and a vector,
returns a vector containing those elements that satisfy the
predicate.
partition :: forall v a. Vec v a => (a -> Bool) -> v a -> (v a, v a) Source #
O(n) The partition
function takes a predicate, a vector, returns
a pair of vector with elements which do and do not satisfy the
predicate, respectively; i.e.,
partition p vs == (filter p vs, filter (not . p) vs)
Sub-vector search
:: (Vec v a, Eq a) | |
=> v a | vector to search for ( |
-> v a | vector to search in ( |
-> Bool | report partial match at the end of haystack |
-> [Int] |
O(n+m) Find the offsets of all indices (possibly overlapping) of needle
within haystack
using KMP algorithm.
The KMP algorithm need pre-calculate a shift table in O(m) time and space, the worst case time complexity is O(n+m). Partial apply this function to reuse pre-calculated table between same needles.
Chunked input are support via partial match argument, if set we will return an extra negative index in case of partial match at the end of input chunk, e.g.
indicesOverlapping [ascii|ada|] [ascii|adadad|] True == [0,2,-2]
Where -2
is the length of the partial match part ad
's negation.
If an empty pattern is supplied, we will return every possible index of haystack, e.g.
indicesOverlapping "" "abc" = [0,1,2]
References:
- Knuth, Donald; Morris, James H.; Pratt, Vaughan: "Fast pattern matching in strings" (1977)
- http://www-igm.univ-mlv.fr/~lecroq/string/node8.html#SECTION0080
indices :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int] Source #
O(n+m) Find the offsets of all non-overlapping indices of needle
within haystack
using KMP algorithm.
If an empty pattern is supplied, we will return every possible index of haystack, e.g.
indicesOverlapping "" "abc" = [0,1,2]
Bytes
specialized combinators
elemIndicesBytes :: Word8 -> Bytes -> [Int] Source #
O(n) Special elemIndices
for Bytes
using memchr(3)
indicesOverlappingBytes Source #
:: Bytes | bytes to search for ( |
-> Bytes | bytes to search in ( |
-> Bool | report partial match at the end of haystack |
-> [Int] |
O(n/m) Find the offsets of all indices (possibly overlapping) of needle
within haystack
using KMP algorithm, combined with simplified sunday's
rule to obtain O(n/m) complexity in average use case.
The hybrid algorithm need pre-calculate a shift table in O(m) time and space, and a bad character bloom filter in O(m) time and O(1) space, the worst case time complexity is O(n+m).
References:
- Frantisek FranekChristopher G. JenningsWilliam F. Smyth A Simple Fast Hybrid Pattern-Matching Algorithm (2005)
- D. M. Sunday: A Very Fast Substring Search Algorithm. Communications of the ACM, 33, 8, 132-142 (1990)
- F. Lundh: The Fast Search Algorithm. http://effbot.org/zone/stringlib.htm (2006)
:: Bytes | bytes to search for ( |
-> Bytes | bytes to search in ( |
-> Bool | report partial match at the end of haystack |
-> [Int] |
O(n/m) Find the offsets of all non-overlapping indices of needle
within haystack
using KMP algorithm, combined with simplified sunday's
rule to obtain O(m/n) complexity in average use case.
Helpers
kmpNextTable :: (Vec v a, Eq a) => v a -> PrimArray Int Source #
O(m) Calculate the KMP next shift table.
The shifting rules is: when a mismatch between needle[j]
and haystack[i]
is found, check if next[j] == -1
, if so next search continue with needle[0]
and haystack[i+1]
, otherwise continue with needle[next[j]]
and haystack[i]
.
sundayBloom :: Bytes -> Word64 Source #
O(m) Calculate a simple bloom filter for simplified sunday's rule.
The shifting rules is: when a mismatch between needle[j]
and haystack[i]
is found, check if elemSundayBloom bloom haystack[i+n-j]
, where n is the
length of needle, if not then next search can be safely continued with
haystack[i+n-j+1]
and needle[0]
, otherwise next searh should continue with
haystack[i]
and needle[0]
, or fallback to other shifting rules such as KMP.
The algorithm is very simple: for a given Word8
w
, we set the bloom's bit
at unsafeShiftL 0x01 (w .&. 0x3f)
, so there're three false positives per bit.
This's particularly suitable for search UTF-8 bytes since the significant bits
of a beginning byte is usually the same.