Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
An efficient implementation of the Boyer-Moore string search algorithm. http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140 https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm
This is case insensitive variant of the algorithm which, unlike the case sensitive variant, has to be aware of the unicode code points that the bytes represent.
Synopsis
- data Automaton
- data CaseSensitivity
- newtype CodeUnitIndex = CodeUnitIndex {
- codeUnitIndex :: Int
- data Next a
- buildAutomaton :: Text -> Automaton
- patternLength :: Automaton -> CodeUnitIndex
- patternText :: Automaton -> Text
- runText :: forall a. a -> (a -> CodeUnitIndex -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a
- minimumSkipForCodePoint :: CodePoint -> CodeUnitIndex
Documentation
A Boyer-Moore automaton is based on lookup-tables that allow skipping through the haystack. This allows for sub-linear matching in some cases, as we do not have to look at every input character.
NOTE: Unlike the AcMachine, a Boyer-Moore automaton only returns non-overlapping matches. This means that a Boyer-Moore automaton is not a 100% drop-in replacement for Aho-Corasick.
Returning overlapping matches would degrade the performance to O(nm) in pathological cases like
finding aaaa
in aaaaa....aaaaaa
as for each match it would scan back the whole m characters
of the pattern.
Instances
FromJSON Automaton Source # | |
ToJSON Automaton Source # | |
Defined in Data.Text.BoyerMooreCI.Automaton | |
Generic Automaton Source # | |
Show Automaton Source # | |
NFData Automaton Source # | |
Defined in Data.Text.BoyerMooreCI.Automaton | |
Eq Automaton Source # | |
Hashable Automaton Source # | |
Defined in Data.Text.BoyerMooreCI.Automaton | |
type Rep Automaton Source # | |
Defined in Data.Text.BoyerMooreCI.Automaton |
data CaseSensitivity Source #
Instances
newtype CodeUnitIndex Source #
An index into the raw UTF-8 data of a Text
. This is not the code point
index as conventionally accepted by Text
, so we wrap it to avoid confusing
the two. Incorrect index manipulation can lead to surrogate pairs being
sliced, so manipulate indices with care. This type is also used for lengths.
Instances
buildAutomaton :: Text -> Automaton Source #
patternLength :: Automaton -> CodeUnitIndex Source #
Length of the matched pattern measured in UTF-8 code units (bytes).
patternText :: Automaton -> Text Source #
Return the pattern that was used to construct the automaton, O(n).
runText :: forall a. a -> (a -> CodeUnitIndex -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a Source #
Finds all matches in the text, calling the match callback with the first and last byte index of each match of the pattern.
minimumSkipForCodePoint :: CodePoint -> CodeUnitIndex Source #
Number of bytes that we can skip in the haystack if we want to skip no more than 1 pattern codepoint.
It must always be a low (safe) estimate, otherwise the algorithm can miss matches. It must account for any variation of upper/lower case characters that may occur in the haystack. In most cases, this is the same number of bytes as for the given codepoint
minimumSkipForCodePoint a
== 1
minimumSkipForCodePoint д
== 2
minimumSkipForCodePoint ⓟ
== 3
minimumSkipForCodePoint 🎄
== 4