Safe Haskell	Safe-Inferred
Language	Haskell2010

Data.Text.BoyerMooreCI.Automaton

Description

An efficient implementation of the Boyer-Moore string search algorithm. See http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140 and https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm for background.

This is the case-insensitive variant of the algorithm which, unlike the case-sensitive variant, has to be aware of the Unicode code points that the bytes represent.

Synopsis

data Automaton
data CaseSensitivity
    = CaseSensitive
    | IgnoreCase
newtype CodeUnitIndex = CodeUnitIndex {
    codeUnitIndex :: Int
}
data Next a
    = Done !a
    | Step !a
buildAutomaton :: Text -> Automaton
patternLength :: Automaton -> CodeUnitIndex
patternText :: Automaton -> Text
runText :: forall a. a -> (a -> CodeUnitIndex -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a
minimumSkipForCodePoint :: CodePoint -> CodeUnitIndex

Documentation

data Automaton

A Boyer-Moore automaton is based on lookup-tables that allow skipping through the haystack. This allows for sub-linear matching in some cases, as we do not have to look at every input character.
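The skipping idea can be illustrated with a simplified, case-sensitive Horspool-style search over a String. This is only a sketch of a bad-character lookup table, not the automaton built by this module; the names badCharTable and horspoolSketch are hypothetical, and plain list operations are used for brevity rather than speed.

```haskell
-- Illustrative sketch only: a case-sensitive Horspool-style search on String.
-- 'badCharTable' and 'horspoolSketch' are hypothetical names, not library API.

-- | For every character of the needle except the last, the distance from
-- that position to the end of the needle. Later occurrences come first, so
-- that 'lookup' returns the smallest (safe) shift.
badCharTable :: String -> [(Char, Int)]
badCharTable needle =
  reverse [ (c, m - 1 - i) | (i, c) <- zip [0 ..] (take (m - 1) needle) ]
  where
    m = length needle

-- | Start positions of non-overlapping matches of the needle in the haystack.
horspoolSketch :: String -> String -> [Int]
horspoolSketch needle haystack
  | m == 0 = []
  | otherwise = go 0
  where
    m = length needle
    n = length haystack
    table = badCharTable needle
    -- Characters that do not occur in the table allow a full-length skip.
    shiftFor c = maybe m id (lookup c table)
    go i
      | i + m > n = []
      | window == needle = i : go (i + m) -- resume after the match
      | otherwise = go (i + shiftFor (last window))
      where
        window = take m (drop i haystack)
```

On a mismatch the search jumps ahead by the table entry for the last character of the current window, which is how whole stretches of the haystack can go unexamined.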

NOTE: Unlike the AcMachine, a Boyer-Moore automaton only returns non-overlapping matches. This means that a Boyer-Moore automaton is not a 100% drop-in replacement for Aho-Corasick.

Returning overlapping matches would degrade the performance to O(nm), where n is the haystack length and m the pattern length, in pathological cases such as finding aaaa in aaaaa....aaaaaa, because for each match it would scan back over all m characters of the pattern.
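The difference between the two reporting modes can be seen with a naive reference search. This is a sketch on String with hypothetical helper names; it does not use the automaton and assumes a non-empty needle.

```haskell
import Data.List (isPrefixOf, tails)

-- Illustrative sketch only; both function names are hypothetical.

-- | Start positions of every occurrence, overlapping ones included.
overlappingMatches :: String -> String -> [Int]
overlappingMatches needle haystack =
  [ i | (i, rest) <- zip [0 ..] (tails haystack), needle `isPrefixOf` rest ]

-- | Start positions of occurrences, resuming the scan after each match.
nonOverlappingMatches :: String -> String -> [Int]
nonOverlappingMatches needle = go 0
  where
    n = length needle
    go _ [] = []
    go i s@(_ : xs)
      | n > 0 && needle `isPrefixOf` s = i : go (i + n) (drop n s)
      | otherwise = go (i + 1) xs
```

For the needle aaaa in a haystack of nine a characters, the first function reports six start positions and the second reports two, which is the behaviour described for the Boyer-Moore automaton above.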

Instances

FromJSON Automaton
Defined in Data.Text.BoyerMooreCI.Automaton. Methods: parseJSON :: Value -> Parser Automaton; parseJSONList :: Value -> Parser [Automaton]
ToJSON Automaton
Defined in Data.Text.BoyerMooreCI.Automaton. Methods: toJSON :: Automaton -> Value; toEncoding :: Automaton -> Encoding; toJSONList :: [Automaton] -> Value; toEncodingList :: [Automaton] -> Encoding
Generic Automaton
Defined in Data.Text.BoyerMooreCI.Automaton. Associated Types: type Rep Automaton :: Type -> Type. Methods: from :: Automaton -> Rep Automaton x; to :: Rep Automaton x -> Automaton
Show Automaton
Defined in Data.Text.BoyerMooreCI.Automaton. Methods: showsPrec :: Int -> Automaton -> ShowS; show :: Automaton -> String; showList :: [Automaton] -> ShowS
NFData Automaton
Defined in Data.Text.BoyerMooreCI.Automaton. Methods: rnf :: Automaton -> ()
Eq Automaton
Defined in Data.Text.BoyerMooreCI.Automaton. Methods: (==) :: Automaton -> Automaton -> Bool; (/=) :: Automaton -> Automaton -> Bool
Hashable Automaton
Defined in Data.Text.BoyerMooreCI.Automaton. Methods: hashWithSalt :: Int -> Automaton -> Int; hash :: Automaton -> Int
type Rep Automaton
Defined in Data.Text.BoyerMooreCI.Automaton. type Rep Automaton

data CaseSensitivity

Constructors

CaseSensitive
IgnoreCase

Instances

FromJSON CaseSensitivity
Defined in Data.Text.CaseSensitivity. Methods: parseJSON :: Value -> Parser CaseSensitivity; parseJSONList :: Value -> Parser [CaseSensitivity]
ToJSON CaseSensitivity
Defined in Data.Text.CaseSensitivity. Methods: toJSON :: CaseSensitivity -> Value; toEncoding :: CaseSensitivity -> Encoding; toJSONList :: [CaseSensitivity] -> Value; toEncodingList :: [CaseSensitivity] -> Encoding
Generic CaseSensitivity
Defined in Data.Text.CaseSensitivity. Associated Types: type Rep CaseSensitivity :: Type -> Type. Methods: from :: CaseSensitivity -> Rep CaseSensitivity x; to :: Rep CaseSensitivity x -> CaseSensitivity
Show CaseSensitivity
Defined in Data.Text.CaseSensitivity. Methods: showsPrec :: Int -> CaseSensitivity -> ShowS; show :: CaseSensitivity -> String; showList :: [CaseSensitivity] -> ShowS
NFData CaseSensitivity
Defined in Data.Text.CaseSensitivity. Methods: rnf :: CaseSensitivity -> ()
Eq CaseSensitivity
Defined in Data.Text.CaseSensitivity. Methods: (==) :: CaseSensitivity -> CaseSensitivity -> Bool; (/=) :: CaseSensitivity -> CaseSensitivity -> Bool
Hashable CaseSensitivity
Defined in Data.Text.CaseSensitivity. Methods: hashWithSalt :: Int -> CaseSensitivity -> Int; hash :: CaseSensitivity -> Int
type Rep CaseSensitivity
Defined in Data.Text.CaseSensitivity. type Rep CaseSensitivity = D1 ('MetaData "CaseSensitivity" "Data.Text.CaseSensitivity" "alfred-margaret-2.1.0.0-GaLGdvCW2mGJuL9TH52qO1" 'False) (C1 ('MetaCons "CaseSensitive" 'PrefixI 'False) (U1 :: Type -> Type) :+: C1 ('MetaCons "IgnoreCase" 'PrefixI 'False) (U1 :: Type -> Type))

newtype CodeUnitIndex

An index into the raw UTF-8 data of a Text. This is not the code point index as conventionally accepted by Text, so we wrap it to avoid confusing the two. Incorrect index manipulation can split the multi-byte encoding of a code point, so manipulate indices with care. This type is also used for lengths.
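To make the distinction concrete, the following sketch pairs each code point of a String with the byte offset at which its UTF-8 encoding would start. It uses plain Int offsets rather than CodeUnitIndex, and the names utf8Width and codeUnitOffsets are hypothetical helpers, not part of this module.

```haskell
import Data.Char (ord)

-- Illustrative sketch only; both names are hypothetical helpers.

-- | Number of UTF-8 code units (bytes) needed to encode one code point.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80 = 1
  | n < 0x800 = 2
  | n < 0x10000 = 3
  | otherwise = 4
  where
    n = ord c

-- | Pair every code point with the byte offset where its encoding starts.
codeUnitOffsets :: String -> [(Int, Char)]
codeUnitOffsets s = zip (scanl (+) 0 (map utf8Width s)) s
```

Only the offsets produced this way are safe places to slice; any offset in between falls inside the encoding of a single code point.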

Constructors

CodeUnitIndex
Fields codeUnitIndex :: Int

Instances

FromJSON CodeUnitIndex
Defined in Data.Text.Utf8. Methods: parseJSON :: Value -> Parser CodeUnitIndex; parseJSONList :: Value -> Parser [CodeUnitIndex]
ToJSON CodeUnitIndex
Defined in Data.Text.Utf8. Methods: toJSON :: CodeUnitIndex -> Value; toEncoding :: CodeUnitIndex -> Encoding; toJSONList :: [CodeUnitIndex] -> Value; toEncodingList :: [CodeUnitIndex] -> Encoding
Bounded CodeUnitIndex
Defined in Data.Text.Utf8. Methods: minBound :: CodeUnitIndex; maxBound :: CodeUnitIndex
Generic CodeUnitIndex
Defined in Data.Text.Utf8. Associated Types: type Rep CodeUnitIndex :: Type -> Type. Methods: from :: CodeUnitIndex -> Rep CodeUnitIndex x; to :: Rep CodeUnitIndex x -> CodeUnitIndex
Num CodeUnitIndex
Defined in Data.Text.Utf8. Methods: (+) :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex; (-) :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex; (*) :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex; negate :: CodeUnitIndex -> CodeUnitIndex; abs :: CodeUnitIndex -> CodeUnitIndex; signum :: CodeUnitIndex -> CodeUnitIndex; fromInteger :: Integer -> CodeUnitIndex
Show CodeUnitIndex
Defined in Data.Text.Utf8. Methods: showsPrec :: Int -> CodeUnitIndex -> ShowS; show :: CodeUnitIndex -> String; showList :: [CodeUnitIndex] -> ShowS
NFData CodeUnitIndex
Defined in Data.Text.Utf8. Methods: rnf :: CodeUnitIndex -> ()
Eq CodeUnitIndex
Defined in Data.Text.Utf8. Methods: (==) :: CodeUnitIndex -> CodeUnitIndex -> Bool; (/=) :: CodeUnitIndex -> CodeUnitIndex -> Bool
Ord CodeUnitIndex
Defined in Data.Text.Utf8. Methods: compare :: CodeUnitIndex -> CodeUnitIndex -> Ordering; (<) :: CodeUnitIndex -> CodeUnitIndex -> Bool; (<=) :: CodeUnitIndex -> CodeUnitIndex -> Bool; (>) :: CodeUnitIndex -> CodeUnitIndex -> Bool; (>=) :: CodeUnitIndex -> CodeUnitIndex -> Bool; max :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex; min :: CodeUnitIndex -> CodeUnitIndex -> CodeUnitIndex
Hashable CodeUnitIndex
Defined in Data.Text.Utf8. Methods: hashWithSalt :: Int -> CodeUnitIndex -> Int; hash :: CodeUnitIndex -> Int
Prim CodeUnitIndex
Defined in Data.Text.Utf8. Methods: sizeOf# :: CodeUnitIndex -> Int#; alignment# :: CodeUnitIndex -> Int#; indexByteArray# :: ByteArray# -> Int# -> CodeUnitIndex; readByteArray# :: MutableByteArray# s -> Int# -> State# s -> (# State# s, CodeUnitIndex #); writeByteArray# :: MutableByteArray# s -> Int# -> CodeUnitIndex -> State# s -> State# s; setByteArray# :: MutableByteArray# s -> Int# -> Int# -> CodeUnitIndex -> State# s -> State# s; indexOffAddr# :: Addr# -> Int# -> CodeUnitIndex; readOffAddr# :: Addr# -> Int# -> State# s -> (# State# s, CodeUnitIndex #); writeOffAddr# :: Addr# -> Int# -> CodeUnitIndex -> State# s -> State# s; setOffAddr# :: Addr# -> Int# -> Int# -> CodeUnitIndex -> State# s -> State# s
type Rep CodeUnitIndex
Defined in Data.Text.Utf8. type Rep CodeUnitIndex = D1 ('MetaData "CodeUnitIndex" "Data.Text.Utf8" "alfred-margaret-2.1.0.0-GaLGdvCW2mGJuL9TH52qO1" 'True) (C1 ('MetaCons "CodeUnitIndex" 'PrefixI 'True) (S1 ('MetaSel ('Just "codeUnitIndex") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 Int)))

data Next a

Constructors

Done !a
Step !a

buildAutomaton :: Text -> Automaton

patternLength :: Automaton -> CodeUnitIndex

Length of the matched pattern measured in UTF-8 code units (bytes).

patternText :: Automaton -> Text

Return the pattern that was used to construct the automaton, O(n).

runText :: forall a. a -> (a -> CodeUnitIndex -> CodeUnitIndex -> Next a) -> Automaton -> Text -> a

Finds all matches in the text, calling the match callback with the first and last byte index of each match of the pattern.
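The shape of the match callback can be sketched without the library. The code below defines a local stand-in for Next and a hypothetical driver, feedMatches, that hands pre-computed (first, last) positions to a callback the way a search loop might. It assumes that Done stops the search and Step continues with the new accumulator, as the constructor names suggest, and uses plain Int instead of CodeUnitIndex.

```haskell
-- Illustrative sketch only. 'Next' here is a local stand-in mirroring the
-- documented type; 'feedMatches', 'collectAll' and 'firstOnly' are
-- hypothetical names, not library API.
data Next a
  = Done !a
  | Step !a

-- | Feed match positions to the callback until it returns 'Done' or the
-- positions run out (assumed semantics: 'Done' stops, 'Step' continues).
feedMatches :: a -> (a -> Int -> Int -> Next a) -> [(Int, Int)] -> a
feedMatches acc _ [] = acc
feedMatches acc f ((first, final) : rest) =
  case f acc first final of
    Done acc' -> acc'
    Step acc' -> feedMatches acc' f rest

-- | Callback that collects every match (most recent first).
collectAll :: [(Int, Int)] -> Int -> Int -> Next [(Int, Int)]
collectAll acc first final = Step ((first, final) : acc)

-- | Callback that stops at the first match.
firstOnly :: Maybe (Int, Int) -> Int -> Int -> Next (Maybe (Int, Int))
firstOnly _ first final = Done (Just (first, final))
```

A callback with the same structure as collectAll or firstOnly, but taking CodeUnitIndex arguments, would be passed to runText together with an initial accumulator, the automaton and the haystack.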

minimumSkipForCodePoint :: CodePoint -> CodeUnitIndex

Number of bytes that we can skip in the haystack if we want to skip no more than 1 pattern codepoint.

It must always be a low (safe) estimate, otherwise the algorithm can miss matches. It must account for any variation of upper/lower case characters that may occur in the haystack. In most cases, this is the same number of bytes as for the given codepoint:

minimumSkipForCodePoint a == 1
minimumSkipForCodePoint д == 2
minimumSkipForCodePoint ⓟ == 3
minimumSkipForCodePoint 🎄 == 4
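One way to picture such a safe estimate is to take the narrowest UTF-8 encoding among a code point and its simple case variants. The sketch below does this with toLower and toUpper from Data.Char; it is an assumption about how a lower bound could be computed, not the implementation used by this module, and it returns a plain Int. The names utf8Width and minimumSkipSketch are hypothetical.

```haskell
import Data.Char (ord, toLower, toUpper)

-- Illustrative sketch only; both names are hypothetical, and the simple
-- case mappings of Data.Char may not cover every variant the real
-- function has to consider.

-- | Number of UTF-8 code units (bytes) needed to encode one code point.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80 = 1
  | n < 0x800 = 2
  | n < 0x10000 = 3
  | otherwise = 4
  where
    n = ord c

-- | Narrowest encoding among the code point and its simple case variants.
minimumSkipSketch :: Char -> Int
minimumSkipSketch c = minimum (map utf8Width [c, toLower c, toUpper c])
```

The estimate differs from the width of the code point itself when a case variant is narrower: U+2C65 (ⱥ) takes three bytes, but its uppercase form U+023A (Ⱥ) takes two, so a haystack containing the uppercase form only allows a skip of two bytes.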