alfred-margaret-1.1.1.0: Fast Aho-Corasick string searching

Safe HaskellNone
LanguageHaskell2010

Data.Text.AhoCorasick.Automaton

Description

An efficient implementation of the Aho-Corasick string matching algorithm. See http://web.stanford.edu/class/archive/cs/cs166/cs166.1166/lectures/02/Small02.pdf for a good explanation of the algorithm.

The memory layout of the automaton, and the function that steps it, were optimized to the point where string matching compiles roughly to a loop over the code units in the input text, that keeps track of the current state. Lookup of the next state is either just an array index (for the root state), or a linear scan through a small array (for non-root states). The pointer chases that are common for traversing Haskell data structures have been eliminated.

The construction of the automaton has not been optimized that much, because construction time is usually negligible in comparison to matching time. Therefore construction is a two-step process, where first we build the automaton as int maps, which are convenient for incremental construction. Afterwards we pack the automaton into unboxed vectors.

Synopsis

Documentation

data AcMachine v Source #

An Aho-Corasick automaton.

Constructors

AcMachine 

Fields

  • machineValues :: !(Vector [v])

    For every state, the values associated with its needles. If the state is not a match state, the list is empty.

  • machineTransitions :: !(Vector Transition)

    A packed vector of transitions. For every state, there is a slice of this vector that starts at the offset given by machineOffsets, and ends at the first wildcard transition.

  • machineOffsets :: !(Vector Int)

    For every state, the index into machineTransitions where the transition list for that state starts.

  • machineRootAsciiTransitions :: !(Vector Transition)

    A lookup table for transitions from the root state, an optimization to avoid having to walk all transitions, at the cost of using a bit of additional memory.

Instances
Generic (AcMachine v) Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

Associated Types

type Rep (AcMachine v) :: Type -> Type #

Methods

from :: AcMachine v -> Rep (AcMachine v) x #

to :: Rep (AcMachine v) x -> AcMachine v #

NFData v => NFData (AcMachine v) Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

Methods

rnf :: AcMachine v -> () #

type Rep (AcMachine v) Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

type Rep (AcMachine v)

build :: [([CodeUnit], v)] -> AcMachine v Source #

Construct an Aho-Corasick automaton for the given needles. Takes a list of code units rather than Text, to allow mapping the code units before construction, for example to lowercase individual code points, rather than doing proper case folding (which might change the number of code units).

runText :: forall a v. a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a Source #

runLower :: forall a v. a -> (a -> Match v -> Next a) -> AcMachine v -> Text -> a Source #

debugBuildDot :: [[CodeUnit]] -> String Source #

Build the automaton, and format it as Graphviz Dot, for visual debugging.

data CaseSensitivity Source #

Constructors

CaseSensitive 
IgnoreCase 
Instances
Eq CaseSensitivity Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

Show CaseSensitivity Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

Generic CaseSensitivity Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

Associated Types

type Rep CaseSensitivity :: Type -> Type #

Hashable CaseSensitivity Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

ToJSON CaseSensitivity Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

FromJSON CaseSensitivity Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

NFData CaseSensitivity Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

Methods

rnf :: CaseSensitivity -> () #

type Rep CaseSensitivity Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

type Rep CaseSensitivity = D1 (MetaData "CaseSensitivity" "Data.Text.AhoCorasick.Automaton" "alfred-margaret-1.1.1.0-C7p4DoDIXY7azqNtjX433" False) (C1 (MetaCons "CaseSensitive" PrefixI False) (U1 :: Type -> Type) :+: C1 (MetaCons "IgnoreCase" PrefixI False) (U1 :: Type -> Type))

newtype CodeUnitIndex Source #

An index into the raw UTF-16 data of a Text. This is not the code point index as conventionally accepted by Text, so we wrap it to avoid confusing the two. Incorrect index manipulation can lead to surrogate pairs being sliced, so manipulate indices with care. This type is also used for lengths.

Constructors

CodeUnitIndex 

Fields

Instances
Bounded CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf16

Eq CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf16

Num CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf16

Ord CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf16

Show CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf16

Generic CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf16

Associated Types

type Rep CodeUnitIndex :: Type -> Type #

Hashable CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf16

ToJSON CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf16

FromJSON CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf16

NFData CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf16

Methods

rnf :: CodeUnitIndex -> () #

type Rep CodeUnitIndex Source # 
Instance details

Defined in Data.Text.Utf16

type Rep CodeUnitIndex = D1 (MetaData "CodeUnitIndex" "Data.Text.Utf16" "alfred-margaret-1.1.1.0-C7p4DoDIXY7azqNtjX433" True) (C1 (MetaCons "CodeUnitIndex" PrefixI True) (S1 (MetaSel (Just "codeUnitIndex") NoSourceUnpackedness NoSourceStrictness DecidedLazy) (Rec0 Int)))

data Match v Source #

Constructors

Match 

Fields

  • matchPos :: !CodeUnitIndex

    The code unit index past the last code unit of the match. Note that this is not a code *point* (Haskell Char) index; a code point might be encoded as two code units.

  • matchValue :: v

    The payload associated with the matched needle.

Instances
Eq v => Eq (Match v) Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

Methods

(==) :: Match v -> Match v -> Bool #

(/=) :: Match v -> Match v -> Bool #

Show v => Show (Match v) Source # 
Instance details

Defined in Data.Text.AhoCorasick.Automaton

Methods

showsPrec :: Int -> Match v -> ShowS #

show :: Match v -> String #

showList :: [Match v] -> ShowS #

data Next a Source #

Result of handling a match: stepping the automaton can exit early by returning a Done, or it can continue with a new accumulator with Step.

Constructors

Done !a 
Step !a