regex-dfa-0.91: Replaces/Enhances Text.RegexContentsIndex
Text.Regex.DFA.Engine
Description

By Chris Kuklewicz (haskell (at) list (dot) mightyreason (dot) com), 2006.

This file is licensed under the LGPL (version 2, see the LICENSE file), because it is a derivative work. This DFAEngine takes the lazy transition table from Manuel Chakravarty's lexer in CTK Light, but uses it for simpler purposes. The original CTK code can be found here http://www.cse.unsw.edu.au/~chak/haskell/ctk/

Don Stewart (http://www.cse.unsw.edu.au/~dons/contact.html) also contributed to this code.

I want the freedom to alter the types a bit, so this is a separate module.

The CTK and DFA code can be thought of as three parts:

1. The ability to compose Regexp combinators which will lazily assemble a DFA. This is mainly bound up in the Cont type and the internal functions that merge it (exported as >||<).

2. The interface of how to specify Failure and Success. This was bound up in LexAction holding an function and is now lexAcceptlesFailurelexContinue.

3. The traversal engine. At each longer and longer match the last seen match is updated. Different traversals keep track of different levels of detail.

As a descendent of the regex-dna entry at http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all, this module has contributions from Don Stewart, Alson Kemp, and Chris Kuklewicz.

Synopsis
data Lexer
= Lexer (Set DoPa) !LexAction Cont
| Predicate {
rIndex :: (Set DoPa)
whichBoundary :: Boundary
atBoundary :: Lexer
notAtBoundary :: Lexer
}
newtype LexAction = LexAction Int
type Regexp = Lexer -> Lexer
data Cont = Done
data Boundary
= BeginLine
| EndLine
| BeginInput
| EndInput
emptyOp :: Regexp
char :: DoPa -> Char -> Regexp
alt :: DoPa -> [Char] -> Regexp
altNot :: DoPa -> [Char] -> Regexp
allChar :: DoPa -> Regexp
beginLine :: DoPa -> Regexp
endLine :: DoPa -> Regexp
beginInput :: DoPa -> Regexp
endInput :: DoPa -> Regexp
(>|<) :: Regexp -> Regexp -> Regexp
orRE :: [Regexp] -> Regexp
(+>) :: Regexp -> Regexp -> Regexp
concatRE :: [Regexp] -> Regexp
quest :: Regexp -> Regexp -> Regexp
star :: Regexp -> Regexp -> Regexp
plus :: Regexp -> Regexp -> Regexp
failure :: Lexer
accept :: Regexp -> Lexer
(>||<) :: Lexer -> Lexer -> Lexer
findRegex :: Lexer -> String -> (String, Int, Maybe (String, Int, String))
matchesRegex :: Lexer -> [Char] -> Bool
testHere :: Lexer -> Int -> Char -> [Char] -> Bool
countRegex :: Lexer -> [Char] -> Int
findRegexS :: Lexer -> String -> (String, Maybe (String, Int, String))
peek :: Cont -> Char -> Lexer
inBounds :: Char -> BoundsNum -> Bool
lexFailure :: LexAction
lexContinue :: LexAction
lexAccept :: LexAction
Documentation
data Lexer

tree structure used to represent the lexer table

each node in the tree corresponds to a of the lexer; the associated actions are those that apply when the corresponding is reached

Constructors
Lexer (Set DoPa) !LexAction Cont
Predicate
rIndex :: (Set DoPa)
whichBoundary :: Boundary
atBoundary :: Lexer
notAtBoundary :: Lexer
show/hide Instances
newtype LexAction
This is interface between the DFA table and the traversal engine, and is simpler than the original CTK version.
Constructors
LexAction Int
show/hide Instances
type Regexp = Lexer -> Lexer
a regular expression
data Cont
Done or a table-like-thing to associate the next character with a Lexer
Constructors
Done
show/hide Instances
data Boundary
Need to encode, as data, the between character decision am I at this boundary?. The different types of boundary checks and their outcomes are all value of Boundary. Currently only the ^ and $ anchors are encoded.
Constructors
BeginLine
EndLine
BeginInput
EndInput
show/hide Instances
emptyOp :: Regexp

These create Regexp

Empty lexeme (noop)

char :: DoPa -> Char -> Regexp
One character regexp
alt :: DoPa -> [Char] -> Regexp
accepts a list of alternative characters Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
altNot :: DoPa -> [Char] -> Regexp
accepts an inverted list of alternative characters Equiv. to `(foldr1 (>|<) . map char) cs', but much faster
allChar :: DoPa -> Regexp
accepts any character
beginLine :: DoPa -> Regexp
endLine :: DoPa -> Regexp
beginInput :: DoPa -> Regexp
endInput :: DoPa -> Regexp
(>|<) :: Regexp -> Regexp -> Regexp

disjunctive combination of two regexps, corresponding to x|y.

This will find the longest match

orRE :: [Regexp] -> Regexp
(+>) :: Regexp -> Regexp -> Regexp

These combine two Regexp's

Concatenation of regexps is just concatenation of functions x +> y corresponds to xy

concatRE :: [Regexp] -> Regexp
quest :: Regexp -> Regexp -> Regexp
x quest y corresponds to the regular expression x?y
star :: Regexp -> Regexp -> Regexp
x star y corresponds to the regular expression x*y self is of type Lexer
plus :: Regexp -> Regexp -> Regexp
x plus y corresponds to the regular expression x+y
failure :: Lexer
accept :: Regexp -> Lexer
Have a match to Regexp be consider a success
(>||<) :: Lexer -> Lexer -> Lexer
disjunctive combination of two lexers (longest match, right biased)
findRegex
:: LexerThe regular expression to match
-> StringThe input string to scan along, looking for a match
-> (String, Int, Maybe (String, Int, String))The input string before the match, length of the string before the match, Nothing if there was no match or Just input string at the start of the match, length of the match, input string starting just past the match

This is the ultra-lazy matching engine. It returns the longest match.

This will not examine any more of the input than needed, checking and returning a character at a time. Once a character is read that leads to no possibility of future match it does not evaluate any deeper.

When a match is found, the input past match is not examined at all.

In the extreme case of the input string being (error _) this will still succeed if the Regexp matches only an empty string since the input will not be demanded at all. The input before matching in this case will be [] and its length is 0, and the length of the match is 0, which the input at start of match and the input past the match will both be (error _).

This loops over matchHere to find the first match

matchesRegex :: Lexer -> [Char] -> Bool
This searches the input string for a match to the regex There is no need to wait for the longest match, so stop at first lexAccept
testHere
:: Lexercurrent lexeme
-> IntOrigin offset
-> Charprevious input character
-> [Char]current input
-> Bool
This checks for a match to the regex starting at the beginning of the input There is no need to wait for the longest match, so stop at first lexAccept
countRegex :: Lexer -> [Char] -> Int
This counts the number of matches to regex in the string, (it checks each possible starting position). This should be the same as ((length (splitRegex re input))-1) but more efficient ^^^ fix
findRegexS :: Lexer -> String -> (String, Maybe (String, Int, String))
This is a version of findRegex that does not compute the length of the prefix
peek :: Cont -> Char -> Lexer
inBounds :: Char -> BoundsNum -> Bool
lexFailure :: LexAction
lexContinue :: LexAction
lexAccept :: LexAction
Produced by Haddock version 0.8