AUTHOR
 Dr. Alistair Ward
DESCRIPTION
 This implementation of extended regexes, generalises the familiar concept of patternmatching of characterstrings, to matching lists composed from an arbitrary datatype.
The polymorphic data, from which the input datalist is composed, need only support
Eq
&Show
.  Because of the unknown stringified form of the underlying polymorphic data,
the regex must be described by a comparatively verbose bracketed & commaseparated list, rather than the traditional String containing Metacharacters.
Each element of this
Concatenation
is aRepeatablePattern
, which describes a permissible match againstInputData
. 
RepeatablePattern
can take one of two forms. In the simplest case, it matches just a single item of the underlying polymorphic type, perhaps literally, though looser specifications also exist:.
matches any input datum;[x, y, z]
matches any ofx
,y
, orz
;[^x, y, z]
matches anything butx
,y
, orz
. To support POSIX EREs,RepeatablePattern
can also be a listAlternatives
, each of which is recursively defined as anExtendedRegEx
, to form a treestructure.  Each
Pattern
, can optionally be quantified by either a traditional greedy, or a Perlstyle nongreedy, suffix, e.g.;[*, +, ?, {n, m}, {n,}, {n}, *?, +?, ??, {n, m}?, {n,}?]
.  For convenience, common specifications can be canned & assigned a single
Char
mnemonic, for subsequent reference. SinceExtendedRegEx
is polymorphic, the set of abbreviations appropriate in the context of the unspecified basetype, must be implemented externally through theShortcutExpander
interface. This permits the use, when the typeparameter isChar
, of Perlstyle shortcuts[\d\D\s\S\w\W]
.  The algorithm, is the classic backtracking one, rather than either a DFA or NFA.
This permits construction of
Result
via which one can discover the deep mapping ofInputData
intoExtendedRegEx
, & provides the flexibility to add the features now expected by modern regexengines. Since the typeparameter is unknown, & may represent a large object, the exponential spacecomplexity of creating a DFA may present additional problems. The exponential timecomplexity of the backtracking algorithm is partially tamed by targeting obvious inefficiencies with specific optimisations.  Charbased regexen, traditionally overload the delimiters of a set of
Alternatives
(parentheses), as a request for datacapture. Here, in contrast, allRepeatablePattern
s capture data, & repeated subexpressions capture a list of data, rather than arbitrarily recording just the last (http://www.opengroup.org/onlinepubs/009695399/functions/regcomp.html) item.
REFERENCES
 http://en.wikipedia.org/wiki/Regular_expression
 http://swtch.com/~rsc/regexp/regexp1.html
 http://docstore.mik.ua/orelly/perl/prog/ch02_04.htm#PERL2CH2SECT4.1.2
 http://www.macs.hw.ac.uk/~dsg/gph/papers/html/Strategies/strategies.html
 http://hackage.haskell.org/packages/archive/regexposix/latest/doc/html/TextRegexPosix.html
 http://www.haskell.org/haskellwiki/Regular_expressions
 http://www2.research.att.com/~gsf/testregex/reinterpretation.html
CAVEATS
 Because of the definition of mutually recursive datatypes, it is difficult to split this annoyingly large module, & preserve compatibility across compilers, but it may be possible to break this cyclic dependency, by defining an interface to which one of the datatypes defined here conforms.
 Doesn't implement Backreferences, making the definition of the
ExtendedRegEx
contextfree.  There's no integration with the typeclasses defined in Text.Regex.Base.RegexLike, which assumes
Char
basedInputData
; though this could be added to a specialised instance.  When
Alternatives
are defined,Result
becomes a treelike structure. Unless the alternative is a singleton, the specific alternative selected in the solution is typically unknown, & therefore the structure of the branch of this tree is also unknown. This lack of clarity is compounded when theAlternatives
areRepeatable
, since a different one may be selected on each successive repetition. Consequently, the user can't navigate this portion of the structure in a statically defined manner, to acquire the captured data. Despite this, & in contrast to other regexengines, access to the whole datastructure is available, since it doesn't seem advantage to hide it. The user can then either useextractDataFromMatch
for that element ofResult
, thus aggregating the data from sections of unknown structure, orshow
it, as an aid to debugging.
TODO
 Test paralleloperation, on a 3 or more processor machine.
If
rnf
is less effective thanrwhnf
, then theNFData
context can be removed, reducing the requirements imposed on the typeparametera
.  Try
Data.List.Stream
(streamfusion), a faster dropin replacement forData.List
; possibly integrated in GHC6.12. 
bypassInputDataForLiberalConsumer
is too restrictive. More generally, we can test whether the set of differenta
inInputData
, is a subset of those common to all remaining terms in theExtendedRegEx
. Using this rule, we can inferaaa ... =~ MkExtendedRegEx [a,a+,a?,[ab]{2,3}]
, given compatibleconsumptionBounds
.  Nested repetitions, where nothing has been added to the expression, result in repeated trials of the same expression,
e.g.;
(x{i,}){j,}
results in the same expansion for(i, j) in [(2, 3), (3, 2), (6, 1), (1, 6)]
. The resultingMatchList
may be different, but if the first such trial fails, so will all the remainder.  Should cope with empty sets of
Alternatives
& zero repetitions, neither of which can ever match, but the wider pattern can, e.g.(()x{0}y)
.  By removing
RepeatablePattern
fromMatch
, it can be isolated in a new module. This would result in a significant loss of discoverability.  Expand repeated
Bow
withfewest  1
null matches followed by recursivefindMatch
call withrepetitions = 1
.
 class ShortcutExpander m where
 expand :: Char > ExtendedRegEx m
 type Concatenation m = [RepeatablePattern m]
 type ExternalMatch m = Maybe (Match m)
 type InputData m = [m]
 type MatchedData m = (RepeatablePattern m, DataLength, InputData m)
 type MatchList m = [Match m]
 type RepeatablePattern m = Repeatable (Pattern m)
 newtype Alternatives m = MkAlternatives [ExtendedRegEx m]
 type Match m = Tree (MatchedData m)
 data ExtendedRegEx m = MkExtendedRegEx {}
 data Pattern m
 = Require (Meta m)
  CaptureGroup (Alternatives m)
 type Result m = (ExternalMatch m, Maybe (MatchList m), ExternalMatch m)
 alternativeExtendedRegExSeparatorToken :: Char
 captureGroupDelimiters :: (Char, Char)
 tokens :: String
 dock :: Transformation m
 captureGroup :: [ExtendedRegEx m] > Pattern m
 shiftMatchList :: DataLength > MatchList m > MatchList m
 showsMaybeAnchor :: Maybe Anchor > String > String
 simply :: Meta m > RepeatablePattern m
 transformExtendedRegEx :: (Concatenation m > Concatenation m) > Transformation m
 deconstructAlternatives :: Alternatives m > [ExtendedRegEx m]
 (+~) :: (Eq m, NFData m, Show m) => InputData m > RegExOpts (ExtendedRegEx m) > Result m
 (=~) :: (Eq m, NFData m, Show m) => InputData m > RegExOpts (ExtendedRegEx m) > Bool
 (/~) :: (Eq m, NFData m, Show m) => InputData m > RegExOpts (ExtendedRegEx m) > Bool
 (.*) :: RepeatablePattern m
 (.*?) :: RepeatablePattern m
 isDefined :: ExtendedRegEx m > Bool
 isCaptureGroup :: Pattern m > Bool
 isSingletonAlternatives :: Alternatives m > Bool
 externalMatchLength :: ExternalMatch m > DataLength
 extractDataFromMatch :: Match m > InputData m
 extractDataFromMatch' :: Maybe (Match m) > InputData m
 extractDataFromMatchList :: MatchList m > InputData m
Typeclasses
class ShortcutExpander m whereSource
 Defines the method required to expand a mnemonic into an
ExtendedRegEx
.  CAVEAT: this interface must be declared locally, since it references
ExtendedRegEx
, &ExtendedRegEx
references it.
:: Char  
> ExtendedRegEx m  Expand a single 
Types
Typesynonyms
type Concatenation m = [RepeatablePattern m]Source
Represents the concatenation aspect of ExtendedRegEx
s.
type ExternalMatch m = Maybe (Match m)Source
At the toplevel of an ExtendedRegEx
, the lack of an Anchor
allows the ExtendedRegEx
to drift away from the corresponding end of the inputdata; this datagap is captured here.
 The inputdata is just a list.
 Whilst typically this list is also a
String
, & could therefore be more efficiently implemented using Data.ByteString, we can't assume that the polymorphic basetype is alwaysChar
.
type MatchedData m = (RepeatablePattern m, DataLength, InputData m)Source
Tag the InputData
with the RepeatablePattern
it matched (which unfortunately confines the definition to this (bloated) module), & the offset from the start of the data;
type MatchList m = [Match m]Source
Describes the manner in which a Concatenation
successfully consumed InputData
.
type RepeatablePattern m = Repeatable (Pattern m)Source
Make Pattern
s, Repeatable
.
Datatypes
newtype Alternatives m Source
 Represents the alternation feature of
ExtendedRegEx
s.  One could amalgamate this with
Pattern
, since it seems to exist merely as a peg to hang instancedeclarations from.
Eq m => Eq (Alternatives m)  
(ShortcutExpander m, ShortcutExpander m, Eq m, Read m) => Read (Alternatives m)  
Show m => Show (Alternatives m)  
NFData m => NFData (Alternatives m)  
SelfValidator (Alternatives m)  
Consumer (Alternatives m) 
type Match m = Tree (MatchedData m)Source
Describes the manner in which a RepeatablePattern
successfully consumed InputData
.
data ExtendedRegEx m Source
Constructs an ExtendedRegEx
, by surrounding a Concatenation
with optional Anchor
s.
MkExtendedRegEx  

Eq m => Eq (ExtendedRegEx m)  
(ShortcutExpander m, ShortcutExpander m, Eq m, Read m) => Read (ExtendedRegEx m)  
Show m => Show (ExtendedRegEx m)  
NFData m => NFData (ExtendedRegEx m)  
SelfValidator (ExtendedRegEx m)  
Consumer (ExtendedRegEx m) 
Defines either a simple Meta
, which can match exactly one datum, or a set of Alternatives
, each of which is recursively defined above, as an ExtendedRegEx
.
Require (Meta m)  Describes a requirement for a simple scalar datum of the polymorphic type. 
CaptureGroup (Alternatives m)  A subexpression containing a selection of recursively defined alternatives, thus forming a treestructure. 
type Result m = (ExternalMatch m, Maybe (MatchList m), ExternalMatch m)Source
Captures the list of inputdata consumed by the Concatenation
, bracketed by any dataprefix or datasuffix.
Constants
alternativeExtendedRegExSeparatorToken :: CharSource
The token used to separate alternative ExtendedRegEx
s, when in the String
form.
captureGroupDelimiters :: (Char, Char)Source
The delimiters of Alternatives
, when in the String
form.
Functions
dock :: Transformation mSource
Drop Anchor
s at both bow & stern of the specified ExtendedRegEx
.
captureGroup :: [ExtendedRegEx m] > Pattern mSource
Conveniencefunction to build a CaptureGroup
from a list of alternative ExtendedRegEx
s.
:: DataLength  The offset by which to shift the position into the inputdata at which a each listed match occurred. 
> MatchList m  The list of matchstructures, each of whose offsets are to be shifted. 
> MatchList m 
Shifts the offsets of all the MatchedData
contained in the specified MatchList
.
simply :: Meta m > RepeatablePattern mSource
Conveniencefunction, to build a RepeatablePattern
from an unrepeated instance of the specified Meta
datum.
:: (Concatenation m > Concatenation m)  The function used to transform the data behind the constructor. 
> Transformation m 
Similar to fmap
, but operates on Concatenation
, rather than just a
.
Accessors (Deconstructors)
deconstructAlternatives :: Alternatives m > [ExtendedRegEx m]Source
Accessor, to expose the guts.
Operators
:: (Eq m, NFData m, Show m)  
=> InputData m  The input data within which to locate a match. 
> RegExOpts (ExtendedRegEx m)  The matchoptions parameterised by the regex against which to match the input data. 
> Result m 
 Operator's name was chosen to suggest something more than
=~
.  CAVEAT: much more expensive then
=~
: in ghci,Just
can be observed to be printed long before theMatchList
from whichResult
is constructed, as the lazy algorithm finds the first solution, but not yet necessarily the optimal solution, amongstAlternatives
.
:: (Eq m, NFData m, Show m)  
=> InputData m  The input data within which to locate a match. 
> RegExOpts (ExtendedRegEx m)  The matchoptions parameterised by the regex against which to match the input data. 
> Bool 
 Patternmatch operator.
 Identifier & parameterorder follow the lead of Perl's patternmatch operator.
 Considerably more efficient than
+~
, since even though they are both implemented viafindMatch
, the discovery of any solution is sufficient to generate the returnvalue; lazyevaluation avoids the requirement to identify the irrelevant optimal solution.
:: (Eq m, NFData m, Show m)  
=> InputData m  The input data within which to locate a match. 
> RegExOpts (ExtendedRegEx m)  The matchoptions parameterised by the regex against which to match the input data. 
> Bool 
Patternmismatch operator.
(.*) :: RepeatablePattern mSource
 Represents a black hole, which will greedily consume all data.
 CAVEAT: nullary, ie a constant.
(.*?) :: RepeatablePattern mSource
A nongreedy version of .*
.
Predicates
isDefined :: ExtendedRegEx m > BoolSource
True
if there's at least one RepeatablePattern
in the Concatenation
, ie that it's nonnull.
isCaptureGroup :: Pattern m > BoolSource
True
if the Pattern
was constructed via CaptureGroup
.
isSingletonAlternatives :: Alternatives m > BoolSource
Alternatives
can be employed as a simple capturegroup as well as a switch, under which circumstances there's no choice amongst multiple Alternatives
.
Query
externalMatchLength :: ExternalMatch m > DataLengthSource
Returns the length of data consumed by the specified ExternalMatch
.
extractDataFromMatch :: Match m > InputData mSource
extractDataFromMatch' :: Maybe (Match m) > InputData mSource