AUTHOR
- Dr. Alistair Ward
DESCRIPTION
- This implementation of extended regexes, generalises the familiar concept of pattern-matching of character-strings, to matching lists composed from an arbitrary data-type.
The polymorphic data, from which the input data-list is composed, need only support
Eq
&Show
. - Because of the unknown stringified form of the underlying polymorphic data,
the regex must be described by a comparatively verbose bracketed & comma-separated list, rather than the traditional String containing Meta-characters.
Each element of this
Concatenation
is aRepeatablePattern
, which describes a permissible match againstInputData
. -
RepeatablePattern
can take one of two forms. In the simplest case, it matches just a single item of the underlying polymorphic type, perhaps literally, though looser specifications also exist:.
matches any input datum;[x, y, z]
matches any ofx
,y
, orz
;[^x, y, z]
matches anything butx
,y
, orz
. To support POSIX EREs,RepeatablePattern
can also be a listAlternatives
, each of which is recursively defined as anExtendedRegEx
, to form a tree-structure. - Each
Pattern
, can optionally be quantified by either a traditional greedy, or a Perl-style non-greedy, suffix, e.g.;[*, +, ?, {n, m}, {n,}, {n}, *?, +?, ??, {n, m}?, {n,}?]
. - For convenience, common specifications can be canned & assigned a single
Char
mnemonic, for subsequent reference. SinceExtendedRegEx
is polymorphic, the set of abbreviations appropriate in the context of the unspecified base-type, must be implemented externally through theShortcutExpander
interface. This permits the use, when the type-parameter isChar
, of Perl-style shortcuts[\d\D\s\S\w\W]
. - The algorithm, is the classic back-tracking one, rather than either a DFA or NFA.
This permits construction of
Result
via which one can discover the deep mapping ofInputData
intoExtendedRegEx
, & provides the flexibility to add the features now expected by modern regex-engines. Since the type-parameter is unknown, & may represent a large object, the exponential space-complexity of creating a DFA may present additional problems. The exponential time-complexity of the back-tracking algorithm is partially tamed by targeting obvious inefficiencies with specific optimisations. - Char-based regexen, traditionally overload the delimiters of a set of
Alternatives
(parentheses), as a request for data-capture. Here, in contrast, allRepeatablePattern
s capture data, & repeated sub-expressions capture a list of data, rather than arbitrarily recording just the last (http://www.opengroup.org/onlinepubs/009695399/functions/regcomp.html) item.
REFERENCES
- http://en.wikipedia.org/wiki/Regular_expression
- http://swtch.com/~rsc/regexp/regexp1.html
- http://docstore.mik.ua/orelly/perl/prog/ch02_04.htm#PERL2-CH-2-SECT-4.1.2
- http://www.macs.hw.ac.uk/~dsg/gph/papers/html/Strategies/strategies.html
- http://hackage.haskell.org/packages/archive/regex-posix/latest/doc/html/Text-Regex-Posix.html
- http://www.haskell.org/haskellwiki/Regular_expressions
- http://www2.research.att.com/~gsf/testregex/re-interpretation.html
CAVEATS
- Because of the definition of mutually recursive data-types, it is difficult to split this annoyingly large module, & preserve compatibility across compilers, but it may be possible to break this cyclic dependency, by defining an interface to which one of the data-types defined here conforms.
- Doesn't implement Back-references, making the definition of the
ExtendedRegEx
context-free. - There's no integration with the type-classes defined in Text.Regex.Base.RegexLike, which assumes
Char
-basedInputData
; though this could be added to a specialised instance. - When
Alternatives
are defined,Result
becomes a tree-like structure. Unless the alternative is a singleton, the specific alternative selected in the solution is typically unknown, & therefore the structure of the branch of this tree is also unknown. This lack of clarity is compounded when theAlternatives
areRepeatable
, since a different one may be selected on each successive repetition. Consequently, the user can't navigate this portion of the structure in a statically defined manner, to acquire the captured data. Despite this, & in contrast to other regex-engines, access to the whole data-structure is available, since it doesn't seem advantage to hide it. The user can then either useextractDataFromMatch
for that element ofResult
, thus aggregating the data from sections of unknown structure, orshow
it, as an aid to debugging.
TODO
- Test parallel-operation, on a 3 or more processor machine.
If
rnf
is less effective thanrwhnf
, then theNFData
context can be removed, reducing the requirements imposed on the type-parametera
. - Try
Data.List.Stream
(stream-fusion), a faster drop-in replacement forData.List
; possibly integrated in GHC-6.12. -
bypassInputDataForLiberalConsumer
is too restrictive. More generally, we can test whether the set of differenta
inInputData
, is a subset of those common to all remaining terms in theExtendedRegEx
. Using this rule, we can inferaaa ... =~ MkExtendedRegEx [a,a+,a?,[ab]{2,3}]
, given compatibleconsumptionBounds
. - Nested repetitions, where nothing has been added to the expression, result in repeated trials of the same expression,
e.g.;
(x{i,}){j,}
results in the same expansion for(i, j) in [(2, 3), (3, 2), (6, 1), (1, 6)]
. The resultingMatchList
may be different, but if the first such trial fails, so will all the remainder. - Should cope with empty sets of
Alternatives
& zero repetitions, neither of which can ever match, but the wider pattern can, e.g.(()|x{0}|y)
. - By removing
RepeatablePattern
fromMatch
, it can be isolated in a new module. This would result in a significant loss of discoverability. - Expand repeated
Bow
withfewest - 1
null matches followed by recursivefindMatch
-call withrepetitions = 1
.
- class ShortcutExpander m where
- expand :: Char -> ExtendedRegEx m
- type Concatenation m = [RepeatablePattern m]
- type ExternalMatch m = Maybe (Match m)
- type InputData m = [m]
- type MatchedData m = (RepeatablePattern m, DataLength, InputData m)
- type MatchList m = [Match m]
- type RepeatablePattern m = Repeatable (Pattern m)
- newtype Alternatives m = MkAlternatives [ExtendedRegEx m]
- type Match m = Tree (MatchedData m)
- data ExtendedRegEx m = MkExtendedRegEx {}
- data Pattern m
- = Require (Meta m)
- | CaptureGroup (Alternatives m)
- type Result m = (ExternalMatch m, Maybe (MatchList m), ExternalMatch m)
- alternativeExtendedRegExSeparatorToken :: Char
- captureGroupDelimiters :: (Char, Char)
- tokens :: String
- dock :: Transformation m
- captureGroup :: [ExtendedRegEx m] -> Pattern m
- shiftMatchList :: DataLength -> MatchList m -> MatchList m
- showsMaybeAnchor :: Maybe Anchor -> String -> String
- simply :: Meta m -> RepeatablePattern m
- transformExtendedRegEx :: (Concatenation m -> Concatenation m) -> Transformation m
- deconstructAlternatives :: Alternatives m -> [ExtendedRegEx m]
- (+~) :: (Eq m, NFData m, Show m) => InputData m -> RegExOpts (ExtendedRegEx m) -> Result m
- (=~) :: (Eq m, NFData m, Show m) => InputData m -> RegExOpts (ExtendedRegEx m) -> Bool
- (/~) :: (Eq m, NFData m, Show m) => InputData m -> RegExOpts (ExtendedRegEx m) -> Bool
- (.*) :: RepeatablePattern m
- (.*?) :: RepeatablePattern m
- isDefined :: ExtendedRegEx m -> Bool
- isCaptureGroup :: Pattern m -> Bool
- isSingletonAlternatives :: Alternatives m -> Bool
- externalMatchLength :: ExternalMatch m -> DataLength
- extractDataFromMatch :: Match m -> InputData m
- extractDataFromMatch' :: Maybe (Match m) -> InputData m
- extractDataFromMatchList :: MatchList m -> InputData m
Type-classes
class ShortcutExpander m whereSource
- Defines the method required to expand a mnemonic into an
ExtendedRegEx
. - CAVEAT: this interface must be declared locally, since it references
ExtendedRegEx
, &ExtendedRegEx
references it.
:: Char | |
-> ExtendedRegEx m | Expand a single- |
Types
Type-synonyms
type Concatenation m = [RepeatablePattern m]Source
Represents the concatenation aspect of ExtendedRegEx
s.
type ExternalMatch m = Maybe (Match m)Source
At the top-level of an ExtendedRegEx
, the lack of an Anchor
allows the ExtendedRegEx
to drift away from the corresponding end of the input-data; this data-gap is captured here.
- The input-data is just a list.
- Whilst typically this list is also a
String
, & could therefore be more efficiently implemented using Data.ByteString, we can't assume that the polymorphic base-type is alwaysChar
.
type MatchedData m = (RepeatablePattern m, DataLength, InputData m)Source
Tag the InputData
with the RepeatablePattern
it matched (which unfortunately confines the definition to this (bloated) module), & the offset from the start of the data;
type MatchList m = [Match m]Source
Describes the manner in which a Concatenation
successfully consumed InputData
.
type RepeatablePattern m = Repeatable (Pattern m)Source
Make Pattern
s, Repeatable
.
Data-types
newtype Alternatives m Source
- Represents the alternation feature of
ExtendedRegEx
s. - One could amalgamate this with
Pattern
, since it seems to exist merely as a peg to hang instance-declarations from.
Eq m => Eq (Alternatives m) | |
(ShortcutExpander m, ShortcutExpander m, Eq m, Read m) => Read (Alternatives m) | |
Show m => Show (Alternatives m) | |
NFData m => NFData (Alternatives m) | |
SelfValidator (Alternatives m) | |
Consumer (Alternatives m) |
type Match m = Tree (MatchedData m)Source
Describes the manner in which a RepeatablePattern
successfully consumed InputData
.
data ExtendedRegEx m Source
Constructs an ExtendedRegEx
, by surrounding a Concatenation
with optional Anchor
s.
MkExtendedRegEx | |
|
Eq m => Eq (ExtendedRegEx m) | |
(ShortcutExpander m, ShortcutExpander m, Eq m, Read m) => Read (ExtendedRegEx m) | |
Show m => Show (ExtendedRegEx m) | |
NFData m => NFData (ExtendedRegEx m) | |
SelfValidator (ExtendedRegEx m) | |
Consumer (ExtendedRegEx m) |
Defines either a simple Meta
, which can match exactly one datum, or a set of Alternatives
, each of which is recursively defined above, as an ExtendedRegEx
.
Require (Meta m) | Describes a requirement for a simple scalar datum of the polymorphic type. |
CaptureGroup (Alternatives m) | A sub-expression containing a selection of recursively defined alternatives, thus forming a tree-structure. |
type Result m = (ExternalMatch m, Maybe (MatchList m), ExternalMatch m)Source
Captures the list of input-data consumed by the Concatenation
, bracketed by any data-prefix or data-suffix.
Constants
alternativeExtendedRegExSeparatorToken :: CharSource
The token used to separate alternative ExtendedRegEx
s, when in the String
-form.
captureGroupDelimiters :: (Char, Char)Source
The delimiters of Alternatives
, when in the String
-form.
Functions
dock :: Transformation mSource
Drop Anchor
s at both bow & stern of the specified ExtendedRegEx
.
captureGroup :: [ExtendedRegEx m] -> Pattern mSource
Convenience-function to build a CaptureGroup
from a list of alternative ExtendedRegEx
s.
:: DataLength | The offset by which to shift the position into the input-data at which a each listed match occurred. |
-> MatchList m | The list of match-structures, each of whose offsets are to be shifted. |
-> MatchList m |
Shifts the offsets of all the MatchedData
contained in the specified MatchList
.
simply :: Meta m -> RepeatablePattern mSource
Convenience-function, to build a RepeatablePattern
from an unrepeated instance of the specified Meta
-datum.
:: (Concatenation m -> Concatenation m) | The function used to transform the data behind the constructor. |
-> Transformation m |
Similar to fmap
, but operates on Concatenation
, rather than just a
.
Accessors (Deconstructors)
deconstructAlternatives :: Alternatives m -> [ExtendedRegEx m]Source
Accessor, to expose the guts.
Operators
:: (Eq m, NFData m, Show m) | |
=> InputData m | The input data within which to locate a match. |
-> RegExOpts (ExtendedRegEx m) | The match-options parameterised by the regex against which to match the input data. |
-> Result m |
- Operator's name was chosen to suggest something more than
=~
. - CAVEAT: much more expensive then
=~
: in ghci,Just
can be observed to be printed long before theMatchList
from whichResult
is constructed, as the lazy algorithm finds the first solution, but not yet necessarily the optimal solution, amongstAlternatives
.
:: (Eq m, NFData m, Show m) | |
=> InputData m | The input data within which to locate a match. |
-> RegExOpts (ExtendedRegEx m) | The match-options parameterised by the regex against which to match the input data. |
-> Bool |
- Pattern-match operator.
- Identifier & parameter-order follow the lead of Perl's pattern-match operator.
- Considerably more efficient than
+~
, since even though they are both implemented viafindMatch
, the discovery of any solution is sufficient to generate the return-value; lazy-evaluation avoids the requirement to identify the irrelevant optimal solution.
:: (Eq m, NFData m, Show m) | |
=> InputData m | The input data within which to locate a match. |
-> RegExOpts (ExtendedRegEx m) | The match-options parameterised by the regex against which to match the input data. |
-> Bool |
Pattern-mismatch operator.
(.*) :: RepeatablePattern mSource
- Represents a black hole, which will greedily consume all data.
- CAVEAT: nullary, ie a constant.
(.*?) :: RepeatablePattern mSource
A non-greedy version of .*
.
Predicates
isDefined :: ExtendedRegEx m -> BoolSource
True
if there's at least one RepeatablePattern
in the Concatenation
, ie that it's non-null.
isCaptureGroup :: Pattern m -> BoolSource
True
if the Pattern
was constructed via CaptureGroup
.
isSingletonAlternatives :: Alternatives m -> BoolSource
Alternatives
can be employed as a simple capture-group as well as a switch, under which circumstances there's no choice amongst multiple Alternatives
.
Query
externalMatchLength :: ExternalMatch m -> DataLengthSource
Returns the length of data consumed by the specified ExternalMatch
.
extractDataFromMatch :: Match m -> InputData mSource
extractDataFromMatch' :: Maybe (Match m) -> InputData mSource