| Safe Haskell | None | 
|---|---|
| Language | Haskell2010 | 
RegExDot.RegEx
Description
AUTHOR- Dr. Alistair Ward
 DESCRIPTION
- This implementation of extended regexes, generalises the familiar concept of pattern-matching of character-strings, to matching lists composed from an arbitrary data-type.
	The polymorphic data, from which the input data-list is composed, need only support 
Eq. - Because of the unknown stringified form of the underlying polymorphic data,
	the regex must be described by a comparatively verbose bracketed & comma-separated list, rather than the traditional String containing Meta-characters.
	Each element of this 
Concatenationis aRepeatablePattern, which describes a permissible match againstInputData. RepeatablePatterncan take one of two forms. In the simplest case, it matches just a single item of the underlying polymorphic type, perhaps literally, though looser specifications also exist:.matches any input datum;[x, y, z]matches any ofx,y, orz;[^x, y, z]matches anything butx,y, orz. To support POSIX EREs,RepeatablePatterncan also be a listAlternatives, each of which is recursively defined as anExtendedRegEx, to form a tree-structure.- Each 
Pattern, can optionally be quantified by either a traditional greedy, or a Perl-style non-greedy, suffix, e.g.;[*, +, ?, {n, m}, {n,}, {n}, *?, +?, ??, {n, m}?, {n,}?]. - For convenience, common specifications can be canned & assigned a single 
Charmnemonic, for subsequent reference. SinceExtendedRegExis polymorphic, the set of abbreviations appropriate in the context of the unspecified base-type, must be implemented externally through theShortcutExpanderinterface. This permits the use, when the type-parameter isChar, of Perl-style shortcuts[\d\D\s\S\w\W]. - The algorithm, is the classic back-tracking one, rather than either a DFA or NFA.
	This permits construction of 
Resultvia which one can discover the deep mapping ofInputDataintoExtendedRegEx, & provides the flexibility to add the features now expected by modern regex-engines. Since the type-parameter is unknown, & may represent a large object, the exponential space-complexity of creating a DFA may present additional problems. The exponential time-complexity of the back-tracking algorithm is partially tamed by targeting obvious inefficiencies with specific optimisations. - Char-based regexen, traditionally overload the delimiters of a set of 
Alternatives(parentheses), as a request for data-capture. Here, in contrast, allRepeatablePatterns capture data, & repeated sub-expressions capture a list of data, rather than arbitrarily recording just the last (https://www.opengroup.org/onlinepubs/009695399/functions/regcomp.html) item. 
REFERENCES
- https://en.wikipedia.org/wiki/Regular_expression
 - https://swtch.com/~rsc/regexp/regexp1.html
 - https://docstore.mik.ua/orelly/perl/prog/ch02_04.htm#PERL2-CH-2-SECT-4.1.2
 - https://www.macs.hw.ac.uk/~dsg/gph/papers/html/Strategies/strategies.html
 - https://hackage.haskell.org/packages/archive/regex-posix/latest/doc/html/Text-Regex-Posix.html
 - https://www.haskell.org/haskellwiki/Regular_expressions
 - https://www2.research.att.com/~gsf/testregex/re-interpretation.html
 
CAVEATS
- Because of the definition of mutually recursive data-types, it is difficult to split this annoyingly large module, & preserve compatibility across compilers, but it may be possible to break this cyclic dependency, by defining an interface to which one of the data-types defined here conforms.
 - Doesn't implement Back-references, making the definition of the 
ExtendedRegExcontext-free. - There's no integration with the type-classes defined in Text.Regex.Base.RegexLike, which assumes 
Char-basedInputData; though this could be added to a specialised instance. - When 
Alternativesare defined,Resultbecomes a tree-like structure. Unless the alternative is a singleton, the specific alternative selected in the solution is typically unknown, & therefore the structure of the branch of this tree is also unknown. This lack of clarity is compounded when theAlternativesareRepeatable, since a different one may be selected on each successive repetition. Consequently, the user can't navigate this portion of the structure in a statically defined manner, to acquire the captured data. Despite this, & in contrast to other regex-engines, access to the whole data-structure is available, since it doesn't seem advantage to hide it. The user can then either useextractDataFromMatchfor that element ofResult, thus aggregating the data from sections of unknown structure, orshowit, as an aid to debugging. 
TODO
- Test parallel-operation, on a 3 or more processor machine.
	If 
rnfis less effective thanrwhnf, then theNFDatacontext can be removed, reducing the requirements imposed on the type-parametera. - Try 
Stream(stream-fusion), a faster drop-in replacement forList; possibly integrated in GHC-6.12. bypassInputDataForLiberalConsumeris too restrictive. More generally, we can test whether the set of differentainInputData, is a subset of those common to all remaining terms in theExtendedRegEx. Using this rule, we can infer"aaa ..." =~ MkExtendedRegEx [a,a+,a?,[ab]{2,3}], given compatibleconsumptionBounds.- Nested repetitions, where nothing has been added to the expression, result in repeated trials of the same expression,
	e.g.; 
"(x{i,}){j,}"results in the same expansion for(i, j) in [(2, 3), (3, 2), (6, 1), (1, 6)]. The resultingMatchListmay be different, but if the first such trial fails, so will all the remainder. - Should cope with empty sets of 
Alternatives& zero repetitions, neither of which can ever match, but the wider pattern can, e.g.(()|x{0}|y). - By removing 
RepeatablePatternfromMatch, it can be isolated in a new module. This would result in a significant loss of discoverability. - Expand repeated 
Bowwithfewest - 1null matches followed by recursivefindMatch-call withrepetitions = 1. 
Synopsis
- class ShortcutExpander m where
- expand :: Char -> ExtendedRegEx m
 
 - type Concatenation m = [RepeatablePattern m]
 - type ExternalMatch m = Maybe (Match m)
 - type InputData m = [m]
 - type MatchedData m = (RepeatablePattern m, DataLength, InputData m)
 - type MatchList m = [Match m]
 - type RepeatablePattern m = Repeatable (Pattern m)
 - newtype Alternatives m = MkAlternatives {}
 - type Match m = Tree (MatchedData m)
 - data ExtendedRegEx m = MkExtendedRegEx {}
 - data Pattern m
- = Require (Meta m)
 - | CaptureGroup (Alternatives m)
 
 - type Result m = (ExternalMatch m, Maybe (MatchList m), ExternalMatch m)
 - alternativeExtendedRegExSeparatorToken :: Char
 - captureGroupDelimiters :: (Char, Char)
 - tokens :: String
 - dock :: Transformation m
 - captureGroup :: [ExtendedRegEx m] -> Pattern m
 - shiftMatchList :: DataLength -> MatchList m -> MatchList m
 - showsMaybeAnchor :: Maybe Anchor -> String -> String
 - simply :: Meta m -> RepeatablePattern m
 - transformExtendedRegEx :: (Concatenation m -> Concatenation m) -> Transformation m
 - (+~) :: (Eq m, NFData m) => InputData m -> RegExOpts (ExtendedRegEx m) -> Result m
 - (=~) :: (Eq m, NFData m) => InputData m -> RegExOpts (ExtendedRegEx m) -> Bool
 - (/~) :: (Eq m, NFData m) => InputData m -> RegExOpts (ExtendedRegEx m) -> Bool
 - (.*) :: RepeatablePattern m
 - (.*?) :: RepeatablePattern m
 - isDefined :: ExtendedRegEx m -> Bool
 - isCaptureGroup :: Pattern m -> Bool
 - isSingletonAlternatives :: Alternatives m -> Bool
 - externalMatchLength :: ExternalMatch m -> DataLength
 - extractDataFromMatch :: Match m -> InputData m
 - extractDataFromMatch' :: Maybe (Match m) -> InputData m
 - extractDataFromMatchList :: MatchList m -> InputData m
 
Type-classes
class ShortcutExpander m where Source #
- Defines the method required to expand a mnemonic into an 
ExtendedRegEx. - CAVEAT: this interface must be declared locally, since it references 
ExtendedRegEx, &ExtendedRegExreferences it. 
Methods
Arguments
| :: Char | |
| -> ExtendedRegEx m | Expand a single-  | 
Instances
| ShortcutExpander Int Source # | |
Defined in RegExDot.InstanceInt  | |
Types
Type-synonyms
type Concatenation m = [RepeatablePattern m] Source #
Represents the concatenation aspect of ExtendedRegExs.
type ExternalMatch m = Maybe (Match m) Source #
At the top-level of an ExtendedRegEx, the lack of an Anchor allows the ExtendedRegEx to drift away from the corresponding end of the input-data; this data-gap is captured here.
type InputData m = [m] Source #
- The input-data is just a list.
 - Whilst typically this list is also a 
String, & could therefore be more efficiently implemented using Data.ByteString, we can't assume that the polymorphic base-type is alwaysChar. 
type MatchedData m = (RepeatablePattern m, DataLength, InputData m) Source #
Tag the InputData with the RepeatablePattern it matched (which unfortunately confines the definition to this (bloated) module), & the offset from the start of the data;
type MatchList m = [Match m] Source #
Describes the manner in which a Concatenation successfully consumed InputData.
type RepeatablePattern m = Repeatable (Pattern m) Source #
Make Patterns, Repeatable.
Data-types
newtype Alternatives m Source #
- Represents the alternation feature of 
ExtendedRegExs. - One could amalgamate this with 
Pattern, since it seems to exist merely as a peg to hang instance-declarations from. 
Constructors
| MkAlternatives | |
Fields  | |
Instances
type Match m = Tree (MatchedData m) Source #
Describes the manner in which a RepeatablePattern successfully consumed InputData.
data ExtendedRegEx m Source #
Constructs an ExtendedRegEx, by surrounding a Concatenation with optional Anchors.
Constructors
| MkExtendedRegEx | |
Fields 
  | |
Instances
Defines either a simple Meta, which can match exactly one datum, or a set of Alternatives, each of which is recursively defined above, as an ExtendedRegEx.
Constructors
| Require (Meta m) | Describes a requirement for a simple scalar datum of the polymorphic type.  | 
| CaptureGroup (Alternatives m) | A sub-expression containing a selection of recursively defined alternatives, thus forming a tree-structure.  | 
Instances
| Eq m => Eq (Pattern m) Source # | |
| (Eq m, ShortcutExpander m, Read m, ShortcutExpander m) => Read (Pattern m) Source # | |
| Show m => Show (Pattern m) Source # | |
| NFData m => NFData (Pattern m) Source # | |
Defined in RegExDot.RegEx  | |
| SelfValidator (Pattern m) Source # | |
| Consumer (Pattern m) Source # | |
Defined in RegExDot.RegEx Methods consumptionProfile :: Pattern m -> ConsumptionProfile Source # starHeight :: Pattern m -> StarHeight Source #  | |
type Result m = (ExternalMatch m, Maybe (MatchList m), ExternalMatch m) Source #
Captures the list of input-data consumed by the Concatenation, bracketed by any data-prefix or data-suffix.
Constants
alternativeExtendedRegExSeparatorToken :: Char Source #
The token used to separate alternative ExtendedRegExs, when in the String-form.
captureGroupDelimiters :: (Char, Char) Source #
The delimiters of Alternatives, when in the String-form.
Functions
dock :: Transformation m Source #
Drop Anchors at both bow & stern of the specified ExtendedRegEx.
captureGroup :: [ExtendedRegEx m] -> Pattern m Source #
Convenience-function to build a CaptureGroup from a list of alternative ExtendedRegExs.
Arguments
| :: DataLength | The offset by which to shift the position into the input-data at which a each listed match occurred.  | 
| -> MatchList m | The list of match-structures, each of whose offsets are to be shifted.  | 
| -> MatchList m | 
Shifts the offsets of all the MatchedData contained in the specified MatchList.
simply :: Meta m -> RepeatablePattern m Source #
Convenience-function, to build a RepeatablePattern from an unrepeated instance of the specified Meta-datum.
transformExtendedRegEx Source #
Arguments
| :: (Concatenation m -> Concatenation m) | The function used to transform the data behind the constructor.  | 
| -> Transformation m | 
Similar to fmap, but operates on Concatenation, rather than just a.
Accessors
Operators
Arguments
| :: (Eq m, NFData m) | |
| => InputData m | The input data within which to locate a match.  | 
| -> RegExOpts (ExtendedRegEx m) | The match-options parameterised by the regex against which to match the input data.  | 
| -> Result m | 
- Operator's name was chosen to suggest something more than 
=~. - CAVEAT: much more expensive then 
=~: in ghci,Justcan be observed to be printed long before theMatchListfrom whichResultis constructed, as the lazy algorithm finds the first solution, but not yet necessarily the optimal solution, amongstAlternatives. 
Arguments
| :: (Eq m, NFData m) | |
| => InputData m | The input data within which to locate a match.  | 
| -> RegExOpts (ExtendedRegEx m) | The match-options parameterised by the regex against which to match the input data.  | 
| -> Bool | 
- Pattern-match operator.
 - Identifier & parameter-order follow the lead of Perl's pattern-match operator.
 - Considerably more efficient than 
+~, since even though they are both implemented viafindMatch, the discovery of any solution is sufficient to generate the return-value; lazy-evaluation avoids the requirement to identify the irrelevant optimal solution. 
Arguments
| :: (Eq m, NFData m) | |
| => InputData m | The input data within which to locate a match.  | 
| -> RegExOpts (ExtendedRegEx m) | The match-options parameterised by the regex against which to match the input data.  | 
| -> Bool | 
Pattern-mismatch operator.
(.*) :: RepeatablePattern m Source #
- Represents a black hole, which will greedily consume all data.
 - CAVEAT: nullary, i.e. a constant.
 
(.*?) :: RepeatablePattern m Source #
A non-greedy version of .*.
Predicates
isDefined :: ExtendedRegEx m -> Bool Source #
True if there's at least one RepeatablePattern in the Concatenation, i.e. that it's non-null.
isCaptureGroup :: Pattern m -> Bool Source #
True if the Pattern was constructed via CaptureGroup.
isSingletonAlternatives :: Alternatives m -> Bool Source #
Alternatives can be employed as a simple capture-group as well as a switch, under which circumstances there's no choice amongst multiple Alternatives.
Query
externalMatchLength :: ExternalMatch m -> DataLength Source #
Returns the length of data consumed by the specified ExternalMatch.
extractDataFromMatch :: Match m -> InputData m Source #