regexdot- A polymorphic, POSIX, extended regex-engine.

Safe HaskellNone




Dr. Alistair Ward
  • This implementation of extended regexes, generalises the familiar concept of pattern-matching of character-strings, to matching lists composed from an arbitrary data-type. The polymorphic data, from which the input data-list is composed, need only support Eq.
  • Because of the unknown stringified form of the underlying polymorphic data, the regex must be described by a comparatively verbose bracketed & comma-separated list, rather than the traditional String containing Meta-characters. Each element of this Concatenation is a RepeatablePattern, which describes a permissible match against InputData.
  • RepeatablePattern can take one of two forms. In the simplest case, it matches just a single item of the underlying polymorphic type, perhaps literally, though looser specifications also exist: . matches any input datum; [x, y, z] matches any of x, y, or z; [^x, y, z] matches anything but x, y, or z. To support POSIX EREs, RepeatablePattern can also be a list Alternatives, each of which is recursively defined as an ExtendedRegEx, to form a tree-structure.
  • Each Pattern, can optionally be quantified by either a traditional greedy, or a Perl-style non-greedy, suffix, e.g.; [*, +, ?, {n, m}, {n,}, {n}, *?, +?, ??, {n, m}?, {n,}?].
  • For convenience, common specifications can be canned & assigned a single Char mnemonic, for subsequent reference. Since ExtendedRegEx is polymorphic, the set of abbreviations appropriate in the context of the unspecified base-type, must be implemented externally through the ShortcutExpander interface. This permits the use, when the type-parameter is Char, of Perl-style shortcuts [\d\D\s\S\w\W].
  • The algorithm, is the classic back-tracking one, rather than either a DFA or NFA. This permits construction of Result via which one can discover the deep mapping of InputData into ExtendedRegEx, & provides the flexibility to add the features now expected by modern regex-engines. Since the type-parameter is unknown, & may represent a large object, the exponential space-complexity of creating a DFA may present additional problems. The exponential time-complexity of the back-tracking algorithm is partially tamed by targeting obvious inefficiencies with specific optimisations.
  • Char-based regexen, traditionally overload the delimiters of a set of Alternatives (parentheses), as a request for data-capture. Here, in contrast, all RepeatablePatterns capture data, & repeated sub-expressions capture a list of data, rather than arbitrarily recording just the last ( item.
  • Because of the definition of mutually recursive data-types, it is difficult to split this annoyingly large module, & preserve compatibility across compilers, but it may be possible to break this cyclic dependency, by defining an interface to which one of the data-types defined here conforms.
  • Doesn't implement Back-references, making the definition of the ExtendedRegEx context-free.
  • There's no integration with the type-classes defined in Text.Regex.Base.RegexLike, which assumes Char-based InputData; though this could be added to a specialised instance.
  • When Alternatives are defined, Result becomes a tree-like structure. Unless the alternative is a singleton, the specific alternative selected in the solution is typically unknown, & therefore the structure of the branch of this tree is also unknown. This lack of clarity is compounded when the Alternatives are Repeatable, since a different one may be selected on each successive repetition. Consequently, the user can't navigate this portion of the structure in a statically defined manner, to acquire the captured data. Despite this, & in contrast to other regex-engines, access to the whole data-structure is available, since it doesn't seem advantage to hide it. The user can then either use extractDataFromMatch for that element of Result, thus aggregating the data from sections of unknown structure, or show it, as an aid to debugging.
  • Test parallel-operation, on a 3 or more processor machine. If rnf is less effective than rwhnf, then the NFData context can be removed, reducing the requirements imposed on the type-parameter a.
  • Try Stream (stream-fusion), a faster drop-in replacement for List; possibly integrated in GHC-6.12.
  • bypassInputDataForLiberalConsumer is too restrictive. More generally, we can test whether the set of different a in InputData, is a subset of those common to all remaining terms in the ExtendedRegEx. Using this rule, we can infer "aaa ..." =~ MkExtendedRegEx [a,a+,a?,[ab]{2,3}], given compatible consumptionBounds.
  • Nested repetitions, where nothing has been added to the expression, result in repeated trials of the same expression, e.g.; "(x{i,}){j,}" results in the same expansion for (i, j) in [(2, 3), (3, 2), (6, 1), (1, 6)]. The resulting MatchList may be different, but if the first such trial fails, so will all the remainder.
  • Should cope with empty sets of Alternatives & zero repetitions, neither of which can ever match, but the wider pattern can, e.g. (()|x{0}|y).
  • By removing RepeatablePattern from Match, it can be isolated in a new module. This would result in a significant loss of discoverability.
  • Expand repeated Bow with fewest - 1 null matches followed by recursive findMatch-call with repetitions = 1.



class ShortcutExpander m where Source


expand Source


:: Char 
-> ExtendedRegEx m

Expand a single-Char mnemonic into the corresponding ExtendedRegEx.



type Concatenation m = [RepeatablePattern m] Source

Represents the concatenation aspect of ExtendedRegExs.

type ExternalMatch m = Maybe (Match m) Source

At the top-level of an ExtendedRegEx, the lack of an Anchor allows the ExtendedRegEx to drift away from the corresponding end of the input-data; this data-gap is captured here.

type InputData m = [m] Source

  • The input-data is just a list.
  • Whilst typically this list is also a String, & could therefore be more efficiently implemented using Data.ByteString, we can't assume that the polymorphic base-type is always Char.

type MatchedData m = (RepeatablePattern m, DataLength, InputData m) Source

Tag the InputData with the RepeatablePattern it matched (which unfortunately confines the definition to this (bloated) module), & the offset from the start of the data;

type MatchList m = [Match m] Source

Describes the manner in which a Concatenation successfully consumed InputData.


newtype Alternatives m Source

  • Represents the alternation feature of ExtendedRegExs.
  • One could amalgamate this with Pattern, since it seems to exist merely as a peg to hang instance-declarations from.


MkAlternatives [ExtendedRegEx m] 

type Match m = Tree (MatchedData m) Source

Describes the manner in which a RepeatablePattern successfully consumed InputData.

data ExtendedRegEx m Source

Constructs an ExtendedRegEx, by surrounding a Concatenation with optional Anchors.




bowAnchor :: Maybe Anchor

An option to anchor the regex to the start of the InputData.

concatenation :: Concatenation m

The sequence of RepeatablePatterns defining the Requirements that the InputData must meet.

sternAnchor :: Maybe Anchor

An option to anchor the regex to the end of the InputData.

data Pattern m Source

Defines either a simple Meta, which can match exactly one datum, or a set of Alternatives, each of which is recursively defined above, as an ExtendedRegEx.


Require (Meta m)

Describes a requirement for a simple scalar datum of the polymorphic type.

CaptureGroup (Alternatives m)

A sub-expression containing a selection of recursively defined alternatives, thus forming a tree-structure.

type Result m = (ExternalMatch m, Maybe (MatchList m), ExternalMatch m) Source

Captures the list of input-data consumed by the Concatenation, bracketed by any data-prefix or data-suffix.


alternativeExtendedRegExSeparatorToken :: Char Source

The token used to separate alternative ExtendedRegExs, when in the String-form.

captureGroupDelimiters :: (Char, Char) Source

The delimiters of Alternatives, when in the String-form.

tokens :: String Source

The set of Char to which a specific meaning is attributed, when reading from String.


dock :: Transformation m Source

Drop Anchors at both bow & stern of the specified ExtendedRegEx.

captureGroup :: [ExtendedRegEx m] -> Pattern m Source

Convenience-function to build a CaptureGroup from a list of alternative ExtendedRegExs.

shiftMatchList Source


:: DataLength

The offset by which to shift the position into the input-data at which a each listed match occurred.

-> MatchList m

The list of match-structures, each of whose offsets are to be shifted.

-> MatchList m 

Shifts the offsets of all the MatchedData contained in the specified MatchList.

showsMaybeAnchor :: Maybe Anchor -> String -> String Source

Shows either the specified Anchor, or a null string where Nothing is specified.

simply :: Meta m -> RepeatablePattern m Source

Convenience-function, to build a RepeatablePattern from an unrepeated instance of the specified Meta-datum.

transformExtendedRegEx Source


:: (Concatenation m -> Concatenation m)

The function used to transform the data behind the constructor.

-> Transformation m 

Similar to fmap, but operates on Concatenation, rather than just a.

Accessors (Deconstructors)

deconstructAlternatives :: Alternatives m -> [ExtendedRegEx m] Source

Accessor, to expose the guts.


(+~) infix 4 Source


:: (Eq m, NFData m) 
=> InputData m

The input data within which to locate a match.

-> RegExOpts (ExtendedRegEx m)

The match-options parameterised by the regex against which to match the input data.

-> Result m 
  • Operator's name was chosen to suggest something more than =~.
  • CAVEAT: much more expensive then =~: in ghci, Just can be observed to be printed long before the MatchList from which Result is constructed, as the lazy algorithm finds the first solution, but not yet necessarily the optimal solution, amongst Alternatives.

(=~) infix 4 Source


:: (Eq m, NFData m) 
=> InputData m

The input data within which to locate a match.

-> RegExOpts (ExtendedRegEx m)

The match-options parameterised by the regex against which to match the input data.

-> Bool 
  • Pattern-match operator.
  • Identifier & parameter-order follow the lead of Perl's pattern-match operator.
  • Considerably more efficient than +~, since even though they are both implemented via findMatch, the discovery of any solution is sufficient to generate the return-value; lazy-evaluation avoids the requirement to identify the irrelevant optimal solution.

(/~) infix 4 Source


:: (Eq m, NFData m) 
=> InputData m

The input data within which to locate a match.

-> RegExOpts (ExtendedRegEx m)

The match-options parameterised by the regex against which to match the input data.

-> Bool 

Pattern-mismatch operator.

(.*) :: RepeatablePattern m Source

  • Represents a black hole, which will greedily consume all data.
  • CAVEAT: nullary, i.e. a constant.

(.*?) :: RepeatablePattern m Source

A non-greedy version of .*.


isDefined :: ExtendedRegEx m -> Bool Source

True if there's at least one RepeatablePattern in the Concatenation, i.e. that it's non-null.

isCaptureGroup :: Pattern m -> Bool Source

True if the Pattern was constructed via CaptureGroup.

isSingletonAlternatives :: Alternatives m -> Bool Source

Alternatives can be employed as a simple capture-group as well as a switch, under which circumstances there's no choice amongst multiple Alternatives.


externalMatchLength :: ExternalMatch m -> DataLength Source

Returns the length of data consumed by the specified ExternalMatch.

extractDataFromMatch :: Match m -> InputData m Source

Extract & concatenate, the InputData from a Match.

extractDataFromMatch' :: Maybe (Match m) -> InputData m Source

Extract & concatenate, the InputData from a Match; null if it didn't match any.

extractDataFromMatchList :: MatchList m -> InputData m Source

Extract & concatenate, the InputData, from the MatchList.