regexdot-0.10.2.1: A /polymorphic/, POSIX, extended regular-expression engine.

RegExDot.RegEx

Contents

Description

AUTHOR
Dr. Alistair Ward
DESCRIPTION
  • This implementation of extended regexes, generalises the familiar concept of pattern-matching of character-strings, to matching lists composed from an arbitrary data-type. The polymorphic data, from which the input data-list is composed, need only support Eq & Show.
  • Because of the unknown stringified form of the underlying polymorphic data, the regex must be described by a comparatively verbose bracketed & comma-separated list, rather than the traditional String containing Meta-characters. Each element of this Concatenation is a RepeatablePattern, which describes a permissible match against InputData.
  • RepeatablePattern can take one of two forms. In the simplest case, it matches just a single item of the underlying polymorphic type, perhaps literally, though looser specifications also exist: . matches any input datum; [x, y, z] matches any of x, y, or z; [^x, y, z] matches anything but x, y, or z. To support POSIX EREs, RepeatablePattern can also be a list Alternatives, each of which is recursively defined as an ExtendedRegEx, to form a tree-structure.
  • Each Pattern, can optionally be quantified by either a traditional greedy, or a Perl-style non-greedy, suffix, e.g.; [*, +, ?, {n, m}, {n,}, {n}, *?, +?, ??, {n, m}?, {n,}?].
  • For convenience, common specifications can be canned & assigned a single Char mnemonic, for subsequent reference. Since ExtendedRegEx is polymorphic, the set of abbreviations appropriate in the context of the unspecified base-type, must be implemented externally through the ShortcutExpander interface. This permits the use, when the type-parameter is Char, of Perl-style shortcuts [\d\D\s\S\w\W].
  • The algorithm, is the classic back-tracking one, rather than either a DFA or NFA. This permits construction of Result via which one can discover the deep mapping of InputData into ExtendedRegEx, & provides the flexibility to add the features now expected by modern regex-engines. Since the type-parameter is unknown, & may represent a large object, the exponential space-complexity of creating a DFA may present additional problems. The exponential time-complexity of the back-tracking algorithm is partially tamed by targeting obvious inefficiencies with specific optimisations.
  • Char-based regexen, traditionally overload the delimiters of a set of Alternatives (parentheses), as a request for data-capture. Here, in contrast, all RepeatablePatterns capture data, & repeated sub-expressions capture a list of data, rather than arbitrarily recording just the last (http://www.opengroup.org/onlinepubs/009695399/functions/regcomp.html) item.
REFERENCES
CAVEATS
  • Because of the definition of mutually recursive data-types, it is difficult to split this annoyingly large module, & preserve compatibility across compilers, but it may be possible to break this cyclic dependency, by defining an interface to which one of the data-types defined here conforms.
  • Doesn't implement Back-references, making the definition of the ExtendedRegEx context-free.
  • There's no integration with the type-classes defined in Text.Regex.Base.RegexLike, which assumes Char-based InputData; though this could be added to a specialised instance.
  • When Alternatives are defined, Result becomes a tree-like structure. Unless the alternative is a singleton, the specific alternative selected in the solution is typically unknown, & therefore the structure of the branch of this tree is also unknown. This lack of clarity is compounded when the Alternatives are Repeatable, since a different one may be selected on each successive repetition. Consequently, the user can't navigate this portion of the structure in a statically defined manner, to acquire the captured data. Despite this, & in contrast to other regex-engines, access to the whole data-structure is available, since it doesn't seem advantage to hide it. The user can then either use extractDataFromMatch for that element of Result, thus aggregating the data from sections of unknown structure, or show it, as an aid to debugging.
TODO
  • Test parallel-operation, on a 3 or more processor machine. If rnf is less effective than rwhnf, then the NFData context can be removed, reducing the requirements imposed on the type-parameter a.
  • Try Data.List.Stream (stream-fusion), a faster drop-in replacement for Data.List; possibly integrated in GHC-6.12.
  • bypassInputDataForLiberalConsumer is too restrictive. More generally, we can test whether the set of different a in InputData, is a subset of those common to all remaining terms in the ExtendedRegEx. Using this rule, we can infer aaa ... =~ MkExtendedRegEx [a,a+,a?,[ab]{2,3}], given compatible consumptionBounds.
  • Nested repetitions, where nothing has been added to the expression, result in repeated trials of the same expression, e.g.; (x{i,}){j,} results in the same expansion for (i, j) in [(2, 3), (3, 2), (6, 1), (1, 6)]. The resulting MatchList may be different, but if the first such trial fails, so will all the remainder.
  • Should cope with empty sets of Alternatives & zero repetitions, neither of which can ever match, but the wider pattern can, e.g. (()|x{0}|y).
  • By removing RepeatablePattern from Match, it can be isolated in a new module. This would result in a significant loss of discoverability.
  • Expand repeated Bow with fewest - 1 null matches followed by recursive findMatch-call with repetitions = 1.

Synopsis

Type-classes

class ShortcutExpander m whereSource

Methods

expandSource

Arguments

:: Char 
-> ExtendedRegEx m

Expand a single-Char mnemonic into the corresponding ExtendedRegEx.

Types

Type-synonyms

type Concatenation m = [RepeatablePattern m]Source

Represents the concatenation aspect of ExtendedRegExs.

type ExternalMatch m = Maybe (Match m)Source

At the top-level of an ExtendedRegEx, the lack of an Anchor allows the ExtendedRegEx to drift away from the corresponding end of the input-data; this data-gap is captured here.

type InputData m = [m]Source

  • The input-data is just a list.
  • Whilst typically this list is also a String, & could therefore be more efficiently implemented using Data.ByteString, we can't assume that the polymorphic base-type is always Char.

type MatchedData m = (RepeatablePattern m, DataLength, InputData m)Source

Tag the InputData with the RepeatablePattern it matched (which unfortunately confines the definition to this (bloated) module), & the offset from the start of the data;

type MatchList m = [Match m]Source

Describes the manner in which a Concatenation successfully consumed InputData.

Data-types

newtype Alternatives m Source

  • Represents the alternation feature of ExtendedRegExs.
  • One could amalgamate this with Pattern, since it seems to exist merely as a peg to hang instance-declarations from.

Constructors

MkAlternatives [ExtendedRegEx m] 

type Match m = Tree (MatchedData m)Source

Describes the manner in which a RepeatablePattern successfully consumed InputData.

data ExtendedRegEx m Source

Constructs an ExtendedRegEx, by surrounding a Concatenation with optional Anchors.

Constructors

MkExtendedRegEx 

Fields

bowAnchor :: Maybe Anchor

An option to anchor the regex to the start of the InputData.

concatenation :: Concatenation m

The sequence of RepeatablePatterns defining the Requirements that the InputData must meet.

sternAnchor :: Maybe Anchor

An option to anchor the regex to the end of the InputData.

data Pattern m Source

Defines either a simple Meta, which can match exactly one datum, or a set of Alternatives, each of which is recursively defined above, as an ExtendedRegEx.

Constructors

Require (Meta m)

Describes a requirement for a simple scalar datum of the polymorphic type.

CaptureGroup (Alternatives m)

A sub-expression containing a selection of recursively defined alternatives, thus forming a tree-structure.

Instances

type Result m = (ExternalMatch m, Maybe (MatchList m), ExternalMatch m)Source

Captures the list of input-data consumed by the Concatenation, bracketed by any data-prefix or data-suffix.

Constants

alternativeExtendedRegExSeparatorToken :: CharSource

The token used to separate alternative ExtendedRegExs, when in the String-form.

captureGroupDelimiters :: (Char, Char)Source

The delimiters of Alternatives, when in the String-form.

tokens :: StringSource

The set of Char to which a specific meaning is attributed, when reading from String.

Functions

dock :: Transformation mSource

Drop Anchors at both bow & stern of the specified ExtendedRegEx.

captureGroup :: [ExtendedRegEx m] -> Pattern mSource

Convenience-function to build a CaptureGroup from a list of alternative ExtendedRegExs.

shiftMatchListSource

Arguments

:: DataLength

The offset by which to shift the position into the input-data at which a each listed match occurred.

-> MatchList m

The list of match-structures, each of whose offsets are to be shifted.

-> MatchList m 

Shifts the offsets of all the MatchedData contained in the specified MatchList.

showsMaybeAnchor :: Maybe Anchor -> String -> StringSource

Shows either the specified Anchor, or a null string where Nothing is specified.

simply :: Meta m -> RepeatablePattern mSource

Convenience-function, to build a RepeatablePattern from an unrepeated instance of the specified Meta-datum.

transformExtendedRegExSource

Arguments

:: (Concatenation m -> Concatenation m)

The function used to transform the data behind the constructor.

-> Transformation m 

Similar to fmap, but operates on Concatenation, rather than just a.

Accessors (Deconstructors)

deconstructAlternatives :: Alternatives m -> [ExtendedRegEx m]Source

Accessor, to expose the guts.

Operators

(+~)Source

Arguments

:: (Eq m, NFData m, Show m) 
=> InputData m

The input data within which to locate a match.

-> RegExOpts (ExtendedRegEx m)

The match-options parameterised by the regex against which to match the input data.

-> Result m 
  • Operator's name was chosen to suggest something more than =~.
  • CAVEAT: much more expensive then =~: in ghci, Just can be observed to be printed long before the MatchList from which Result is constructed, as the lazy algorithm finds the first solution, but not yet necessarily the optimal solution, amongst Alternatives.

(=~)Source

Arguments

:: (Eq m, NFData m, Show m) 
=> InputData m

The input data within which to locate a match.

-> RegExOpts (ExtendedRegEx m)

The match-options parameterised by the regex against which to match the input data.

-> Bool 
  • Pattern-match operator.
  • Identifier & parameter-order follow the lead of Perl's pattern-match operator.
  • Considerably more efficient than +~, since even though they are both implemented via findMatch, the discovery of any solution is sufficient to generate the return-value; lazy-evaluation avoids the requirement to identify the irrelevant optimal solution.

(/~)Source

Arguments

:: (Eq m, NFData m, Show m) 
=> InputData m

The input data within which to locate a match.

-> RegExOpts (ExtendedRegEx m)

The match-options parameterised by the regex against which to match the input data.

-> Bool 

Pattern-mismatch operator.

(.*) :: RepeatablePattern mSource

  • Represents a black hole, which will greedily consume all data.
  • CAVEAT: nullary, ie a constant.

(.*?) :: RepeatablePattern mSource

A non-greedy version of .*.

Predicates

isDefined :: ExtendedRegEx m -> BoolSource

True if there's at least one RepeatablePattern in the Concatenation, ie that it's non-null.

isCaptureGroup :: Pattern m -> BoolSource

True if the Pattern was constructed via CaptureGroup.

isSingletonAlternatives :: Alternatives m -> BoolSource

Alternatives can be employed as a simple capture-group as well as a switch, under which circumstances there's no choice amongst multiple Alternatives.

Query

externalMatchLength :: ExternalMatch m -> DataLengthSource

Returns the length of data consumed by the specified ExternalMatch.

extractDataFromMatch :: Match m -> InputData mSource

Extract & concatenate, the InputData from a Match.

extractDataFromMatch' :: Maybe (Match m) -> InputData mSource

Extract & concatenate, the InputData from a Match; null if it didn't match any.

extractDataFromMatchList :: MatchList m -> InputData mSource

Extract & concatenate, the InputData, from the MatchList.