pcre2-1.1.5: Regular expressions via the PCRE2 C library (included)
Safe HaskellNone
LanguageHaskell2010

Text.Regex.Pcre2

Synopsis

Matching and substitution

Introduction

Atop the low-level binding to the C API, we present a high-level interface to add regular expressions to Haskell programs.

All input and output strings are strict Text, which maps directly to how PCRE2 operates on strings of 16-bit-wide code units.

The C API requires pattern strings to be compiled and the compiled patterns to be executed on subject strings in discrete steps. We hide this procedure, accepting pattern and subject as arguments in a single function, essentially:

pattern -> subject -> result

The implementation guarantees that, when partially applied to pattern but not subject, the resulting function will close on the underlying compiled object and reuse it for every subject it is subsequently applied to.

Likewise, we do not require the user to know whether a PCRE2 option is to be applied at pattern compile time or match time. Instead we fold all possible options into a single datatype, Option. Most functions have vanilla and configurable variants; the latter have "Opt" in the name and accept a value of this type.

Similar to how head :: [a] -> a sacrifices totality for type simplicity, we represent user errors as imprecise exceptions. Unlike with head, these exceptions are typed (as SomePcre2Exceptions); moreover, we offer Template Haskell facilities that can intercept some of these errors before the program is run. (Failure to match is not considered a user error and is represented in the types.)

There's more than one way to do it with this library. The choices between functions and traversals, poly-kinded Captures and plain lists, string literals and quasi-quotations, quasi-quoted expressions and quasi-quoted patterns...these are left to the user. She will observe that advanced features' extra safety, power, and convenience entail additional language extensions, cognitive overhead, and (for lenses) library dependencies, so it's really a matter of finding the best trade-offs for her case.

Definitions

Pattern
The string defining a regular expression. Refer to syntax here.
Subject
The string the compiled regular expression is executed on.
Regex
A function of the form Text -> result, where the argument is the subject. It is "compiled" via partial application as discussed above. (Lens users: A regex has the more abstract form Traversal' Text result, but the concept is the same.)
Capture (or capture group)
Any substrings of the subject matched by the pattern, meaning the whole pattern and any parenthesized groupings. The PCRE2 docs do not refer to the former as a "capture"; however it is accessed the same way in the C API, just with index 0, so we will consider it the 0th capture for consistency. Parenthesized captures are implicitly numbered from 1.
Unset capture
A capture considered unset as distinct from empty. This can arise from matching the pattern (a)? to an empty subject—the 0th capture will be set as empty, but the 1st will be unset altogether. We represent both as empty Text for simplicity. See below for discussions about how unset captures may be detected or substituted using this library.
Named capture
A parenthesized capture can be named like this: (?<foo>...). Whether they have names or not, captures are always numbered as described above.

Performance

Each API function is designed such that, when a regex is obtained, the underlying C data generated from the pattern and any options is reused for that regex's lifetime. Care should be taken that the same regex is not recreated ex nihilo and discarded for each new subject:

isEmptyOrHas2Digits :: Text -> Bool
isEmptyOrHas2Digits s = Text.null s || matches "\\d{2}" s -- bad, fully applied

Instead, store it in a partially applied state:

isEmptyOrHas2Digits = (||) <$> Text.null <*> matches "\\d{2}" -- OK but abstruse

When in doubt, always create regexes as top-level values:

has2Digits :: Text -> Bool
has2Digits = matches "\\d{2}"

isEmptyOrHas2Digits s = Text.null s || has2Digits s -- good

Note: Template Haskell regexes are immune from this problem and may be freely inlined; see below.

Handling errors

In a few places we use the Alternative typeclass to optionally return match results, expressing success via pure and failure via empty. Typically the user will choose the instance Maybe, but other useful ones exist, notably STM, that of optparse-applicative, and those of parser combinator libraries such as megaparsec.

By contrast, user errors are thrown purely. If a user error is to be caught, it must be at the site where the match or substitution results are evaluated—in other words, wherever the regex is applied to a subject. Even pattern compile errors are deferred to match sites, due to the way this library employs unsafePerformIO to implement laziness.

>>> broken = match "*"
>>> broken "foo"
*** Exception: pcre2_compile: quantifier does not follow a repeatable item
                    *
                    ^

evaluate comes in handy to force results into the IO monad in order to catch errors reliably:

>>> evaluate (broken "foo") `catch` \(_ :: SomePcre2Exception) -> return Nothing
Nothing

Or simply select IO as the Alternative instance:

>>> :t broken
broken :: Alternative f => Text -> f Text
>>> broken "foo" `catch` \(_ :: SomePcre2Exception) -> return "broken"
"broken"

Basic matching functions

match :: Alternative f => Text -> Text -> f Text Source #

Match a pattern to a subject once and return the portion that matched in an Alternative, or empty if no match.

matchOpt :: Alternative f => Option -> Text -> Text -> f Text Source #

matchOpt mempty = match

matchAll :: Text -> Text -> [Text] Source #

Match a pattern to a subject and lazily return a list of all non-overlapping portions that matched.

Since: 1.1.0

matchAllOpt :: Option -> Text -> Text -> [Text] Source #

matchAllOpt mempty = matchAll

Since: 1.1.0

matches :: Text -> Text -> Bool Source #

Does the pattern match the subject at least once?

matchesOpt :: Option -> Text -> Text -> Bool Source #

matchesOpt mempty = matches

captures :: Text -> Text -> [Text] Source #

Match a pattern to a subject once and return a list of captures, or [] if no match.

capturesOpt :: Option -> Text -> Text -> [Text] Source #

capturesOpt mempty = captures

capturesA :: Alternative f => Text -> Text -> f (NonEmpty Text) Source #

Match a pattern to a subject once and return a non-empty list of captures in an Alternative, or empty if no match. The non-empty list constructor :| serves as a cue to differentiate the 0th capture from the others:

let parseDate = capturesA "(\\d{4})-(\\d{2})-(\\d{2})"
in case parseDate "submitted 2020-10-20" of
    Just (date :| [y, m, d]) -> ...
    Nothing                  -> putStrLn "didn't match"

capturesAOpt :: Alternative f => Option -> Text -> Text -> f (NonEmpty Text) Source #

capturesAOpt mempty = capturesA

Since: 1.1.0

capturesAll :: Text -> Text -> [NonEmpty Text] Source #

Match a pattern to a subject and lazily produce a list of all non-overlapping portions, with all capture groups, that matched.

Since: 1.1.0

capturesAllOpt :: Option -> Text -> Text -> [NonEmpty Text] Source #

capturesAllOpt mempty = capturesAll

Since: 1.1.0

PCRE2-native substitution

sub Source #

Arguments

:: Text

pattern

-> Text

replacement

-> Text

subject

-> Text

result

Perform at most one substitution. See the docs for the special syntax of replacement.

>>> sub "(\\w+) calling the (\\w+)" "$2 calling the $1" "the pot calling the kettle black"
"the kettle calling the pot black"

gsub :: Text -> Text -> Text -> Text Source #

Perform substitutions globally.

>>> gsub "a" "o" "apples and bananas"
"opples ond bononos"

subOpt :: Option -> Text -> Text -> Text -> Text Source #

subOpt mempty = sub
subOpt SubGlobal = gsub

Lens-powered matching and substitution

To use this portion of the library, there are two prerequisites:

  1. A basic working understanding of optics. Many tutorials exist online, such as this, and videos such as this.
  2. A library providing combinators. For lens newcomers, it is recommended to grab microlens-platform—most of the examples in this library work with it, packed and unpacked are included for working with Text, and it is upwards-compatible with the full lens library.

We expose a set of traversals that focus on matched substrings within a subject. Like the basic functional regexes, they should be "compiled" and memoized, rather than created inline.

_nee :: Traversal' Text Text
_nee = _match "(?i)\\bnee\\b"

In addition to getting results, they support global substitution through setting; more generally, they can accrete effects while performing replacements.

>>> promptNee = traverseOf (_nee . unpacked) $ \s -> print s >> getLine
>>> promptNee "We are the knights who say...NEE!"
"NEE"
NOO
"We are the knights who say...NOO!"
>>> 

In general these traversals are not law-abiding.

_match :: Text -> Traversal' Text Text Source #

Given a pattern, produce a traversal (0 or more targets) that focuses from a subject to the portions of it that match.

_match = _captures patt . ix 0

_matchOpt :: Option -> Text -> Traversal' Text Text Source #

_matchOpt mempty = _match

_captures :: Text -> Traversal' Text (NonEmpty Text) Source #

Given a pattern, produce a traversal (0 or more targets) that focuses from a subject to each non-empty list of captures that pattern matches globally.

Substitution works in the following way: If a capture is set such that the new Text is not equal to the old one, a substitution occurs, otherwise it doesn't. This matters in cases where a capture encloses another capture—notably, all parenthesized captures are enclosed by the 0th.

>>> threeAndMiddle = _captures ". (.) ."
>>> "A A A" & threeAndMiddle .~ "A A A" :| ["B"]
"A B A"
>>> "A A A" & threeAndMiddle .~ "A B A" :| ["A"]
"A B A"

Changing multiple overlapping captures won't do what you want and is unsupported.

Changing an unset capture is unsupported because the PCRE2 match API does not give location info about it. Currently we ignore all such attempts. (Native substitution functions like sub do not have this limitation. See also SubUnknownUnset and SubUnsetEmpty.)

If the list becomes longer for some reason, the extra elements are ignored. If it's shortened, the absent elements are considered to be unchanged.

It's recommended that the list be modified capture-wise, using ix.

let madlibs = _captures "(\\w+) my (\\w+)"

print $ "Well bust my buttons!" &~ do
    zoom madlibs $ do
        ix 1 . _head .= 'd'
        ix 2 %= Text.reverse
    _last .= '?'

-- "Well dust my snottub?"

_capturesOpt :: Option -> Text -> Traversal' Text (NonEmpty Text) Source #

_capturesOpt mempty = _captures

Compile-time validation

Despite whatever virtues, the API thus far has some fragility arising from various scenarios:

  • pattern malformation such as mismatched parentheses (runtime error)
  • out-of-bounds indexing of a capture group list (runtime error)
  • out-of-bounds ixing of a Traversal' target (spurious failure to match)
  • case expression containing a Haskell list pattern of the wrong length (spurious failure to match)
  • regex created and discarded inline (suboptimal performance)
  • precariously many backslashes in a pattern. Matching a literal backslash requires the sequence "\\\\"!

Using a combination of language extensions and pattern introspection features, we provide a Template Haskell API to mitigate these scenarios. To make use of it these must be enabled:

ExtensionRequired forWhen
DataKinds Nats (numbers), Symbols (strings), and other type-level data powering compile-time capture lookupsUsing regex/_regex with a pattern containing parenthesized captures
QuasiQuotes[f|...|] syntaxAlways
TypeApplications @i syntax for supplying type index arguments to applicable functionsUsing capture/_capture
ViewPatterns Running code and binding variables in pattern context proper (pattern guards are off-limits for this)Using regex as a Haskell pattern

The inspiration for this portion of the library is Ruby, which supports regular expressions with superior ergonomics.

Quasi-quoters

regex :: QuasiQuoter Source #

As an expression

regex :: (Alternative f) => String -> Text -> f (Captures info)

in the presence of parenthesized captures, or

regex :: (Alternative f) => String -> Text -> f Text

if there are none. In other words, if there is more than the 0th capture, this behaves like capturesA (except returning an opaque Captures instead of a list), otherwise it behaves like match.

To retrieve an individual capture from a Captures, use capture.

case [regex|(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})|] "submitted 2020-10-20" of
    Just cs ->
        let date = capture @0 cs
            year = read @Int $ Text.unpack $ capture @"y" cs
            ...
forM_ ([regex|\s+$|] line :: Maybe Text) $ \spaces -> error $
    "line has trailing spaces (" ++ show (Text.length spaces) ++ " characters)"

As a pattern

This matches when the regex first matches, whereupon any named captures are bound to variables of the same names.

case "submitted 2020-10-20" of
    [regex|(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})|] ->
        let year = read @Int $ Text.unpack y
            ...

Note that it is not possible to access the 0th capture this way. As a workaround, explicitly capture the whole pattern and name it.

If there are no named captures, this simply acts as a guard.

_regex :: QuasiQuoter Source #

A global, optical variant of regex. Can only be used as an expression.

_regex :: String -> Traversal' Text (Captures info)
_regex :: String -> Traversal' Text Text
import Control.Lens
import Data.Text.Lens

embeddedNumber :: Traversal' String Int
embeddedNumber = packed . [_regex|\d+|] . unpacked . _Show

main :: IO ()
main = putStrLn $ "There are 14 competing standards" & embeddedNumber %~ (+ 1)

-- There are 15 competing standards

Type-indexed capture groups

data Captures (info :: CapturesInfo) Source #

A wrapper around a list of captures that carries additional type-level information about the number and names of those captures.

This type is only intended to be created by regex/_regex and consumed by capture/_capture, relying on type inference. Specifying the info explicitly in a type signature is not supported—the definition of CapturesInfo is not part of the public API and may change without warning.

After obtaining Captures it's recommended to immediately consume them and transform them into application-level data, to avoid leaking the types to top level and having to write signatures. In times of need, "Captures _" may be written with the help of {-# LANGUAGE PartialTypeSignatures #-}.

capture :: forall i info num. (CaptNum i info ~ num, KnownNat num) => Captures info -> Text Source #

Safely lookup a capture in a Captures result obtained from a Template Haskell-generated matching function.

The ugly type signature may be interpreted like this: Given some capture group index i and some info about a regex, ensure that index exists and is resolved to the number num at compile time. Then, at runtime, get a capture group from a list of captures.

In practice the variable i is specified by type application and the other variables are inferred.

capture @3
capture @"bar"

Specifying a nonexistant number or name will result in a type error.

_capture :: forall i info num. (CaptNum i info ~ num, KnownNat num) => Lens' (Captures info) Text Source #

Like capture but focus from a Captures to a capture.

Options

data Option Source #

A Monoid representing nearly every facility PCRE2 presents for tweaking the behavior of regex compilation and execution.

All library functions that take options have the suffix Opt in their names; for each of them, there's also a non-Opt convenience function that simply has the (unexported) mempty option. For many uses, options won't be needed.

Some options can be enabled by special character sequences in the pattern as an alternative to specifying them as an Option. See Caseless for example.

Documentation is scant here. For more complete, accurate information, including discussions of corner cases arising from specific combinations of options and pattern items, please see the C API documentation.

Constructors

AllowEmptyClass

Make [] not match anything, rather than counting the ] as the first character of the class.

AltBsux

Like AltBsuxLegacy, except with ECMAScript 6 hex literal feature for \u.

AltBsuxLegacy

Behave like ECMAScript 5 for \U, \u, and \x. See AltBsux.

AltCircumflex

Match a ^ after a newline at the end of the subject. Only relevant in multiline mode.

AltVerbNames

Enable backslash escapes in verb names. E.g., (*MARK:L\(O\)L).

Anchored

Equivalent to beginning pattern with ^.

AutoCallout

Run callout for every pattern item. Only relevant if a callout is set.

BadEscapeIsLiteral

Do not throw an error for unrecognized or malformed escapes. "This is a dangerous option."

Bsr Bsr

Override what \R matches (default given by defaultBsr).

Caseless

Case-insensitive match. Equivalent to (?i).

DepthLimit Word32

Override maximum depth of nested backtracking (default given by defaultDepthLimit). Equivalent to (*LIMIT_DEPTH=number).

DollarEndOnly

Don't match $ with a newline at the end of the subject.

DotAll

A dot also matches a (single-character) newline. Equivalent to (?s).

DupNames

Allow non-unique capture names.

EndAnchored

More or less like ending pattern with $.

EscapedCrIsLf

Interpret \r as \n.

Extended

In the pattern, ignore whitespace, and enable comments starting with #. Equivalent to (?x).

ExtendedMore

Like Extended but also ignore spaces and tabs within [].

FirstLine

The match must begin in the first line of the subject.

HeapLimit Word32

Override maximum heap memory (in kibibytes) used to hold backtracking information (default given by defaultHeapLimit). Equivalent to (*LIMIT_HEAP=number).

Literal

Treat the pattern as a literal string.

MatchLimit Word32

Override maximum value of the main matching loop's internal counter (default given by defaultMatchLimit), as a simple CPU throttle. Equivalent to (*LIMIT_MATCH=number).

MatchLine

Only match complete lines. Equivalent to bracketing the pattern with ^(?:pattern)$.

MatchUnsetBackRef

A backreference to an unset capture group matches an empty string.

MatchWord

Only match subjects that have word boundaries at the beginning and end. Equivalent to bracketing the pattern with \b(?:pattern)\b.

MaxPatternLength Word64

Default is maxBound.

Multiline

^ and $ mean "beginning/end of a line" rather than "beginning/end of the subject". Equivalent to (?m).

NeverBackslashC

Do not allow the unsafe \C sequence.

NeverUcp

Don't count Unicode characters in some character classes such as \d. Overrides (*UCP).

Newline Newline

Override what a newline is (default given by defaultNewline). Equivalent to (*CRLF) or similar.

NoAutoCapture

Disable numbered capturing parentheses.

NoAutoPossess

Turn off some optimizations, possibly resulting in some callouts not being called.

NoDotStarAnchor

Turn off an optimization involving .*, possibly resulting in some callouts not being called.

NoStartOptimize

Turn off some optimizations normally performed at the beginning of a pattern.

NotBol

First character of subject is not the beginning of line. Only affects ^.

NotEmpty

The 0th capture doesn't match if it would be empty.

NotEmptyAtStart

The 0th capture doesn't match if it would be empty and at the beginning of the subject.

NotEol

End of subject is not the end of line. Only affects $.

OffsetLimit Word64

Limit how far an unanchored search can advance in the subject.

ParensLimit Word32

Override max depth of nested parentheses (default given by defaultParensLimit).

PartialHard

If the subject ends without finding a complete match, stop trying alternatives and signal a partial match immediately. Currently we do this by throwing a Pcre2Exception but we should do better.

PartialSoft

If the subject ends and all alternatives have been tried, but no complete match is found, signal a partial match. Currently we do this by throwing a Pcre2Exception but we should do better.

SubGlobal

Affects subOpt. Replace all, rather than just the first.

SubLiteral

Affects subOpt. Treat the replacement as a literal string.

SubReplacementOnly

Affects subOpt. Return just the rendered replacement instead of it within the subject. With SubGlobal, all results are concatenated.

SubUnknownUnset

Affects subOpt. References in the replacement to non-existent captures don't error but are treated as unset.

SubUnsetEmpty

Affects subOpt. References in the replacement to unset captures don't error but are treated as empty.

Ucp

Count Unicode characters in some character classes such as \d. Incompatible with NeverUcp.

Ungreedy

Invert the effect of ?. Without it, quantifiers are non-greedy; with it, they are greedy. Equivalent to (?U).

UnsafeCallout (CalloutInfo -> IO CalloutResult)

Run the given callout at every callout point (see the docs for more info). Multiples of this option before the rightmost are ignored.

NOTE: The callout is run via unsafePerformIO within pure code!

UnsafeCompileRecGuard (Int -> IO Bool)

Run the given guard on every new descent into a level of parentheses, passing the current depth as argument. Returning False aborts pattern compilation with an exception. Multiples of this option before the rightmost are ignored.

NOTE: Currently (PCRE2 version 10.35) patterns seem to be parsed in two passes, both times triggering the recursion guard. Also, it is triggered at the beginning of the pattern, passing 0. None of this is documented; expect the unexpected in the presence of side effects!

NOTE: The guard is run via unsafePerformIO within pure code!

UnsafeSubCallout (SubCalloutInfo -> IO SubCalloutResult)

Run the given callout on every substitution. This is at most once unless SubGlobal is set. Multiples of this option before the rightmost are ignored.

NOTE: The callout is run via unsafePerformIO within pure code!

Utf

Treat both the pattern and subject as UTF rather than fixed-width 16-bit code units.

Instances

Instances details
Semigroup Option Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Monoid Option Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

data Bsr Source #

What \R, backslash R, can mean.

Constructors

BsrUnicode

any Unicode line ending sequence

BsrAnyCrlf

\r, \n, or \r\n

Instances

Instances details
Eq Bsr Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Methods

(==) :: Bsr -> Bsr -> Bool #

(/=) :: Bsr -> Bsr -> Bool #

Show Bsr Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Methods

showsPrec :: Int -> Bsr -> ShowS #

show :: Bsr -> String #

showList :: [Bsr] -> ShowS #

data Newline Source #

What's considered a newline.

Constructors

NewlineCr

\r only

NewlineLf

\n only

NewlineCrlf

\r\n only

NewlineAny

any Unicode line ending sequence

NewlineAnyCrlf

any of the above

NewlineNul

binary zero

Instances

Instances details
Eq Newline Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Methods

(==) :: Newline -> Newline -> Bool #

(/=) :: Newline -> Newline -> Bool #

Show Newline Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Callout interface

data CalloutInfo Source #

Input for user-defined callouts.

Constructors

CalloutInfo 

Fields

Instances

Instances details
Eq CalloutInfo Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Show CalloutInfo Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

data CalloutIndex Source #

What caused the callout.

Constructors

CalloutNumber Int

Numerical callout.

CalloutName Text

String callout.

CalloutAuto Int Int

The item located at this half-open range of offsets within the pattern. See AutoCallout.

Instances

Instances details
Eq CalloutIndex Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Show CalloutIndex Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

data CalloutResult Source #

Callout functions return one of these values, which dictates what happens next in the match.

Constructors

CalloutProceed

Keep going.

CalloutNoMatchHere

Fail the current capture, but not the whole match. For example, backtracking may occur.

CalloutNoMatch

Fail the whole match.

Instances

Instances details
Eq CalloutResult Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Show CalloutResult Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Substitution callouts

data SubCalloutInfo Source #

Input for user-defined substitution callouts.

Constructors

SubCalloutInfo 

Fields

data SubCalloutResult Source #

Substitution callout functions return one of these values, which dictates what happens next in the substitution.

Constructors

SubCalloutAccept

Succeed, and keep going if in global mode.

SubCalloutSkip

Do not perform this substitution, but keep going if in global mode.

SubCalloutAbort

Do not perform this or any subsequent substitutions.

User errors

data Pcre2Exception Source #

Vanilla PCRE2 exceptions with messages generated by the underlying C library.

data Pcre2CompileException Source #

PCRE2 compile exceptions. Along with a message stating the cause, we show the pattern with a cursor pointing at where the error is (if not after the last character).

PCRE2 build configuration

compiledWidths :: [Int] Source #

Which code widths PCRE2 is compiled to operate on. Can be any combination of 8, 16, and 32. Should be [16] but provided here for completeness.

supportsJit :: Bool Source #

Was PCRE2 built with JIT support?

jitTarget :: Maybe Text Source #

A nice description of the CPU architecture JIT support is compiled for, if any.

linkSize :: Int Source #

Number of bytes used for internal linkage in compiled regexes.

defaultTablesLength :: Int Source #

Size in bytes of PCRE2's built-in character processing tables.

unicodeVersion :: Maybe Text Source #

Unicode version string such as 8.0.0, if Unicode is supported at all.

supportsUnicode :: Bool Source #

Was PCRE2 built with Unicode support?

pcreVersion :: Text Source #

Version of the built-in C library. The versioning scheme is that PCRE legacy is 8.x and PCRE2 is 10.x, so this should be 10.something.