pcre2-2.1.1.1: Regular expressions via the PCRE2 C library (included)
Safe HaskellNone
LanguageHaskell2010

Text.Regex.Pcre2

Synopsis

Matching and substitution

Introduction

Expand

Atop the low-level binding to the C API, we present a high-level interface to add regular expressions to Haskell programs.

All input and output strings are strict Text, which maps directly to how PCRE2 operates on strings of 16-bit-wide code units.

The C API requires pattern strings to be compiled and the compiled patterns to be executed on subject strings in discrete steps. We hide this procedure, accepting pattern and subject as arguments in a single function, essentially:

pattern -> subject -> result

The implementation guarantees that, when partially applied to pattern but not subject, the resulting function will close on the underlying compiled object and reuse it for every subject it is subsequently applied to.

Likewise, we do not require the user to know whether a PCRE2 option is to be applied at pattern compile time or match time. Instead we fold all possible options into a single datatype, Option. Most functions have vanilla and configurable variants; the latter have "Opt" in the name and accept a value of this type.

Similar to how head :: [a] -> a sacrifices totality for type simplicity, we represent user errors as imprecise exceptions. Unlike with head, these exceptions are typed (as SomePcre2Exceptions); moreover, we offer Template Haskell facilities that can intercept some of these errors before the program is run. (Failure to match is not considered a user error and is represented by empty; see below.)

There's more than one way to do it with this library. The choices between functions and traversals, poly-kinded Captures and plain lists, string literals and quasi-quotations, quasi-quoted expressions and quasi-quoted patterns...these are left to the user. She will observe that advanced features' extra safety, power, and convenience entail additional language extensions, cognitive overhead, and (for lenses) library dependencies, so it's really a matter of finding the best trade-offs for her case.

Definitions

Expand
Pattern
The string defining a regular expression. Refer to syntax here.
Subject
The string the compiled regular expression is executed on.
Regex
A function of the form Text -> result, where the argument is the subject. It is "compiled" via partial application as discussed above. (Lens users: A regex has the more abstract form Traversal` Text result, but the concept is the same.)
Capture (or capture group)
Any substrings of the subject matched by the pattern, meaning the whole pattern and any parenthesized groupings. The PCRE2 docs do not refer to the former as a "capture"; however it is accessed the same way in the C API, just with index 0, so we will consider it the 0th capture for consistency. Parenthesized captures are implicitly numbered from 1.
Unset capture
A capture considered unset as distinct from empty. This can arise from matching the pattern (a)? to an empty subject—the 0th capture will be set as empty, but the 1st will be unset altogether. We represent both as empty Text for simplicity. See below for discussions about how unset captures may be detected or substituted using this library.
Named capture
A parenthesized capture can be named like this: (?<foo>...). Whether they have names or not, captures are always numbered as described above.

Performance

Expand

Each API function is designed such that, when a regex is obtained, the underlying C data generated from the pattern and any options is reused for that regex's lifetime. Care should be taken that the same regex is not recreated ex nihilo and discarded for each new subject:

isEmptyOrHas2Digits :: Text -> Bool
isEmptyOrHas2Digits s = Text.null s || matches "\\d{2}" s -- bad, fully applied

Instead, store it in a partially applied state:

isEmptyOrHas2Digits = (||) <$> Text.null <*> matches "\\d{2}" -- OK but abstruse

When in doubt, always create regexes as top-level values:

has2Digits :: Text -> Bool
has2Digits = matches "\\d{2}"

isEmptyOrHas2Digits s = Text.null s || has2Digits s -- good

Note: Template Haskell regexes are immune from this problem and may be freely inlined; see below.

Also of note is the optimization that, for each capture that's more than half the length of the subject, a zero-copy Text is produced in constant time and space. This can yield a large performance boost in many cases, for example when splitting lines into key-value pairs as in the teaser. A downside, however, is that retaining these slices in memory will carry the overhead of the dead portions of the subject (still guaranteed to be less than the slices in total size).

Handling results and errors

Expand

In contrast to other APIs where there are separate functions to request single versus global matching, we accomplish this (since 2.0.0) in a unified fashion using the Alternative typeclass. Typically the user will choose from two instances, Maybe and []:

b2 :: (Alternative f) => Text -> f Text
b2 = match "b.."

-- Zero or one match
findB2 :: Text -> Maybe Text
findB2 = b2

-- Zero or more matches
findAllB2s :: Text -> [Text]
findAllB2s = b2

Other instances exist for niche uses, notably STM, that of optparse-applicative, and those of parser combinator libraries such as megaparsec.

By contrast, user errors are thrown purely. If a user error is to be caught, it must be at the site where the match or substitution results are evaluated. As a particular consequence, pattern compile errors are deferred to match sites.

>>> broken = match "*"
>>> :t broken
broken :: Alternative f => Text -> f Text
>>> broken "foo"
*** Exception: pcre2_compile: quantifier does not follow a repeatable item
                    *
                    ^

evaluate comes in handy to force results into the IO monad in order to catch errors reliably:

>>> :set -XTypeApplications
>>> handle @SomePcre2Exception (\_ -> return Nothing) $ evaluate $ broken "foo"
Nothing

Basic matching functions

match :: Alternative f => Text -> Text -> f Text Source #

Match a pattern to a subject and return the portion(s) that matched in an Alternative, or empty if no match.

Since: 2.0.0

matchOpt :: Alternative f => Option -> Text -> Text -> f Text Source #

matchOpt mempty = match

Since: 2.0.0

matches :: Text -> Text -> Bool Source #

Does the pattern match the subject at least once?

matchesOpt :: Option -> Text -> Text -> Bool Source #

matchesOpt mempty = matches

captures :: Alternative f => Text -> Text -> f (NonEmpty Text) Source #

Match a pattern to a subject and return some non-empty list(s) of captures in an Alternative, or empty if no match. The non-empty list constructor :| serves as a cue to differentiate the 0th capture from the others:

let parseDate = captures "(\\d{4})-(\\d{2})-(\\d{2})"
in case parseDate "submitted 2020-10-20" of
    Just (date :| [y, m, d]) -> ...
    Nothing                  -> putStrLn "didn't match"

Since: 2.0.0

capturesOpt :: Alternative f => Option -> Text -> Text -> f (NonEmpty Text) Source #

capturesOpt mempty = captures

Since: 2.0.0

PCRE2-native substitution

sub Source #

Arguments

:: Text

pattern

-> Text

replacement

-> Text

subject

-> Text

result

Perform at most one substitution. See the docs for the special syntax of replacement.

>>> sub "(\\w+) calling the (\\w+)" "$2 calling the $1" "the pot calling the kettle black"
"the kettle calling the pot black"

gsub :: Text -> Text -> Text -> Text Source #

Perform substitutions globally.

>>> gsub "a" "o" "apples and bananas"
"opples ond bononos"

subOpt :: Option -> Text -> Text -> Text -> Text Source #

subOpt mempty = sub
subOpt SubGlobal = gsub

Lens-powered matching and substitution

To use this portion of the library, there are two prerequisites:

  1. A basic working understanding of optics. Many tutorials exist online, such as this, and videos such as this.
  2. A library providing combinators. For lens newcomers, it is recommended to grab microlens-platform—all of the examples in this library work with it, packed and unpacked are included for working with Text, and it is upwards-compatible with the full lens library.

We expose a set of traversals that focus on matched substrings within a subject. Like the basic functional regexes, they should be "compiled" and memoized, rather than created inline.

_nee :: Traversal' Text Text
_nee = _matchOpt (Caseless <> MatchWord) "nee"

In addition to getting results, they support global substitution through setting; more generally, they can accrete effects while performing replacements.

>>> promptNee = traverseOf (_nee . unpacked) $ \s -> print s >> getLine
>>> promptNee "We are the knights who say...NEE!"
"NEE"
NOO
"We are the knights who say...NOO!"
>>> 

In general these traversals are not law-abiding.

_match :: Text -> Traversal' Text Text Source #

Given a pattern, produce a traversal (0 or more targets) that focuses from a subject to the non-overlapping portions of it that match.

Equivalent to _captures patt . ix 0, but more efficient.

_matchOpt :: Option -> Text -> Traversal' Text Text Source #

_matchOpt mempty = _match

_captures :: Text -> Traversal' Text (NonEmpty Text) Source #

Given a pattern, produce a traversal (0 or more targets) that focuses from a subject to each non-empty list of captures that pattern matches.

Substitution works in the following way: If a capture is set such that the new Text is not equal to the old one, a substitution occurs, otherwise it doesn't. This matters in cases where a capture encloses another capture—notably, all parenthesized captures are enclosed by the 0th.

>>> threeAndMiddle = _captures ". (.) ."
>>> "A A A" & threeAndMiddle .~ "A A A" :| ["B"]
"A B A"
>>> "A A A" & threeAndMiddle .~ "A B A" :| ["A"]
"A B A"

Changing multiple overlapping captures won't do what you want and is unsupported.

Changing an unset capture is unsupported because the PCRE2 match API does not give location info about it. Currently we ignore all such attempts. (Native substitution functions like sub do not have this limitation. See also SubUnknownUnset and SubUnsetEmpty.)

If the list becomes longer for some reason, the extra elements are ignored. If it's shortened, the absent elements are considered to be unchanged.

It's recommended that the list be modified capture-wise, using ix.

let madlibs = _captures "(\\w+) my (\\w+)"

print $ "Well bust my buttons!" &~ do
    zoom madlibs $ do
        ix 1 . _head .= 'd'
        ix 2 %= Text.reverse
    _last .= '?'

-- "Well dust my snottub?"

_capturesOpt :: Option -> Text -> Traversal' Text (NonEmpty Text) Source #

_capturesOpt mempty = _captures

Compile-time validation

Despite whatever virtues, the API thus far has some fragility arising from various scenarios:

  • pattern malformation such as mismatched parentheses (runtime error)
  • out-of-bounds indexing of a capture group list (runtime error)
  • out-of-bounds ixing of a Traversal` target (spurious failure to match)
  • case expression containing a Haskell list pattern of the wrong length (spurious failure to match)
  • regex created and discarded inline (suboptimal performance)
  • precariously many backslashes in a pattern. Matching a literal backslash requires the sequence "\\\\"!

Using a combination of language extensions and pattern introspection features, we provide a Template Haskell API to mitigate these scenarios. To make use of it these must be enabled:

ExtensionRequired forWhen
DataKinds Nats (numbers), Symbols (strings), and other type-level data powering compile-time capture lookupsUsing regex/_regex with a pattern containing parenthesized captures
QuasiQuotes[f|...|] syntaxUsing regex/_regex
TypeApplications @i syntax for supplying type index arguments to applicable functions Using regex/_regex with a pattern containing parenthesized captures; using capture/_capture
ViewPatterns Running code and binding variables in pattern context proper (pattern guards are off-limits for this)Using regex as a Haskell pattern

The inspiration for this portion of the library is Ruby, which supports regular expressions with superior ergonomics.

Quasi-quoters

regex :: QuasiQuoter Source #

As an expression

regex :: (Alternative f) => String -> Text -> f (Captures info)

in the presence of parenthesized captures, or

regex :: (Alternative f) => String -> Text -> f Text

if there are none. In other words, if there is more than the 0th capture, this behaves like captures (except returning an opaque Captures instead of a NonEmpty list), otherwise it behaves like match.

To retrieve an individual capture from a Captures, use capture.

case [regex|(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})|] "submitted 2020-10-20" of
    Just cs ->
        let date = capture @0 cs
            year = read @Int $ Text.unpack $ capture @"y" cs
            ...
forM_ ([regex|\s+$|] line :: Maybe Text) $ \spaces -> error $
    "line has trailing spaces (" ++ show (Text.length spaces) ++ " characters)"

As a pattern

This matches when the regex first matches, whereupon any named captures are bound to variables of the same names.

case "submitted 2020-10-20" of
    [regex|(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})|] ->
        let year = read @Int $ Text.unpack y
            ...

Note that it is not possible to access the 0th capture this way. As a workaround, explicitly capture the whole pattern and name it.

If there are no named captures, this simply acts as a guard.

_regex :: QuasiQuoter Source #

An optical variant of regex/a type-annotated variant of _captures. Can only be used as an expression.

_regex :: String -> Traversal` Text (Captures info)
_regex :: String -> Traversal' Text Text
embeddedNumbers :: Traversal' String Int
embeddedNumbers = packed . [_regex|\d+|] . unpacked . _Show

main :: IO ()
main = putStrLn $ "There are 14 competing standards" & embeddedNumbers %~ (+ 1)

-- There are 15 competing standards

Type-indexed capture groups

data Captures (info :: CapturesInfo) Source #

A wrapper around a list of captures that carries additional type-level information about the number and names of those captures.

This type is only intended to be created by regex/_regex and consumed by capture/_capture, relying on type inference. Specifying the info explicitly in a type signature is not supported—the definition of CapturesInfo is not part of the public API and may change without warning.

After obtaining Captures it's recommended to immediately consume them and transform them into application-level data, to avoid leaking the types.

Instances

Instances details
Show (Captures info) Source #

Since: 2.0.4

Instance details

Defined in Text.Regex.Pcre2.TH

Methods

showsPrec :: Int -> Captures info -> ShowS #

show :: Captures info -> String #

showList :: [Captures info] -> ShowS #

capture :: forall i info num. (CaptNum i info ~ num, KnownNat num) => Captures info -> Text Source #

Safely lookup a capture in a Captures result obtained from a Template Haskell-generated matching function.

The ugly type signature may be interpreted like this: Given some capture group index i and some info about a regex, ensure that index exists and is resolved to the number num at compile time. Then, at runtime, get a capture group (numbered num) from a list of (at least num) captures.

In practice the variable i is specified by type application and the other variables are inferred.

capture @3
capture @"bar"

Specifying a nonexistent number or name will result in a type error.

_capture :: forall i info num. (CaptNum i info ~ num, KnownNat num) => Lens' (Captures info) Text Source #

Like capture but focus from a Captures to a capture.

Options

data Option Source #

A Monoid representing nearly every facility PCRE2 presents for tweaking the behavior of regex compilation and execution.

All library functions that take options have the suffix Opt in their names; for each of them, there's also a non-Opt convenience function that simply has the (unexported) mempty option. For many uses, options won't be needed.

Some options can be enabled by special character sequences in the pattern as an alternative to specifying them as an Option. See Caseless for example.

Most options are exported in Text.Regex.Pcre2. The callout interface is found in Text.Regex.Pcre2.Unsafe.

Documentation is scant here. For more complete, accurate information, including discussions of corner cases arising from specific combinations of options and pattern items, please see the C API documentation.

Constructors

AllowEmptyClass

Make [] not match anything, rather than counting the ] as the first character of the class.

AltBsux

Like AltBsuxLegacy, except with ECMAScript 6 hex literal feature for \u.

AltBsuxLegacy

Behave like ECMAScript 5 for \U, \u, and \x. See AltBsux.

AltCircumflex

Match a ^ after a newline at the end of the subject. Only relevant in multiline mode.

AltVerbNames

Enable backslash escapes in verb names. E.g., (*MARK:L\(O\)L).

Anchored

Equivalent to beginning pattern with ^.

Bsr Bsr

Override what \R matches (default given by defaultBsr).

Caseless

Case-insensitive match. Equivalent to (?i).

DepthLimit Word32

Override maximum depth of nested backtracking (default given by defaultDepthLimit). Equivalent to (*LIMIT_DEPTH=number).

DollarEndOnly

Don't match $ with a newline at the end of the subject.

DotAll

A dot also matches a (single-character) newline. Equivalent to (?s).

EndAnchored

More or less like ending pattern with $.

EscapedCrIsLf

Interpret \r as \n.

Extended

In the pattern, ignore whitespace, and enable comments starting with #. Equivalent to (?x).

ExtendedMore

Like Extended but also ignore spaces and tabs within [].

FirstLine

The match must begin in the first line of the subject.

HeapLimit Word32

Override maximum heap memory (in kibibytes) used to hold backtracking information (default given by defaultHeapLimit). Equivalent to (*LIMIT_HEAP=number).

Literal

Treat the pattern as a literal string.

MatchLimit Word32

Override maximum value of the main matching loop's internal counter (default given by defaultMatchLimit), as a simple CPU throttle. Equivalent to (*LIMIT_MATCH=number).

MatchLine

Only match complete lines. Equivalent to bracketing the pattern with ^(?:pattern)$.

MatchUnsetBackRef

A backreference to an unset capture group matches an empty string.

MatchWord

Only match subjects that have word boundaries at the beginning and end. Equivalent to bracketing the pattern with \b(?:pattern)\b.

MaxPatternLength Word64

Default is maxBound.

Multiline

^ and $ mean "beginning/end of a line" rather than "beginning/end of the subject". Equivalent to (?m).

NeverBackslashC

Do not allow the unsafe \C sequence.

NeverUcp

Don't count Unicode characters in some character classes such as \d. Overrides (*UCP).

Newline Newline

Override what a newline is (default given by defaultNewline). Equivalent to (*CRLF) or similar.

NoAutoCapture

Disable numbered capturing parentheses.

NoAutoPossess

Turn off some optimizations, possibly resulting in some callouts not being called.

NoDotStarAnchor

Turn off an optimization involving .*, possibly resulting in some callouts not being called.

NoStartOptimize

Turn off some optimizations normally performed at the beginning of a pattern.

NotBol

First character of subject is not the beginning of line. Only affects ^.

NotEmpty

The 0th capture doesn't match if it would be empty.

NotEmptyAtStart

The 0th capture doesn't match if it would be empty and at the beginning of the subject.

NotEol

End of subject is not the end of line. Only affects $.

OffsetLimit Word64

Limit how far an unanchored search can advance in the subject.

ParensLimit Word32

Override max depth of nested parentheses (default given by defaultParensLimit).

PartialHard

If the subject ends without finding a complete match, stop trying alternatives and signal a partial match immediately. Currently we do this by throwing a Pcre2Exception but we should do better.

PartialSoft

If the subject ends and all alternatives have been tried, but no complete match is found, signal a partial match. Currently we do this by throwing a Pcre2Exception but we should do better.

SubGlobal

Affects subOpt. Replace all, rather than just the first.

SubLiteral

Affects subOpt. Treat the replacement as a literal string.

SubReplacementOnly

Affects subOpt. Return just the rendered replacement instead of it within the subject. With SubGlobal, all results are concatenated.

SubUnknownUnset

Affects subOpt. References in the replacement to non-existent captures don't error but are treated as unset.

SubUnsetEmpty

Affects subOpt. References in the replacement to unset captures don't error but are treated as empty.

Ucp

Count Unicode characters in some character classes such as \d. Incompatible with NeverUcp.

Ungreedy

Invert the effect of ?. Without it, quantifiers are non-greedy; with it, they are greedy. Equivalent to (?U).

Instances

Instances details
Semigroup Option Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Monoid Option Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

data Bsr Source #

What \R, backslash R, can mean.

Constructors

BsrUnicode

any Unicode line ending sequence

BsrAnyCrlf

\r, \n, or \r\n

Instances

Instances details
Eq Bsr Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Methods

(==) :: Bsr -> Bsr -> Bool #

(/=) :: Bsr -> Bsr -> Bool #

Show Bsr Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Methods

showsPrec :: Int -> Bsr -> ShowS #

show :: Bsr -> String #

showList :: [Bsr] -> ShowS #

data Newline Source #

What's considered a newline.

Constructors

NewlineCr

\r only

NewlineLf

\n only

NewlineCrlf

\r\n only

NewlineAny

any Unicode line ending sequence

NewlineAnyCrlf

any of the above

NewlineNul

binary zero

Instances

Instances details
Eq Newline Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

Methods

(==) :: Newline -> Newline -> Bool #

(/=) :: Newline -> Newline -> Bool #

Show Newline Source # 
Instance details

Defined in Text.Regex.Pcre2.Internal

User errors

data Pcre2Exception Source #

Vanilla PCRE2 exceptions with messages generated by the underlying C library.

data Pcre2CompileException Source #

PCRE2 compile exceptions. Along with a message stating the cause, we show the pattern with a cursor pointing at where the error is (if not after the last character).

PCRE2 build configuration

compiledWidths :: [Int] Source #

Which code widths PCRE2 is compiled to operate on. Can be any combination of 8, 16, and 32. Should be [16] but provided here for completeness.

supportsJit :: Bool Source #

Was PCRE2 built with JIT support?

jitTarget :: Maybe Text Source #

A nice description of the CPU architecture JIT support is compiled for, if any.

linkSize :: Int Source #

Number of bytes used for internal linkage in compiled regexes.

defaultTablesLength :: Int Source #

Size in bytes of PCRE2's built-in character processing tables.

unicodeVersion :: Maybe Text Source #

Unicode version string such as 8.0.0, if Unicode is supported at all.

supportsUnicode :: Bool Source #

Was PCRE2 built with Unicode support?

pcreVersion :: Text Source #

Version of the built-in C library. The versioning scheme is that PCRE legacy is 8.x and PCRE2 is 10.x, so this should be 10.something.