Safe Haskell | None |
---|---|
Language | Haskell2010 |
Synopsis
- match :: Alternative f => Text -> Text -> f Text
- matchOpt :: Alternative f => Option -> Text -> Text -> f Text
- matches :: Text -> Text -> Bool
- matchesOpt :: Option -> Text -> Text -> Bool
- captures :: Alternative f => Text -> Text -> f (NonEmpty Text)
- capturesOpt :: Alternative f => Option -> Text -> Text -> f (NonEmpty Text)
- sub :: Text -> Text -> Text -> Text
- gsub :: Text -> Text -> Text -> Text
- subOpt :: Option -> Text -> Text -> Text -> Text
- _match :: Text -> Traversal' Text Text
- _matchOpt :: Option -> Text -> Traversal' Text Text
- _captures :: Text -> Traversal' Text (NonEmpty Text)
- _capturesOpt :: Option -> Text -> Traversal' Text (NonEmpty Text)
- regex :: QuasiQuoter
- _regex :: QuasiQuoter
- data Captures (info :: CapturesInfo)
- capture :: forall i info num. (CaptNum i info ~ num, KnownNat num) => Captures info -> Text
- _capture :: forall i info num. (CaptNum i info ~ num, KnownNat num) => Lens' (Captures info) Text
- data Option
- = AllowEmptyClass
- | AltBsux
- | AltBsuxLegacy
- | AltCircumflex
- | AltVerbNames
- | Anchored
- | Bsr Bsr
- | Caseless
- | DepthLimit Word32
- | DollarEndOnly
- | DotAll
- | EndAnchored
- | EscapedCrIsLf
- | Extended
- | ExtendedMore
- | FirstLine
- | HeapLimit Word32
- | Literal
- | MatchLimit Word32
- | MatchLine
- | MatchUnsetBackRef
- | MatchWord
- | MaxPatternLength Word64
- | Multiline
- | NeverBackslashC
- | NeverUcp
- | Newline Newline
- | NoAutoCapture
- | NoAutoPossess
- | NoDotStarAnchor
- | NoStartOptimize
- | NotBol
- | NotEmpty
- | NotEmptyAtStart
- | NotEol
- | OffsetLimit Word64
- | ParensLimit Word32
- | PartialHard
- | PartialSoft
- | SubGlobal
- | SubLiteral
- | SubReplacementOnly
- | SubUnknownUnset
- | SubUnsetEmpty
- | Ucp
- | Ungreedy
- data Bsr
- data Newline
- data SomePcre2Exception
- data Pcre2Exception
- data Pcre2CompileException
- defaultBsr :: Bsr
- compiledWidths :: [Int]
- defaultDepthLimit :: Int
- defaultHeapLimit :: Int
- supportsJit :: Bool
- jitTarget :: Maybe Text
- linkSize :: Int
- defaultMatchLimit :: Int
- defaultNewline :: Newline
- defaultIsNeverBackslashC :: Bool
- defaultParensLimit :: Int
- defaultTablesLength :: Int
- unicodeVersion :: Maybe Text
- supportsUnicode :: Bool
- pcreVersion :: Text
Matching and substitution
Introduction
Atop the low-level binding to the C API, we present a high-level interface to add regular expressions to Haskell programs.
All input and output strings are strict Text
, which maps
directly to how PCRE2 operates on strings of 16-bit-wide code units.
The C API requires pattern strings to be compiled and the compiled patterns to be executed on subject strings in discrete steps. We hide this procedure, accepting pattern and subject as arguments in a single function, essentially:
pattern -> subject -> result
The implementation guarantees that, when partially applied to pattern but not subject, the resulting function will close on the underlying compiled object and reuse it for every subject it is subsequently applied to.
Likewise, we do not require the user to know whether a PCRE2 option is to be
applied at pattern compile time or match time. Instead we fold all possible
options into a single datatype, Option
. Most functions have vanilla and
configurable variants; the latter have "Opt
" in the name and accept a
value of this type.
Similar to how head :: [a] -> a
sacrifices totality for type simplicity,
we represent user errors as imprecise exceptions. Unlike with head
, these
exceptions are typed (as SomePcre2Exception
s); moreover, we offer Template
Haskell facilities that can intercept some of these errors before the
program is run. (Failure to match is not considered a user error and is
represented by empty
; see below.)
There's more than one way to do it with this library. The choices between functions and traversals,
poly-kinded Captures
and plain lists, string literals and
quasi-quotations, quasi-quoted expressions and quasi-quoted patterns...these
are left to the user. She will observe that advanced features' extra
safety, power, and convenience entail additional language extensions,
cognitive overhead, and (for lenses) library dependencies, so it's really a
matter of finding the best trade-offs for her case.
Definitions
- Pattern
- The string defining a regular expression. Refer to syntax here.
- Subject
- The string the compiled regular expression is executed on.
- Regex
- A function of the form
, where the argument is the subject. It is "compiled" via partial application as discussed above. (Lens users: A regex has the more abstract formText
-> result
, but the concept is the same.)Traversal
`Text
result - Capture (or capture group)
- Any substrings of the subject matched by the pattern, meaning the whole pattern and any parenthesized groupings. The PCRE2 docs do not refer to the former as a "capture"; however it is accessed the same way in the C API, just with index 0, so we will consider it the 0th capture for consistency. Parenthesized captures are implicitly numbered from 1.
- Unset capture
- A capture considered unset as distinct from empty. This
can arise from matching the pattern
(a)?
to an empty subject—the 0th capture will be set as empty, but the 1st will be unset altogether. We represent both as emptyText
for simplicity. See below for discussions about how unset captures may be detected or substituted using this library. - Named capture
- A parenthesized capture can be named like this:
(?<foo>...)
. Whether they have names or not, captures are always numbered as described above.
Performance
Each API function is designed such that, when a regex is obtained, the underlying C data generated from the pattern and any options is reused for that regex's lifetime. Care should be taken that the same regex is not recreated ex nihilo and discarded for each new subject:
isEmptyOrHas2Digits :: Text -> Bool isEmptyOrHas2Digits s = Text.null s || matches "\\d{2}" s -- bad, fully applied
Instead, store it in a partially applied state:
isEmptyOrHas2Digits = (||) <$> Text.null <*> matches "\\d{2}" -- OK but abstruse
When in doubt, always create regexes as top-level values:
has2Digits :: Text -> Bool has2Digits = matches "\\d{2}" isEmptyOrHas2Digits s = Text.null s || has2Digits s -- good
Note: Template Haskell regexes are immune from this problem and may be freely inlined; see below.
Also of note is the optimization that, for each capture that's more than
half the length of the subject, a zero-copy Text
is produced in
constant time and space. This can yield a large performance boost in many
cases, for example when splitting lines into key-value pairs as in
the teaser. A downside,
however, is that retaining these slices in memory will carry the overhead
of the dead portions of the subject (still guaranteed to be less than the
slices in total size).
Handling results and errors
In contrast to other
APIs
where there are separate functions to request single versus global matching,
we accomplish this (since 2.0.0) in a unified fashion using the
Alternative
typeclass. Typically the user will
choose from two instances, Maybe
and []
:
b2 :: (Alternative f) => Text -> f Text b2 = match "b.." -- Zero or one match findB2 :: Text -> Maybe Text findB2 = b2 -- Zero or more matches findAllB2s :: Text -> [Text] findAllB2s = b2
Other instances exist for niche uses,
notably STM
,
that of optparse-applicative,
and those of parser combinator libraries such
as megaparsec.
By contrast, user errors are thrown purely. If a user error is to be caught, it must be at the site where the match or substitution results are evaluated. As a particular consequence, pattern compile errors are deferred to match sites.
>>>
broken = match "*"
>>>
:t broken
broken :: Alternative f => Text -> f Text>>>
broken "foo"
*** Exception: pcre2_compile: quantifier does not follow a repeatable item * ^
evaluate
comes in handy to force results into the IO
monad in order to catch errors reliably:
>>>
:set -XTypeApplications
>>>
handle @SomePcre2Exception (\_ -> return Nothing) $ evaluate $ broken "foo"
Nothing
Basic matching functions
match :: Alternative f => Text -> Text -> f Text Source #
Match a pattern to a subject and return the portion(s) that matched in an
Alternative
, or empty
if no match.
Since: 2.0.0
matchOpt :: Alternative f => Option -> Text -> Text -> f Text Source #
matchOpt mempty = match
Since: 2.0.0
captures :: Alternative f => Text -> Text -> f (NonEmpty Text) Source #
Match a pattern to a subject and return some non-empty list(s) of captures
in an Alternative
, or empty
if no match. The non-empty list constructor
:|
serves as a cue to differentiate the 0th capture from the others:
let parseDate = captures "(\\d{4})-(\\d{2})-(\\d{2})" in case parseDate "submitted 2020-10-20" of Just (date :| [y, m, d]) -> ... Nothing -> putStrLn "didn't match"
Since: 2.0.0
capturesOpt :: Alternative f => Option -> Text -> Text -> f (NonEmpty Text) Source #
capturesOpt mempty = captures
Since: 2.0.0
PCRE2-native substitution
Perform at most one substitution. See the docs for the special syntax of replacement.
>>>
sub "(\\w+) calling the (\\w+)" "$2 calling the $1" "the pot calling the kettle black"
"the kettle calling the pot black"
gsub :: Text -> Text -> Text -> Text Source #
Perform substitutions globally.
>>>
gsub "a" "o" "apples and bananas"
"opples ond bononos"
subOpt :: Option -> Text -> Text -> Text -> Text Source #
subOpt mempty = sub subOpt SubGlobal = gsub
Lens-powered matching and substitution
To use this portion of the library, there are two prerequisites:
- A basic working understanding of optics. Many tutorials exist online, such as this, and videos such as this.
- A library providing combinators. For lens newcomers, it is
recommended to grab
microlens-platform—all
of the examples in this library work with it,
packed
andunpacked
are included for working withText
, and it is upwards-compatible with the full lens library.
We expose a set of traversals that focus on matched substrings within a subject. Like the basic functional regexes, they should be "compiled" and memoized, rather than created inline.
_nee :: Traversal' Text Text _nee = _matchOpt (Caseless <> MatchWord) "nee"
In addition to getting results, they support global substitution through setting; more generally, they can accrete effects while performing replacements.
>>>
promptNee = traverseOf (_nee . unpacked) $ \s -> print s >> getLine
>>>
promptNee "We are the knights who say...NEE!"
"NEE" NOO "We are the knights who say...NOO!">>>
In general these traversals are not law-abiding.
_captures :: Text -> Traversal' Text (NonEmpty Text) Source #
Given a pattern, produce a traversal (0 or more targets) that focuses from a subject to each non-empty list of captures that pattern matches.
Substitution works in the following way: If a capture is set such that the
new Text
is not equal to the old one, a substitution occurs, otherwise it
doesn't. This matters in cases where a capture encloses another
capture—notably, all parenthesized captures are enclosed by the 0th.
>>>
threeAndMiddle = _captures ". (.) ."
>>>
"A A A" & threeAndMiddle .~ "A A A" :| ["B"]
"A B A">>>
"A A A" & threeAndMiddle .~ "A B A" :| ["A"]
"A B A"
Changing multiple overlapping captures won't do what you want and is unsupported.
Changing an unset capture is unsupported because the PCRE2 match API does not
give location info about it. Currently we ignore all such attempts. (Native
substitution functions like sub
do not have this limitation. See also
SubUnknownUnset
and SubUnsetEmpty
.)
If the list becomes longer for some reason, the extra elements are ignored. If it's shortened, the absent elements are considered to be unchanged.
It's recommended that the list be modified capture-wise, using ix
.
let madlibs = _captures "(\\w+) my (\\w+)" print $ "Well bust my buttons!" &~ do zoom madlibs $ do ix 1 . _head .= 'd' ix 2 %= Text.reverse _last .= '?' -- "Well dust my snottub?"
_capturesOpt :: Option -> Text -> Traversal' Text (NonEmpty Text) Source #
_capturesOpt mempty = _captures
Compile-time validation
Despite whatever virtues, the API thus far has some fragility arising from various scenarios:
- pattern malformation such as mismatched parentheses (runtime error)
- out-of-bounds indexing of a capture group list (runtime error)
- out-of-bounds
ix
ing of aTraversal
` target (spurious failure to match) - case expression containing a Haskell list pattern of the wrong length (spurious failure to match)
- regex created and discarded inline (suboptimal performance)
- precariously many backslashes in a pattern. Matching a literal
backslash requires the sequence
"\\\\"
!
Using a combination of language extensions and pattern introspection features, we provide a Template Haskell API to mitigate these scenarios. To make use of it these must be enabled:
Extension | Required for | When |
---|---|---|
DataKinds
| Nat s (numbers),
Symbol s (strings), and
other type-level data powering
compile-time capture lookups | Using regex /_regex with
a pattern containing
parenthesized captures
|
QuasiQuotes | [ f| ...|] syntax | Using regex /_regex |
TypeApplications
| @i syntax for supplying type index
arguments to applicable functions
| Using regex /_regex with
a pattern containing
parenthesized captures;
using capture /_capture |
ViewPatterns
| Running code and binding variables in pattern context proper (pattern guards are off-limits for this) | Using regex as a Haskell
pattern
|
The inspiration for this portion of the library is Ruby, which supports regular expressions with superior ergonomics.
Quasi-quoters
regex :: QuasiQuoter Source #
As an expression
regex :: (Alternative
f) => String -> Text -> f (Captures
info)
in the presence of parenthesized captures, or
regex :: (Alternative f) => String -> Text -> f Text
if there are none. In other words, if there is more than the 0th capture,
this behaves like captures
(except returning an opaque Captures
instead
of a NonEmpty
list), otherwise it behaves like match
.
To retrieve an individual capture from a Captures
, use capture
.
case [regex|(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})|] "submitted 2020-10-20" of Just cs -> let date = capture @0 cs year = read @Int $ Text.unpack $ capture @"y" cs ...
forM_ ([regex|\s+$|] line :: Maybe Text) $ \spaces -> error $ "line has trailing spaces (" ++ show (Text.length spaces) ++ " characters)"
As a pattern
This matches when the regex first matches, whereupon any named captures are bound to variables of the same names.
case "submitted 2020-10-20" of [regex|(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})|] -> let year = read @Int $ Text.unpack y ...
Note that it is not possible to access the 0th capture this way. As a workaround, explicitly capture the whole pattern and name it.
If there are no named captures, this simply acts as a guard.
_regex :: QuasiQuoter Source #
An optical variant of regex
/a type-annotated variant of _captures
. Can
only be used as an expression.
_regex :: String ->Traversal
` Text (Captures
info) _regex :: String -> Traversal' Text Text
embeddedNumbers :: Traversal' String Int embeddedNumbers = packed . [_regex|\d+|] . unpacked . _Show main :: IO () main = putStrLn $ "There are 14 competing standards" & embeddedNumbers %~ (+ 1) -- There are 15 competing standards
Type-indexed capture groups
data Captures (info :: CapturesInfo) Source #
A wrapper around a list of captures that carries additional type-level information about the number and names of those captures.
This type is only intended to be created by regex
/_regex
and consumed by
capture
/_capture
, relying on type inference. Specifying the info
explicitly in a type signature is not supported—the definition of
CapturesInfo
is not part of the public API and may change without warning.
After obtaining Captures
it's recommended to immediately consume them and
transform them into application-level data, to avoid leaking the types.
capture :: forall i info num. (CaptNum i info ~ num, KnownNat num) => Captures info -> Text Source #
Safely lookup a capture in a Captures
result obtained from a Template
Haskell-generated matching function.
The ugly type signature may be interpreted like this: Given some capture
group index i
and some info
about a regex, ensure that index exists and
is resolved to the number num
at compile time. Then, at runtime, get a
capture group (numbered num
) from a list of (at least num
) captures.
In practice the variable i
is specified by type application and the other
variables are inferred.
capture @3 capture @"bar"
Specifying a nonexistent number or name will result in a type error.
_capture :: forall i info num. (CaptNum i info ~ num, KnownNat num) => Lens' (Captures info) Text Source #
Options
A Monoid
representing nearly every facility PCRE2 presents for tweaking
the behavior of regex compilation and execution.
All library functions that take options have the suffix Opt
in their names;
for each of them, there's also a non-Opt
convenience function that simply
has the (unexported) mempty
option. For many uses, options won't be
needed.
Some options can be enabled by special character sequences in the pattern as
an alternative to specifying them as an Option
. See Caseless
for
example.
Most options are exported in Text.Regex.Pcre2. The callout interface is found in Text.Regex.Pcre2.Unsafe.
Documentation is scant here. For more complete, accurate information, including discussions of corner cases arising from specific combinations of options and pattern items, please see the C API documentation.
AllowEmptyClass | Make |
AltBsux | Like |
AltBsuxLegacy | Behave like ECMAScript 5 for |
AltCircumflex | Match a |
AltVerbNames | Enable backslash escapes in verb names. E.g.,
|
Anchored | Equivalent to beginning pattern with |
Bsr Bsr | Override what |
Caseless | Case-insensitive match. Equivalent to |
DepthLimit Word32 | Override maximum depth of nested backtracking
(default given by |
DollarEndOnly | Don't match |
DotAll | A dot also matches a (single-character) newline. Equivalent
to |
EndAnchored | More or less like ending pattern with |
EscapedCrIsLf | Interpret |
Extended | In the pattern, ignore whitespace, and enable comments
starting with |
ExtendedMore | Like |
FirstLine | The match must begin in the first line of the subject. |
HeapLimit Word32 | Override maximum heap memory (in kibibytes) used to
hold backtracking information (default given by |
Literal | Treat the pattern as a literal string. |
MatchLimit Word32 | Override maximum value of the main matching loop's
internal counter (default given by |
MatchLine | Only match complete lines. Equivalent to bracketing the
pattern with |
MatchUnsetBackRef | A backreference to an unset capture group matches an empty string. |
MatchWord | Only match subjects that have word boundaries at the
beginning and end. Equivalent to bracketing the pattern with
|
MaxPatternLength Word64 | Default is |
Multiline |
|
NeverBackslashC | Do not allow the unsafe |
NeverUcp | Don't count Unicode characters in some character classes
such as |
Newline Newline | Override what a newline is (default given by
|
NoAutoCapture | Disable numbered capturing parentheses. |
NoAutoPossess | Turn off some optimizations, possibly resulting in some callouts not being called. |
NoDotStarAnchor | Turn off an optimization involving |
NoStartOptimize | Turn off some optimizations normally performed at the beginning of a pattern. |
NotBol | First character of subject is not the beginning of
line. Only affects |
NotEmpty | The 0th capture doesn't match if it would be empty. |
NotEmptyAtStart | The 0th capture doesn't match if it would be empty and at the beginning of the subject. |
NotEol | End of subject is not the end of line. Only
affects |
OffsetLimit Word64 | Limit how far an unanchored search can advance in the subject. |
ParensLimit Word32 | Override max depth of nested parentheses (default
given by |
PartialHard | If the subject ends without finding a complete match,
stop trying alternatives and signal a partial match immediately.
Currently we do this by throwing a |
PartialSoft | If the subject ends and all alternatives have been tried,
but no complete match is found, signal a partial match. Currently we do
this by throwing a |
SubGlobal | Affects |
SubLiteral | Affects |
SubReplacementOnly | Affects |
SubUnknownUnset | Affects |
SubUnsetEmpty | Affects |
Ucp | Count Unicode characters in some character classes such as |
Ungreedy | Invert the effect of |
What \R
, backslash R, can mean.
BsrUnicode | any Unicode line ending sequence |
BsrAnyCrlf |
|
What's considered a newline.
NewlineCr |
|
NewlineLf |
|
NewlineCrlf |
|
NewlineAny | any Unicode line ending sequence |
NewlineAnyCrlf | any of the above |
NewlineNul | binary zero |
User errors
data SomePcre2Exception Source #
The root of the PCRE2 exception hierarchy.
Instances
Show SomePcre2Exception Source # | |
Defined in Text.Regex.Pcre2.Internal showsPrec :: Int -> SomePcre2Exception -> ShowS # show :: SomePcre2Exception -> String # showList :: [SomePcre2Exception] -> ShowS # | |
Exception SomePcre2Exception Source # | |
Defined in Text.Regex.Pcre2.Internal |
data Pcre2Exception Source #
Vanilla PCRE2 exceptions with messages generated by the underlying C library.
Instances
Show Pcre2Exception Source # | |
Defined in Text.Regex.Pcre2.Internal showsPrec :: Int -> Pcre2Exception -> ShowS # show :: Pcre2Exception -> String # showList :: [Pcre2Exception] -> ShowS # | |
Exception Pcre2Exception Source # | |
Defined in Text.Regex.Pcre2.Internal |
data Pcre2CompileException Source #
PCRE2 compile exceptions. Along with a message stating the cause, we show the pattern with a cursor pointing at where the error is (if not after the last character).
Instances
Show Pcre2CompileException Source # | |
Defined in Text.Regex.Pcre2.Internal showsPrec :: Int -> Pcre2CompileException -> ShowS # show :: Pcre2CompileException -> String # showList :: [Pcre2CompileException] -> ShowS # | |
Exception Pcre2CompileException Source # | |
PCRE2 build configuration
defaultBsr :: Bsr Source #
See Bsr
.
compiledWidths :: [Int] Source #
Which code widths PCRE2 is compiled to operate on. Can be any combination
of 8, 16, and 32. Should be [16]
but provided here for completeness.
defaultDepthLimit :: Int Source #
See DepthLimit
.
defaultHeapLimit :: Int Source #
See HeapLimit
.
supportsJit :: Bool Source #
Was PCRE2 built with JIT support?
jitTarget :: Maybe Text Source #
A nice description of the CPU architecture JIT support is compiled for, if any.
defaultMatchLimit :: Int Source #
See MatchLimit
.
defaultNewline :: Newline Source #
See Newline
.
defaultParensLimit :: Int Source #
See ParensLimit
.
defaultTablesLength :: Int Source #
Size in bytes of PCRE2's built-in character processing tables.
unicodeVersion :: Maybe Text Source #
Unicode version string such as 8.0.0
, if Unicode is supported at all.
supportsUnicode :: Bool Source #
Was PCRE2 built with Unicode support?
pcreVersion :: Text Source #
Version of the built-in C library. The versioning scheme is that PCRE
legacy is 8.x and PCRE2 is 10.x, so this should be 10.
something.