text-icu-0.6.3.4: Bindings to the ICU library

Portability: GHC
Stability: experimental
Maintainer: bos@serpentine.com

Data.Text.ICU.Regex

Description

Regular expression support for Unicode, implemented as bindings to the International Components for Unicode (ICU) libraries.

The syntax and behaviour of ICU regular expressions are Perl-like. For complete details, see the ICU User Guide entry at http://userguide.icu-project.org/strings/regexp.

Note: The functions in this module are not thread-safe. For thread-safe use, see clone below, or use the pure functions in Data.Text.ICU.

Types

data MatchOption

Options for controlling matching behaviour.

Constructors

CaseInsensitive

Enable case insensitive matching.

Comments

Allow comments and white space within patterns.

DotAll

If set, '.' matches line terminators. Otherwise '.' matching stops at line end.

Literal

If set, treat the entire pattern as a literal string. Metacharacters or escape sequences in the input sequence will be given no special meaning.

The option CaseInsensitive retains its meaning when used in conjunction with this option. Other options become superfluous.

Multiline

Control the behaviour of '$' and '^'. If set, recognize line terminators within the string. Otherwise, match only at the start and end of the input string.

HaskellLines

Haskell-only line endings. When this mode is enabled, only '\n' is recognized as a line ending in the behaviour of '.', '^', and '$'.

UnicodeWord

Unicode word boundaries. If set, '\b' uses the Unicode TR 29 definition of word boundaries.

Warning: Unicode word boundaries are quite different from traditional regular expression word boundaries. See http://unicode.org/reports/tr29/#Word_Boundaries.

ErrorOnUnknownEscapes

Throw an error on unrecognized backslash escapes. If set, fail with an error on patterns that contain backslash-escaped ASCII letters without a known special meaning. If this flag is not set, these escaped letters represent themselves.

WorkLimit Int

Set a processing limit for match operations.

Some patterns, when matching certain strings, can run in exponential time. For practical purposes, the match operation may appear to be in an infinite loop. When a limit is set, a match operation will fail with an error if the limit is exceeded.

The units of the limit are steps of the match engine. Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but will typically be on the order of milliseconds.

By default, the matching time is not limited.

StackLimit Int

Set the amount of heap storage available for use by the match backtracking stack.

ICU uses a backtracking regular expression engine, with the backtrack stack maintained on the heap. This option sets the limit on the amount of memory that can be used for this purpose. A backtracking stack overflow will result in an error from the match operation that caused it.

A limit is desirable because a malicious or poorly designed pattern can use excessive memory, potentially crashing the process. A limit is enabled by default.

data ParseError

Detailed information about parsing errors. Used by ICU parsing engines that parse long rules, patterns, or programs, where the text being parsed is long enough that more information than an ICUError is needed to localize the error.

data Regex

A compiled regular expression.

Regex values are usually constructed using the regex or regex' functions. This type is also an instance of IsString, so if you have the OverloadedStrings language extension enabled, you can construct a Regex by simply writing the pattern in quotes (though this does not allow you to specify any Options).

Functions

Construction

regex :: [MatchOption] -> Text -> IO Regex

Compile a regular expression with the given options. This function throws a ParseError if the pattern is invalid.

The Regex is initialized with empty text to search against.

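regex' :: [MatchOption] -> Text -> IO (Either ParseError Regex)

Compile a regular expression with the given options. If the pattern is invalid, the ParseError is returned as a Left value instead of being thrown. This is the safest function to use when the pattern is constructed at run time.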
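As a minimal sketch of run-time compilation with regex', the following program reports whether a pattern is accepted. The helper name describe, the option values, and the sample patterns are illustrative assumptions and are not part of this module.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Text.ICU.Regex (MatchOption(..), regex')

-- Hypothetical helper: describe whether a pattern compiles.
-- The WorkLimit value is an arbitrary choice for illustration.
describe :: Text -> IO String
describe pat = do
  result <- regex' [CaseInsensitive, WorkLimit 100000] pat
  return $ case result of
    Left err -> "rejected: " ++ show err
    Right _  -> "compiled"

main :: IO ()
main = do
  describe "a+b" >>= putStrLn   -- a well-formed pattern
  describe "a(b" >>= putStrLn   -- an unbalanced parenthesis
```

Handling the Left case explicitly keeps a malformed user-supplied pattern from surfacing as an uncaught exception.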

clone :: Regex -> IO Regex

Make a copy of a compiled regular expression. Cloning a regular expression is faster than opening a second instance from the source form of the expression, and requires less memory.

Note that the current input string and the position of any matched text within it are not cloned; only the pattern itself and the match mode flags are copied.

Cloning can be particularly useful to threaded applications that perform multiple match operations in parallel. Each concurrent RE operation requires its own instance of a Regex.
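The cloning advice above might be applied as in the following sketch, which gives each thread its own copy of a compiled pattern. The helper names matchesOn and spawn, the pattern, and the subjects are assumptions for illustration. The clones are made in the main thread before forking so that the shared Regex is never touched concurrently.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Data.Text (Text)
import Data.Text.ICU.Regex (Regex, clone, find, regex, setText)

-- Hypothetical helper: test one subject against a private Regex.
matchesOn :: Regex -> Text -> IO Bool
matchesOn re subject = do
  setText re subject
  find re 0

-- Hypothetical helper: clone the shared pattern, then match the
-- subject on the clone in a new thread.
spawn :: Regex -> Text -> IO (MVar Bool)
spawn shared subject = do
  private <- clone shared              -- cloned before the fork
  box <- newEmptyMVar
  _ <- forkIO (matchesOn private subject >>= putMVar box)
  return box

main :: IO ()
main = do
  shared <- regex [] "[0-9]+"
  boxes <- mapM (spawn shared) ["abc 123", "no digits"]
  mapM takeMVar boxes >>= print
```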

Managing text to search

setText :: Regex -> Text -> IO ()

Set the subject text string upon which the regular expression will look for matches. This function may be called any number of times, allowing the regular expression pattern to be applied to different strings.

getText :: Regex -> IO (ForeignPtr Word16, I16)

Get the subject text that is currently associated with this regular expression object.
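Because setText may be called repeatedly, a single compiled Regex can be applied to many subjects in turn. The sketch below illustrates that reuse; the helper name keepMatching, the pattern, and the sample lines are assumptions.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (filterM)
import Data.Text (Text)
import Data.Text.ICU.Regex (Regex, find, regex, setText)

-- Hypothetical helper: keep the subjects in which the pattern occurs.
keepMatching :: Regex -> [Text] -> IO [Text]
keepMatching re = filterM $ \subject -> do
  setText re subject   -- swap in the next subject string
  find re 0            -- search it from index 0

main :: IO ()
main = do
  re <- regex [] "ERROR"
  keepMatching re ["ERROR disk full", "INFO ok", "retry after ERROR"] >>= print
```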

Inspection

pattern :: Regex -> Text

Return the source form of the pattern used to construct this regular expression or match.

Searching

find :: Regex -> I16 -> IO Bool

Find the first substring of the input string that matches the pattern.

If the index n (the second argument) is non-negative, the search for a match begins at that index, and any match region is reset.

If n is -1, the search begins at the start of the input region, or at the start of the full string if no region has been specified.

If a match is found, start, end, and group will provide more information regarding the match.

findNext :: Regex -> IO Bool

Find the next pattern match in the input string. Begin searching the input at the location following the end of the previous match, or at the start of the string (or region) if there is no previous match.

If a match is found, start, end, and group will provide more information regarding the match.
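Repeated calls to findNext can be used to walk over every match in a subject. The following sketch collects the offsets of group zero for each match; the helper name allSpans, the pattern, and the subject are assumptions. The offsets are converted from I16 to Int through the Integral instance that I16 is assumed to have in the text versions contemporary with this release.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Text.ICU.Regex (Regex, end_, findNext, regex, setText, start_)

-- Hypothetical helper: collect the (start, end) offsets of every
-- match of the pattern in the subject.
allSpans :: Regex -> Text -> IO [(Int, Int)]
allSpans re subject = do
  setText re subject
  let loop acc = do
        found <- findNext re
        if not found
          then return (reverse acc)
          else do
            s <- start_ re 0   -- group zero is the whole match
            e <- end_ re 0
            loop ((fromIntegral s, fromIntegral e) : acc)
  loop []

main :: IO ()
main = do
  re <- regex [] "[a-z]+"
  allSpans re "ab cd ef" >>= print
```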

Match groups

Capturing groups are numbered starting from zero. Group zero is always the entire matching text. Groups greater than zero contain the text matching each capturing group in a regular expression.

groupCount :: Regex -> IO Int

Return the number of capturing groups in this regular expression's pattern.

start :: Regex -> Int -> IO (Maybe I16)

Returns the index in the input string of the start of the text matched by the specified capture group during the previous match operation. Returns Nothing if the capture group was not part of the last match.

end :: Regex -> Int -> IO (Maybe I16)

Returns the index in the input string of the end of the text matched by the specified capture group during the previous match operation. Returns Nothing if the capture group was not part of the last match.

start_ :: Regex -> Int -> IO I16

Returns the index in the input string of the start of the text matched by the specified capture group during the previous match operation. Returns -1 if the capture group was not part of the last match.

end_ :: Regex -> Int -> IO I16

Returns the index in the input string of the end of the text matched by the specified capture group during the previous match operation. Returns -1 if the capture group was not part of the last match.
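The indices returned by start and end can be used to slice the text of a capture group out of the subject. The sketch below assumes that takeWord16 and dropWord16 from Data.Text.Foreign are available (as in the text versions contemporary with this release); the helper name groupText, the pattern, and the subject are illustrative.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Text.Foreign (dropWord16, takeWord16)
import Data.Text.ICU.Regex (Regex, end, find, groupCount, regex, setText, start)

-- Hypothetical helper: the text of capture group n from the last
-- match, or Nothing if the group did not take part in it.
groupText :: Regex -> Text -> Int -> IO (Maybe Text)
groupText re subject n = do
  ms <- start re n
  me <- end re n
  return $ do
    s <- ms
    e <- me
    Just (takeWord16 (e - s) (dropWord16 s subject))

main :: IO ()
main = do
  re <- regex [] "(\\w+)@(\\w+)"
  let subject = "mail bob@example now" :: Text
  setText re subject
  found <- find re 0
  n <- groupCount re
  if found
    then mapM (groupText re subject) [0 .. n] >>= print
    else putStrLn "no match"
```

The subject must be the same Text that was passed to setText, since the indices are only meaningful relative to it.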