text-icu-0.8.0: Bindings to the ICU library
Copyright(c) 2010 Bryan O'Sullivan
LicenseBSD-style
Maintainerbos@serpentine.com
Stabilityexperimental
PortabilityGHC
Safe HaskellSafe-Inferred
LanguageHaskell98

Data.Text.ICU.Regex

Description

Regular expression support for Unicode, implemented as bindings to the International Components for Unicode (ICU) libraries.

The syntax and behaviour of ICU regular expressions are Perl-like. For complete details, see the ICU User Guide entry at http://userguide.icu-project.org/strings/regexp.

Note: The functions in this module are not thread safe. For thread safe use, see clone below, or use the pure functions in Data.Text.ICU.

Synopsis

Types

data MatchOption Source #

Options for controlling matching behaviour.

Constructors

CaseInsensitive

Enable case insensitive matching.

Comments

Allow comments and white space within patterns.

DotAll

If set, '.' matches line terminators. Otherwise '.' matching stops at line end.

Literal

If set, treat the entire pattern as a literal string. Metacharacters or escape sequences in the input sequence will be given no special meaning.

The option CaseInsensitive retains its meanings on matching when used in conjunction with this option. Other options become superfluous.

Multiline

Control behaviour of '$' and '^'. If set, recognize line terminators within string, Otherwise, match only at start and end of input string.

HaskellLines

Haskell-only line endings. When this mode is enabled, only '\n' is recognized as a line ending in the behavior of '.', '^', and '$'.

UnicodeWord

Unicode word boundaries. If set, '\\b' uses the Unicode TR 29 definition of word boundaries.

Warning: Unicode word boundaries are quite different from traditional regular expression word boundaries. See http://unicode.org/reports/tr29/#Word_Boundaries.

ErrorOnUnknownEscapes

Throw an error on unrecognized backslash escapes. If set, fail with an error on patterns that contain backslash-escaped ASCII letters without a known special meaning. If this flag is not set, these escaped letters represent themselves.

WorkLimit Int

Set a processing limit for match operations.

Some patterns, when matching certain strings, can run in exponential time. For practical purposes, the match operation may appear to be in an infinite loop. When a limit is set a match operation will fail with an error if the limit is exceeded.

The units of the limit are steps of the match engine. Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but will typically be on the order of milliseconds.

By default, the matching time is not limited.

StackLimit Int

Set the amount of heap storage avaliable for use by the match backtracking stack.

ICU uses a backtracking regular expression engine, with the backtrack stack maintained on the heap. This function sets the limit to the amount of memory that can be used for this purpose. A backtracking stack overflow will result in an error from the match operation that caused it.

A limit is desirable because a malicious or poorly designed pattern can use excessive memory, potentially crashing the process. A limit is enabled by default.

Instances

Instances details
Show MatchOption Source # 
Instance details

Defined in Data.Text.ICU.Regex.Internal

Eq MatchOption Source # 
Instance details

Defined in Data.Text.ICU.Regex.Internal

data ParseError Source #

Detailed information about parsing errors. Used by ICU parsing engines that parse long rules, patterns, or programs, where the text being parsed is long enough that more information than an ICUError is needed to localize the error.

data Regex Source #

A compiled regular expression.

Regex values are usually constructed using the regex or regex' functions. This type is also an instance of IsString, so if you have the OverloadedStrings language extension enabled, you can construct a Regex by simply writing the pattern in quotes (though this does not allow you to specify any Options).

Instances

Instances details
Show Regex Source # 
Instance details

Defined in Data.Text.ICU.Regex

Methods

showsPrec :: Int -> Regex -> ShowS #

show :: Regex -> String #

showList :: [Regex] -> ShowS #

Functions

Construction

regex :: [MatchOption] -> Text -> IO Regex Source #

Compile a regular expression with the given options. This function throws a ParseError if the pattern is invalid.

The Regex is initialized with empty text to search against.

regex' :: [MatchOption] -> Text -> IO (Either ParseError Regex) Source #

Compile a regular expression with the given options. This is safest to use when the pattern is constructed at run time.

clone :: Regex -> IO Regex Source #

Make a copy of a compiled regular expression. Cloning a regular expression is faster than opening a second instance from the source form of the expression, and requires less memory.

Note that the current input string and the position of any matched text within it are not cloned; only the pattern itself and and the match mode flags are copied.

Cloning can be particularly useful to threaded applications that perform multiple match operations in parallel. Each concurrent RE operation requires its own instance of a Regex.

Managing text to search

setText :: Regex -> Text -> IO () Source #

Set the subject text string upon which the regular expression will look for matches. This function may be called any number of times, allowing the regular expression pattern to be applied to different strings.

getUTextPtr :: Regex -> IO UTextPtr Source #

Get the subject text that is currently associated with this regular expression object.

Inspection

pattern :: Regex -> Text Source #

Return the source form of the pattern used to construct this regular expression or match.

Searching

find :: Regex -> TextI -> IO Bool Source #

Find the first matching substring of the input string that matches the pattern.

If n is non-negative, the search for a match begins at the specified index, and any match region is reset.

If n is -1, the search begins at the start of the input region, or at the start of the full string if no region has been specified.

If a match is found, start, end, and group will provide more information regarding the match.

findNext :: Regex -> IO Bool Source #

Find the next pattern match in the input string. Begin searching the input at the location following the end of he previous match, or at the start of the string (or region) if there is no previous match.

If a match is found, start, end, and group will provide more information regarding the match.

Match groups

Capturing groups are numbered starting from zero. Group zero is always the entire matching text. Groups greater than zero contain the text matching each capturing group in a regular expression.

groupCount :: Regex -> IO Int Source #

Return the number of capturing groups in this regular expression's pattern.

start :: Regex -> Int -> IO (Maybe TextI) Source #

Returns the index in the input string of the start of the text matched by the specified capture group during the previous match operation. Returns Nothing if the capture group was not part of the last match.

end :: Regex -> Int -> IO (Maybe TextI) Source #

Returns the index in the input string of the end of the text matched by the specified capture group during the previous match operation. Returns Nothing if the capture group was not part of the last match.

start_ :: Regex -> Int -> IO TextI Source #

Returns the index in the input string of the start of the text matched by the specified capture group during the previous match operation. Returns -1 if the capture group was not part of the last match.

end_ :: Regex -> Int -> IO TextI Source #

Returns the index in the input string of the end of the text matched by the specified capture group during the previous match operation. Returns -1 if the capture group was not part of the last match.

Orphan instances

Show Regex Source # 
Instance details

Methods

showsPrec :: Int -> Regex -> ShowS #

show :: Regex -> String #

showList :: [Regex] -> ShowS #