weighted-regexp-0.3.0.1: Weighted Regular Expression Matcher

Stabilityexperimental
MaintainerSebastian Fischer <mailto:sebf@informatik.uni-kiel.de>

Text.RegExp

Contents

Description

This library provides a simple and fast regular expression matcher that is implemented in Haskell without binding to external libraries.

There are different ways to implement regular expression matching. Backtracking algorithms are simple but need bookkeeping overhead for nondeterministic search. One can use deterministic finite automata (DFA, see http://swtch.com/~rsc/regexp/regexp1.html) to match regular expressions faster. But for certain regular expressions these DFA are exponentially large which sometimes leads to prohibitive memory requirements.

We use a smart and simple algorithm to generate a DFA from a regular expression and do not generate the DFA completely but on the fly while parsing. This leads to a linear-time deterministic algorithm with constant space requirements. More specifically, the run time is limited by the product of the sizes of the regular expression and the string and the memory is limited by the size of the regular expression.

Synopsis

Documentation

class Semiring w => Weight a b w whereSource

Methods

symWeight :: (a -> w) -> b -> wSource

Instances

Weight c c Bool 
Weight c c Longest 
Num a => Weight c c (Numeric a) 
Weight c (Int, c) Leftmost 
Weight c (Int, c) LeftLong 

Constructing regular expressions

data RegExp c Source

Regular expressions are represented as values of type RegExp c where c is the character type of the underlying alphabet. Values of type RegExp c can be matched against lists of type [c].

fromString :: String -> RegExp CharSource

Parses a regular expression from its string representation. If the OverloadedStrings language extension is enabled, string literals can be used as regular expressions without using fromString explicitly. Implicit conversion is especially useful in combination with functions like =~ that take a value of type RegExp Char as argument.

Here are some examples of supported regular expressions along with an explanation what they mean:

  • a matches the character a
  • [abc] matches any of the characters a, b, or c. It is equivalent to (a|b|c), but | can be used to specify alternatives between arbitrary regular expressions, not only characters.
  • [^abc] matches anything but the characters a, b, or c.
  • \d matches a digit and is equivalent to [0-9]. Moreover, \D matches any non-digit character, \s and \S match space and non-space characters and \w and \W match word characters and non-word characters, that is, \w abbreviates [a-zA-Z_].
  • a? matches the empty word or the character a, a* matches zero or more occurrences of a, and a+ matches one or more a's.
  • . (the dot) matches one arbitrary character.
  • a{4,7} matches four to seven occurrences of a, a{2} matches two.

eps :: RegExp cSource

Matches the empty word. eps has no direct string representation but is used to implement other constructs such as optional components like a?.

char :: Char -> RegExp CharSource

Matches the given character.

sym :: (Eq c, Show c) => c -> RegExp cSource

Matches the given symbol.

psym :: String -> (c -> Bool) -> RegExp cSource

Matches a symbol that satisfies the given predicate.

anySym :: RegExp cSource

Matches an arbitrary symbol.

noMatch :: RegExp cSource

Does not match anything. noMatch is an identity for alt.

alt :: RegExp c -> RegExp c -> RegExp cSource

Matches either of two regular expressions. For example a+b matches either the character a or the character b.

seq_ :: RegExp c -> RegExp c -> RegExp cSource

Matches the sequence of two regular expressions. For example the regular expressions ab matches the word ab.

rep :: RegExp c -> RegExp cSource

Matches zero or more occurrences of the given regular expression. For example a* matches the character a zero or more times.

rep1 :: RegExp c -> RegExp cSource

Matches one or more occurrences of the given regular expression. For example a+ matches the character a one or more times.

opt :: RegExp c -> RegExp cSource

Matches the given regular expression or the empty word. Optional expressions are usually written a? but could also be written (|a), that is, as alternative between eps and a.

brep :: (Int, Int) -> RegExp c -> RegExp cSource

Matches a regular expression a given number of times. For example, the regular expression a{4,7} matches the character a four to seven times. If the minimal and maximal occurences are identical, one can be left out, that is, a{2} matches two occurrences of the character a.

Numerical bounds are implemented via translation into ordinary regular expressions. For example, a{4,7} is translated into aaaa(a(a(a)?)?)?.

perm :: [RegExp c] -> RegExp cSource

Matches a sequence of the given regular expressions in any order. For example, the regular expression

 perm (map char "abc")

has the same meaning as

 abc|acb|bca|bac|cba|cab

and is represented as

 a(bc|cb)|b(ca|ac)|c(ba|ab)

Matching

(=~) :: RegExp Char -> String -> BoolSource

Alias for acceptFull specialized for Strings. Useful in combination with the IsString instance for RegExp Char

acceptFull :: RegExp c -> [c] -> BoolSource

Checks whether a regular expression matches the given word. For example, acceptFull (fromString "b|abc") "b" yields True because the first alternative of b|abc matches the string "b".

acceptPartial :: RegExp c -> [c] -> BoolSource

Checks whether a regular expression matches a subword of the given word. For example, acceptPartial (fromString "b") "abc" yields True because "abc" contains the substring "b".

matchingCount :: Num a => RegExp c -> [c] -> aSource

Computes in how many ways a word can be matched against a regular expression.

fullMatch :: Weight a b w => RegExp a -> [b] -> wSource

Matches a regular expression against a word computing a weight in an arbitrary semiring.

The symbols can have associated weights associated by the symWeight function of the Weight class. This function also allows to adjust the type of the used alphabet such that, for example, positional information can be taken into account by zipping the word with positions.

partialMatch :: Weight a b w => RegExp a -> [b] -> wSource

Matches a regular expression against substrings of a word computing a weight in an arbitrary semiring. Similar to fullMatch the Weight class is used to associate weights to the symbols of the regular expression.