Stability | experimental |
---|---|
Maintainer | Sebastian Fischer <mailto:sebf@informatik.uni-kiel.de> |
This library provides a simple and fast regular expression matcher that is implemented in Haskell without binding to external libraries.
There are different ways to implement regular expression matching. Backtracking algorithms are simple but need bookkeeping overhead for nondeterministic search. One can use deterministic finite automata (DFA, see http://swtch.com/~rsc/regexp/regexp1.html) to match regular expressions faster. But for certain regular expressions these DFA are exponentially large which sometimes leads to prohibitive memory requirements.
We use a smart and simple algorithm to generate a DFA from a regular expression and do not generate the DFA completely but on the fly while parsing. This leads to a linear-time deterministic algorithm with constant space requirements. More specifically, the run time is limited by the product of the sizes of the regular expression and the string and the memory is limited by the size of the regular expression.
- module Data.Semiring
- class Semiring w => Weight a b w where
- symWeight :: (a -> w) -> b -> w
- data RegExp c
- fromString :: String -> RegExp Char
- eps :: RegExp c
- char :: Char -> RegExp Char
- sym :: (Eq c, Show c) => c -> RegExp c
- psym :: String -> (c -> Bool) -> RegExp c
- anySym :: RegExp c
- noMatch :: RegExp c
- alt :: RegExp c -> RegExp c -> RegExp c
- seq_ :: RegExp c -> RegExp c -> RegExp c
- rep :: RegExp c -> RegExp c
- rep1 :: RegExp c -> RegExp c
- opt :: RegExp c -> RegExp c
- brep :: (Int, Int) -> RegExp c -> RegExp c
- perm :: [RegExp c] -> RegExp c
- (=~) :: RegExp Char -> String -> Bool
- acceptFull :: RegExp c -> [c] -> Bool
- acceptPartial :: RegExp c -> [c] -> Bool
- matchingCount :: Num a => RegExp c -> [c] -> a
- fullMatch :: Weight a b w => RegExp a -> [b] -> w
- partialMatch :: Weight a b w => RegExp a -> [b] -> w
Documentation
module Data.Semiring
Constructing regular expressions
Regular expressions are represented as values of type RegExp
c
where c
is the character type of the underlying alphabet. Values
of type RegExp
c
can be matched against lists of type [c]
.
fromString :: String -> RegExp CharSource
Parses a regular expression from its string representation. If the
OverloadedStrings
language extension is enabled, string literals
can be used as regular expressions without using fromString
explicitly. Implicit conversion is especially useful in combination
with functions like =~
that take a value of type RegExp Char
as
argument.
Here are some examples of supported regular expressions along with an explanation what they mean:
-
a
matches the charactera
-
[abc]
matches any of the charactersa
,b
, orc
. It is equivalent to(a|b|c)
, but|
can be used to specify alternatives between arbitrary regular expressions, not only characters. -
[^abc]
matches anything but the charactersa
,b
, orc
. -
\d
matches a digit and is equivalent to[0-9]
. Moreover,\D
matches any non-digit character,\s
and\S
match space and non-space characters and\w
and\W
match word characters and non-word characters, that is,\w
abbreviates[a-zA-Z_]
. -
a?
matches the empty word or the charactera
,a*
matches zero or more occurrences ofa
, anda+
matches one or morea
's. -
.
(the dot) matches one arbitrary character. -
a{4,7}
matches four to seven occurrences ofa
,a{2}
matches two.
Matches the empty word. eps
has no direct string representation
but is used to implement other constructs such as optional
components like a?
.
alt :: RegExp c -> RegExp c -> RegExp cSource
Matches either of two regular expressions. For example a+b
matches either the character a
or the character b
.
seq_ :: RegExp c -> RegExp c -> RegExp cSource
Matches the sequence of two regular expressions. For example the
regular expressions ab
matches the word ab
.
rep :: RegExp c -> RegExp cSource
Matches zero or more occurrences of the given regular
expression. For example a*
matches the character a
zero or
more times.
rep1 :: RegExp c -> RegExp cSource
Matches one or more occurrences of the given regular
expression. For example a+
matches the character a
one or
more times.
opt :: RegExp c -> RegExp cSource
Matches the given regular expression or the empty word. Optional
expressions are usually written a?
but could also be written
(|a)
, that is, as alternative between eps
and a
.
brep :: (Int, Int) -> RegExp c -> RegExp cSource
Matches a regular expression a given number of times. For example,
the regular expression a{4,7}
matches the character a
four to
seven times. If the minimal and maximal occurences are identical,
one can be left out, that is, a{2}
matches two occurrences of the
character a
.
Numerical bounds are implemented via translation into ordinary
regular expressions. For example, a{4,7}
is translated into
aaaa(a(a(a)?)?)?
.
perm :: [RegExp c] -> RegExp cSource
Matches a sequence of the given regular expressions in any order. For example, the regular expression
perm (map char "abc")
has the same meaning as
abc|acb|bca|bac|cba|cab
and is represented as
a(bc|cb)|b(ca|ac)|c(ba|ab)
Matching
(=~) :: RegExp Char -> String -> BoolSource
Alias for acceptFull
specialized for Strings. Useful in combination
with the IsString
instance for RegExp
Char
acceptFull :: RegExp c -> [c] -> BoolSource
Checks whether a regular expression matches the given word. For
example, acceptFull (fromString "b|abc") "b"
yields True
because the first alternative of b|abc
matches the string
"b"
.
acceptPartial :: RegExp c -> [c] -> BoolSource
Checks whether a regular expression matches a subword of the given
word. For example, acceptPartial (fromString "b") "abc"
yields True
because "abc"
contains the substring "b"
.
matchingCount :: Num a => RegExp c -> [c] -> aSource
Computes in how many ways a word can be matched against a regular expression.
fullMatch :: Weight a b w => RegExp a -> [b] -> wSource
Matches a regular expression against a word computing a weight in an arbitrary semiring.
The symbols can have associated weights associated by the
symWeight
function of the Weight
class. This function also
allows to adjust the type of the used alphabet such that, for
example, positional information can be taken into account by
zip
ping the word with positions.