The weighted-regexp package

[Tags: bsd3, library]

Haskell implementation of a weighted regular expression matcher with linear worst-case time and space bounds.

[Skip to ReadMe]


Versions0.1.0.0,,,,, 0.3.1,,
Change logCHANGES.markdown
Dependenciesarray (>=0.1 && <0.4), base (>=3 && <5), criterion (==0.5.*), QuickCheck (<2) [details]
AuthorThomas Wilke, Frank Huch, Sebastian Fischer
MaintainerSebastian Fischer
CategoryText, Parsing
Home page
Bug tracker
Source repositoryhead: git clone git://
Executablescriterion-re, quickcheck-re
UploadedTue Nov 30 07:29:42 UTC 2010 by SebastianFischer
Downloads1201 total (49 in last 30 days)
0 []
StatusDocs uploaded by user
Build status unknown [no reports yet]




quickcheckBuild executable to run QuickCheck testsDisabledAutomatic
criterionBuild executable to run Criterion benchmarksDisabledAutomatic

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag. More info


Maintainers' corner

For package maintainers and hackage trustees

Readme for weighted-regexp-0.3.1

% Weighted RegExp Matching

Efficient regular expression matching can be beautifully simple. Revisiting ideas from theoretical computer science, it can be implemented with linear worst-case time and space bounds in the purely functional programming language Haskell.


Since Plato wrote about philosophy in the form of dialogues, authors have used this literary form to convey their ideas. The 15th International Conference on Functional Programming features an article on Regular Expressions written as a play, A Play on Regular Expressions, which is meant to be elegant, instructive, and fun. The play discusses an efficient, purely functional algorithm for matching regular expressions. By generalizing from Booleans to arbitrary semirings, this algorithm implements various matching policies for weighted regular expressions.


An implementation of the ideas discussed in the Play on Regular Expressions is available as a Haskell library. It is implemented in pure Haskell rather than as a binding to an external library so you do not need to install an external regular expression library to use it.


<a href=""> <img src="" /> </a>


However, you need Haskell in order to use this library. By installing the Haskell Platform you get a Haskell compiler with an interactive environment as well as the package manager cabal-install and various pre-installed packages.


<img src="" alt="Cabal" width="195" height="71" />


You can install the weighted-regexp library by typing the following into a terminal:

bash# cabal update
bash# cabal install weighted-regexp


This will install the current version. Differences between versions are listed in the changelog.


The matching algorithm computes the same result as a simple inductive specification (given in the Play on Regular Expressions) but is more efficient than a direct translation of this specification into Haskell. Although the ideas behind the algorithm are not new but based on proven results from theoretical computer science, there is no correctness proof for the equivalence of the Haskell implementation of the algorithm with its specification. The equivalence is therefore confirmed by testing.

It is difficult (and tedious) to write tests manually that cover all interesting apsects of regular expression matching. Therefore, QuickCheck is used to generate tests automatically and Haskell Program Coverage (HPC) is used to monitor test coverage.

You can install the weighted-regexp library along with a test program as follows:

bash# cabal install weighted-regexp -fQuickCheck

Using the QuickCheck flag results in an additional program that you can use to test the implementation. The program tests

For testing the equivalence, QuickCheck generates random regular expressions and compares the result of the matching algorithm with the result of its specification on random words. Moreover, the program tests

For a more detailed description of the tested properties consider the source code of the test program. In order to generate an HPC report you need to download the sources of the weighted-regexp package. But you may as well consult the pregenerated coverage report instead of generating one yourself.


The matching algorithm provided by this library is usually slower than other libraries like pcre but has a better asymptotic complexity. There are no corner cases for which matching takes forever or eats all available memory. More specifically, the worst-case run time for matching a word against a regular expression is linearly bounded by the length of the word and the size of the regular expression. It is in O(nm) if n is the length of the word and m the size of the expression. The memory requirements are independent of the length of the word and linear in the size of the regular expression, that is, in O(m). Therefore, this library provides similar asymptotic complexity guarantees as Google's re2.

Here are timings that have been obtained (on a MacBook) with the current version of the library.

   input               regexp            run time     memory

100 MB of a's .* 8s (12 MB/s) 1 MB 5000 a's (a?){5000}a{5000} 13s 5 MB ~2M a's and b's .*a.{20}a.* 3.6s 1 MB

The first example measures the search speed for a simple regular expression with a long string. There is room for improvement. No time has been invested yet to improve the performance of the library with regard to constant factors.

The second example demonstrates the good asymptotic complexity of the algorithm. Unlike a backtracking implementation like pcre the library finishes in reasonable time. However, the memory requirements are higher than usual and on closer inspection one can see that almost 10 of 13 seconds are spent during garbage collection. This example uses a large regular expression which leads to a lot of garbage in the matching algorithm.

The third example pushes automata based approaches to the limit because the deterministic finite automaton that corresponds to the regular expression is exponentially large. The input has been chosen to not match the expression but is otherwise random and probably explores many different states of the automaton. The matching algorithm produces states on the fly and discards them, hence, it is fast in this example, in fact, faster than re2[^cpp].

<script src=""></script>

Unlike the Haskell program, this program keeps the whole input,
that is, the result of `getline`, in memory. Can [re2] match input
on the fly?

The benchmarks above all use large input and two of them are specifically designed as corner cases of typical matching algorithms. The run time of matching more common regular expressions against short input has been measured using Criterion in order to get statistically robust results.

You can install the weighted-regexp package with the Criterion flag to generate a program that executes the benchmarks described below:

bash# cabal install weighted-regexp -fCriterion

You can call criterion-re --help to see how to use the generated program. It tests three different examples:

For a more detailed explanation consider the source code of the benchmark program.

   matching  acceptance  #matchings  leftmost     longest  leftmost longest

unique full 3.8 us ambiguous full 11.7 us partial 20.4 us 26.2 us

Click on the numbers for a more detailed distribution of run times.



<a href=""> <img src="" /> </a>


The source code of this library is on github. You can collaborate by using it in your projects, report bugs and ask for new features in the issue tracker, or provide patches that implement pending issues.


The algorithm discussed in the Play on Regular Expressions has been implemented in different languages. In a series of two blog posts, Carl Friedrich Bolz describes a Python implementation that uses a Just In Time (JIT) compiler to achieve impressive performance. He compares his version with corresponding C++ and Java programs.

For questions and feedback email Sebastian Fischer.