Copyright (C) 2008 Martin Sulzmann. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
* Neither the name of Isaac Jones nor the names of other
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
We implement some standard regular expression operations:
- equality testing
- containment testing
- intersection
The classic approach is to turn regular expressions
into automata. Something we have learned at university.
Here, we consider a novel approach where we only
symbolically manipulate regular expressions to implement
the above operations. The approach relies on
Antimirov's regular expression rewriting algorithm
based on partial derivatives.
The (intersection) algorithm (highlights):
r1 "intersect" r2 =
(l1, partDeriv r1 l1 "intersect" partDeriv r2 l1)
| ...
| (l1, partDeriv r1 ln "intersect" partDeriv r2 ln)
where we assume
[l1,...,ln] = lettersOf (r1 | r2)
The idea is that each regular expression is turned into a
'disjunction' of clauses where each clause starts with
a letter, followed by the partial derivative of the regular
expression wrt that letter.
You probably will know Brozozowski's derivative operation
which says that
derivative r l = [ w | (l,w) in L(r) ]
If we would use derivatives the above algorithm
is non-terminating. For example,
deriv (Star (L 'A')) 'A'
--> <<>,'A'*>
deriv (deriv (Star (L 'A')) 'A') 'A'
--> (<{},'A'*>|<<>,'A'*>)
and so on. The nice property of partial derivatives is that
they are 'finite'. Each regular expression has a finite
number of partial derivatives. For example,
partDeriv (Star (L 'A')) 'A'
--> <<>,'A'*>
partDeriv (partDeriv (Star (L 'A')) 'A') 'A'
--> <<>,'A'*>
Please check the paper
Valentin M. Antimirov: Partial Derivatives of Regular Expressions and Finite Automaton Constructions.
Theor. Comput. Sci. 155(2): 291-319 (1996)
for more details on derivatives, partial derivatives and their connection
(roughly, derivatives correspond to DFA states and partial derivatives correspond to NFA states).
There's another source of non-termination which we explain via
the following example.
A* `intersect` A*
--> (A, A* `intersect` A*)
where we simplified
partDeriv A* A --> <<>,A*> to A*
The point to note is that the proof tree is infinite. The standard approach
is to use co-induction which then yields a finite (recursive) proof.
OK, here's finally the Haskell implementation:
>
We're just using the GADT syntax to write 'ordinary' algebraic data types
> module RegExpr.RegExprOperations where
> import List
> data RE a where
> Phi :: RE a
> Empty :: RE a
> L :: a -> RE a
> Choice :: RE a -> RE a -> RE a
> Seq :: RE a -> RE a -> RE a
> Star :: RE a -> RE a
> Var :: Int -> RE a
> deriving Eq
A word is a list of alphabet letters
> type Word a = [a]
Pretty printing of regular expressions
> instance Show a => Show (RE a) where
> show Phi = "{}"
> show Empty = "<>"
> show (L c) = show c
> show (Choice r1 r2) = ("(" ++ show r1 ++ "|" ++ show r2 ++ ")")
> show (Seq r1 r2) = ("<" ++ show r1 ++ "," ++ show r2 ++ ">")
> show (Star r) = (show r ++ "*")
Turning a list of regular expressions into a regular expression by using (choice)
> resToRE :: [RE a] -> RE a
> resToRE (r:res) = foldl Choice r res
> resToRE [] = Phi
Computing the set of letters of a regular expression
> sigmaRE :: Eq a => RE a -> [a]
> sigmaRE (L l) = [l]
> sigmaRE (Seq r1 r2) = nub ((sigmaRE r1) ++ (sigmaRE r2))
> sigmaRE (Choice r1 r2) = nub ((sigmaRE r1) ++ (sigmaRE r2))
> sigmaRE (Star r) = sigmaRE r
> sigmaRE Phi = []
> sigmaRE Empty = []
Testing if a regular expression is empty (empty word)
> isEmpty :: RE a -> Bool
> isEmpty Phi = False
> isEmpty Empty = True
> isEmpty (Choice r1 r2) = (isEmpty r1) || (isEmpty r2)
> isEmpty (Seq r1 r2) = (isEmpty r1) && (isEmpty r2)
> isEmpty (Star r) = True
> isEmpty (L _) = False
Testing if a regular expression contains nothing
> isPhi :: RE a -> Bool
> isPhi Phi = True
> isPhi Empty = False
> isPhi (Choice r1 r2) = (isPhi r1) && (isPhi r2)
> isPhi (Seq r1 r2) = (isPhi r1) || (isPhi r2)
> isPhi (Star r) = False
> isPhi (L _) = False
Brozozowski's derivative operation
deriv r l denotes the regular expression where the
"leading l has been removed"
(not necessary for the intersection algorithm,
only included to illustrate the difference
to the upcoming partial derivative algorithm)
> deriv :: Eq a => RE a -> a -> RE a
> deriv Phi _ = Phi
> deriv Empty _ = Phi
> deriv (L l1) l2
> | l1 == l2 = Empty
> | otherwise = Phi
> deriv (Choice r1 r2) l =
> Choice (deriv r1 l) (deriv r2 l)
> deriv (Seq r1 r2) l =
> if isEmpty r1
> then Choice (Seq (deriv r1 l) r2) (deriv r2 l)
> else Seq (deriv r1 l) r2
> deriv (this@(Star r)) l =
> Seq (deriv r l) this
(a variant) of Antimirov's partial derivative operation
Antimirov demands that partDeriv (Star (L 'A')) 'A' yields [A*]
whereas our version yields [<<>,'A'*>].
The difference is not essential here.
> partDeriv :: Eq a => RE a -> a -> [RE a]
> partDeriv Phi l = []
> partDeriv Empty l = []
> partDeriv (L l') l
> | l == l' = [Empty]
> | otherwise = []
> partDeriv (Choice r1 r2) l = nub ((partDeriv r1 l) ++ (partDeriv r2 l))
> partDeriv (Seq r1 r2) l
> | isEmpty r1 =
> let s1 = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
> s2 = partDeriv r2 l
> in nub (s1 ++ s2)
> | otherwise = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
> partDeriv (Star r) l = [ (Seq r' (Star r)) | r' <- partDeriv r l ]
Here's a failed attempt of the partial derivative based intersection algorithm.
> type Env a = [((RE a, RE a), RE a)]
> intersectREFAiled :: Eq a => RE a -> RE a -> RE a
> intersectREFAiled r1 r2 = intersectCFailed [] r1 r2
> intersectCFailed :: Eq a => Env a -> RE a -> RE a -> RE a
> intersectCFailed env r1 r2
> | r1 == Phi || r2 == Phi = Phi
> | r1 == Empty || r2 == Empty = Empty
> | otherwise =
> case lookup (r1,r2) env of
> Just r -> r
> Nothing ->
> let letters = sigmaRE (r1 `Choice` r2)
> env' = ((r1,r2),r):env
> r1l l = resToRE $ partDeriv r1 l
> r2l l = resToRE $ partDeriv r2 l
> r' = resToRE $ map (\l -> Seq (L l) (intersectCFailed env' (r1l l) (r2l l))) letters
> r = if (isEmpty r1) && (isEmpty r2)
> then Choice r' Empty
> else r'
> in r
The problem with the algorithm is that we use recursion in Haskell to model
recursive proofs. For example, try
intersectREFailed rAstar rAstar
We need first-order syntax to represent recursive proofs/regular expressions.
The convert function below then builds the non-recursive regular expression.
Converting a regular equation into a regular expression
We assume that variables x appears (if at all) at position (r,Var x)
convert2 traveres the regexp and yields (r1,r2)
where the invariant is that r1 is part of the loop and r2 is the base case
> convert :: Int -> RE a -> RE a
> convert x r = let (r1,r2) = convert2 x r
> in Seq (Star r1) r2
> convert2 :: Int -> RE a -> (RE a, RE a)
> convert2 x Empty = (Empty, Empty)
> convert2 x (Var y)
> | x == y = (Empty,Empty)
> | otherwise = (Empty, Var y)
> convert2 x (r@(Seq l r1))
> | contains x r1 = let (r2,r3) = convert2 x r1
> in (Seq l r2, r3)
> | otherwise = (Empty, r)
> convert2 x (Choice r1 r2) = let (r1', r1'') = convert2 x r1
> (r2', r2'') = convert2 x r2
> in (Choice r1' r2', Choice r1'' r2'')
> contains :: Int -> RE a -> Bool
> contains x (Var y) = x == y
> contains x (Seq r1 r2) = contains x r1 || contains x r2
> contains x (Star r) = contains x r
> contains x (Choice r1 r2) = contains x r1 || contains x r2
> contains x _ = False
Here's the successful attempt of the partial derivative based intersection algorithm.
> intersectRE :: Eq a => RE a -> RE a -> RE a
> intersectRE r1 r2 = intersectC 1 [] r1 r2
> intersectC :: Eq a => Int -> Env a -> RE a -> RE a -> RE a
> intersectC cnt env r1 r2
> | r1 == Phi || r2 == Phi = Phi
> | r1 == Empty || r2 == Empty = Empty
> | otherwise =
> case lookup (r1,r2) env of
> Just r -> r
> Nothing ->
> let letters = sigmaRE (r1 `Choice` r2)
> env' = ((r1,r2),Var cnt):env
> r1l l = resToRE $ partDeriv r1 l
> r2l l = resToRE $ partDeriv r2 l
> r' = resToRE $ map (\l -> Seq (L l) (intersectC (cnt+1) env' (r1l l) (r2l l))) letters
> r = if (isEmpty r1) && (isEmpty r2)
> then Choice r' Empty
> else r'
> in convert cnt r
For testing purposes, it's handy to have a function which tests for (semantic)
equality among regular expressions (again written using partial derivatives).
> type EnvEq a = [(RE a, RE a)]
> eqRE :: Eq a => RE a -> RE a -> Bool
> eqRE r1 r2 = eqREC [] r1 r2
> eqREC :: Eq a => EnvEq a -> RE a -> RE a -> Bool
> eqREC env r1 r2
> | isEmpty r1 && (not (isEmpty r2)) = False
> | isPhi r1 && (not (isPhi r2)) = False
> | otherwise =
> if elem (r1,r2) env
> then True
> else let letters = sigmaRE (r1 `Choice` r2)
> env' = (r1,r2):env
> r1l l = resToRE $ partDeriv r1 l
> r2l l = resToRE $ partDeriv r2 l
> in and $ map (\l -> eqREC env' (r1l l) (r2l l)) letters
> containsRECheap :: Eq a => RE a -> RE a -> Bool
> containsRECheap r1 r2 = eqRE r1 (intersectRE r1 r2)
> containsRE :: Eq a => RE a -> RE a -> Bool
> containsRE r1 r2 = containsC [] r1 r2
> containsC :: Eq a => EnvEq a -> RE a -> RE a -> Bool
> containsC env r1 r2
> | r1 == Empty = isEmpty r2
> | r1 == Phi = True
> | r2 == Phi = eqRE r1 Phi
> | r2 == Empty = eqRE r1 Empty
> | elem (r1,r2) env = True
> | otherwise =
> let letters = sigmaRE (r1 `Choice` r2)
> env' = (r1,r2) :env
> r1l l = resToRE $ partDeriv r1 l
> r2l l = resToRE $ partDeriv r2 l
> b = and $ map (\l -> containsC env' (r1l l) (r2l l)) letters
> in b