RegExpr/RegExprOperations.lhs

Copyright (C) 2008 Martin Sulzmann. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Isaac Jones nor the names of other contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. We implement some standard regular expression operations: - equality testing - containment testing - intersection The classic approach is to turn regular expressions into automata. Something we have learned at university. Here, we consider a novel approach where we only symbolically manipulate regular expressions to implement the above operations. The approach relies on Antimirov's regular expression rewriting algorithm based on partial derivatives. The (intersection) algorithm (highlights): r1 "intersect" r2 = (l1, partDeriv r1 l1 "intersect" partDeriv r2 l1) | ... | (l1, partDeriv r1 ln "intersect" partDeriv r2 ln) where we assume [l1,...,ln] = lettersOf (r1 | r2) The idea is that each regular expression is turned into a 'disjunction' of clauses where each clause starts with a letter, followed by the partial derivative of the regular expression wrt that letter. You probably will know Brozozowski's derivative operation which says that derivative r l = [ w | (l,w) in L(r) ] If we would use derivatives the above algorithm is non-terminating. For example, deriv (Star (L 'A')) 'A' --> <<>,'A'*> deriv (deriv (Star (L 'A')) 'A') 'A' --> (<{},'A'*>|<<>,'A'*>) and so on. The nice property of partial derivatives is that they are 'finite'. Each regular expression has a finite number of partial derivatives. For example, partDeriv (Star (L 'A')) 'A' --> <<>,'A'*> partDeriv (partDeriv (Star (L 'A')) 'A') 'A' --> <<>,'A'*> Please check the paper Valentin M. Antimirov: Partial Derivatives of Regular Expressions and Finite Automaton Constructions. Theor. Comput. Sci. 155(2): 291-319 (1996) for more details on derivatives, partial derivatives and their connection (roughly, derivatives correspond to DFA states and partial derivatives correspond to NFA states). There's another source of non-termination which we explain via the following example. A* `intersect` A* --> (A, A* `intersect` A*) where we simplified partDeriv A* A --> <<>,A*> to A* The point to note is that the proof tree is infinite. The standard approach is to use co-induction which then yields a finite (recursive) proof. OK, here's finally the Haskell implementation:

> {-# LANGUAGE GADTs #-}

We're just using the GADT syntax to write 'ordinary' algebraic data types

> module RegExpr.RegExprOperations where

> import List

> data RE a where
>   Phi :: RE a                     -- empty language
>   Empty :: RE a                   -- empty word
>   L :: a -> RE a                  -- single letter taken from alphabet a
>   Choice :: RE a -> RE a -> RE a  -- r1 + r2
>   Seq :: RE a -> RE a -> RE a     -- (r1,r2)
>   Star :: RE a -> RE a            -- r*
>   Var :: Int -> RE a              -- first-order variables to represent regular equations
>  deriving Eq

A word is a list of alphabet letters

> type Word a = [a]

Pretty printing of regular expressions

> instance Show a => Show (RE a) where
>     show Phi = "{}"
>     show Empty = "<>"
>     show (L c) = show c
>     show (Choice r1 r2) = ("(" ++ show r1 ++ "|" ++ show r2 ++ ")")
>     show (Seq r1 r2) = ("<" ++ show r1 ++ "," ++ show r2 ++ ">")
>     show (Star r) = (show r ++ "*")

Turning a list of regular expressions into a regular expression by using (choice)

> resToRE :: [RE a] -> RE a
> resToRE (r:res) = foldl Choice r res
> resToRE [] = Phi

Computing the set of letters of a regular expression

> sigmaRE :: Eq a => RE a -> [a]
> sigmaRE (L l) = [l]
> sigmaRE (Seq r1 r2) = nub ((sigmaRE r1) ++ (sigmaRE r2))
> sigmaRE (Choice r1 r2) = nub ((sigmaRE r1) ++ (sigmaRE r2))
> sigmaRE (Star r) = sigmaRE r
> sigmaRE Phi = []
> sigmaRE Empty = []

Testing if a regular expression is empty (empty word)

> isEmpty :: RE a -> Bool
> isEmpty Phi = False
> isEmpty Empty = True
> isEmpty (Choice r1 r2) = (isEmpty r1) || (isEmpty r2)
> isEmpty (Seq r1 r2) = (isEmpty r1) && (isEmpty r2)
> isEmpty (Star r) = True
> isEmpty (L _) = False

Testing if a regular expression contains nothing

> isPhi :: RE a -> Bool
> isPhi Phi = True
> isPhi Empty = False
> isPhi (Choice r1 r2) = (isPhi r1) && (isPhi r2)
> isPhi (Seq r1 r2) = (isPhi r1) || (isPhi r2)
> isPhi (Star r) = False
> isPhi (L _) = False

Brozozowski's derivative operation deriv r l denotes the regular expression where the "leading l has been removed" (not necessary for the intersection algorithm, only included to illustrate the difference to the upcoming partial derivative algorithm)

> deriv :: Eq a => RE a -> a -> RE a
> deriv Phi _ = Phi
> deriv Empty _ = Phi
> deriv (L l1) l2
>    | l1 == l2  = Empty
>    | otherwise = Phi
> deriv (Choice r1 r2) l = 
>   Choice (deriv r1 l) (deriv r2 l)
> deriv (Seq r1 r2) l =
>   if isEmpty r1 
>   then Choice (Seq (deriv r1 l) r2) (deriv r2 l)
>   else Seq (deriv r1 l) r2
> deriv (this@(Star r)) l =
>   Seq (deriv r l) this

(a variant) of Antimirov's partial derivative operation Antimirov demands that partDeriv (Star (L 'A')) 'A' yields [A*] whereas our version yields [<<>,'A'*>]. The difference is not essential here.

> partDeriv :: Eq a => RE a -> a -> [RE a]
> partDeriv Phi l = []
> partDeriv Empty l = []
> partDeriv (L l') l 
>     | l == l'   = [Empty]
>     | otherwise = []
> partDeriv (Choice r1 r2) l = nub ((partDeriv r1 l) ++ (partDeriv r2 l))
> partDeriv (Seq r1 r2) l 
>     | isEmpty r1 = 
>           let s1 = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
>               s2 = partDeriv r2 l
>           in nub (s1 ++ s2)
>     | otherwise = [ (Seq r1' r2) | r1' <- partDeriv r1 l ]
> partDeriv (Star r) l = [ (Seq r' (Star r)) | r' <- partDeriv r l ]

Here's a failed attempt of the partial derivative based intersection algorithm.

> type Env a = [((RE a, RE a), RE a)]

> intersectREFAiled :: Eq a => RE a -> RE a -> RE a
> intersectREFAiled r1 r2 = intersectCFailed [] r1 r2

> intersectCFailed :: Eq a => Env a -> RE a -> RE a -> RE a
> intersectCFailed env r1 r2 
>  | r1 == Phi || r2 == Phi = Phi
>  | r1 == Empty || r2 == Empty = Empty
>  | otherwise =
>      case lookup (r1,r2) env of
>       Just r -> r
>       Nothing ->
>         let letters = sigmaRE (r1 `Choice` r2)
>             env' = ((r1,r2),r):env
>             r1l l = resToRE $ partDeriv r1 l
>             r2l l = resToRE $ partDeriv r2 l
>             r' = resToRE $ map (\l -> Seq (L l) (intersectCFailed env' (r1l l) (r2l l))) letters
>             r =  if (isEmpty r1) && (isEmpty r2)
>                  then Choice r' Empty
>                  else r'
>         in r

The problem with the algorithm is that we use recursion in Haskell to model recursive proofs. For example, try intersectREFailed rAstar rAstar We need first-order syntax to represent recursive proofs/regular expressions. The convert function below then builds the non-recursive regular expression. Converting a regular equation into a regular expression We assume that variables x appears (if at all) at position (r,Var x) convert2 traveres the regexp and yields (r1,r2) where the invariant is that r1 is part of the loop and r2 is the base case

> convert :: Int -> RE a -> RE a
> convert x r = let (r1,r2) = convert2 x r
>               in Seq (Star r1) r2

> convert2 :: Int -> RE a -> (RE a, RE a)
> convert2 x Empty = (Empty, Empty)
> convert2 x (Var y)
>        | x == y    = (Empty,Empty)
>        | otherwise = (Empty, Var y) -- can this happen?
> convert2 x (r@(Seq l r1))
>        | contains x r1 = let (r2,r3) = convert2 x r1
>                          in (Seq l r2, r3)
>        | otherwise = (Empty, r)
> convert2 x (Choice r1 r2) = let (r1', r1'') = convert2 x r1
>                                 (r2', r2'') = convert2 x r2
>                             in (Choice r1' r2', Choice r1'' r2'')

> contains :: Int -> RE a -> Bool
> contains x (Var y) = x == y
> contains x (Seq r1 r2) = contains x r1 || contains x r2
> contains x (Star r) = contains x r
> contains x (Choice r1 r2) = contains x r1 || contains x r2
> contains x _ = False

Here's the successful attempt of the partial derivative based intersection algorithm.

> intersectRE :: Eq a => RE a -> RE a -> RE a
> intersectRE r1 r2 = intersectC 1 [] r1 r2

> intersectC :: Eq a => Int -> Env a -> RE a -> RE a -> RE a
> intersectC cnt env r1 r2 
>  | r1 == Phi || r2 == Phi = Phi
>  | r1 == Empty || r2 == Empty = Empty
>  | otherwise =
>      case lookup (r1,r2) env of
>       Just r -> r
>       Nothing ->
>         let letters = sigmaRE (r1 `Choice` r2)
>             env' = ((r1,r2),Var cnt):env
>             r1l l = resToRE $ partDeriv r1 l
>             r2l l = resToRE $ partDeriv r2 l
>             r' = resToRE $ map (\l -> Seq (L l) (intersectC (cnt+1) env' (r1l l) (r2l l))) letters
>             r =  if (isEmpty r1) && (isEmpty r2)
>                  then Choice r' Empty
>                  else r'
>         in convert cnt r

For testing purposes, it's handy to have a function which tests for (semantic) equality among regular expressions (again written using partial derivatives).

> type EnvEq a = [(RE a, RE a)]

> eqRE :: Eq a => RE a -> RE a -> Bool
> eqRE r1 r2 = eqREC [] r1 r2

> eqREC :: Eq a => EnvEq a -> RE a -> RE a -> Bool
> eqREC env r1 r2 
>   | isEmpty r1 && (not (isEmpty r2)) = False
>   | isPhi r1 && (not (isPhi r2)) = False
>   | otherwise =
>      if elem (r1,r2) env
>      then True
>      else let letters = sigmaRE (r1 `Choice` r2)
>               env' = (r1,r2):env
>               r1l l = resToRE $ partDeriv r1 l
>               r2l l = resToRE $ partDeriv r2 l
>           in and $ map (\l -> eqREC env' (r1l l) (r2l l)) letters

> containsRECheap :: Eq a => RE a -> RE a -> Bool
> containsRECheap r1 r2 = eqRE r1 (intersectRE r1 r2)

> containsRE :: Eq a => RE a -> RE a -> Bool
> containsRE r1 r2 = containsC [] r1 r2

> containsC :: Eq a => EnvEq a -> RE a -> RE a -> Bool
> containsC env r1 r2
>   | r1 == Empty = isEmpty r2
>   | r1 == Phi = True
>   | r2 == Phi = eqRE r1 Phi       -- I think r1 == Phi should be fine
>   | r2 == Empty = eqRE r1 Empty   -- same here
>   | elem (r1,r2) env = True
>   | otherwise = 
>         let letters = sigmaRE (r1 `Choice` r2)
>             env' = (r1,r2) :env
>             r1l l = resToRE $ partDeriv r1 l
>             r2l l = resToRE $ partDeriv r2 l
>             b = and $ map (\l -> containsC env' (r1l l) (r2l l)) letters
>         in b