{-| Implementation of the GLL parsing algorithm [Scott and Johnstone 2010,2013,2016] with the grammar as an explicit parameter. Function 'parse' receives a 'Grammar' as input together with a list of tokens (the input string). The type of token is chosen arbitrarily, but the type should be 'Parseable' and 'Ord'erable. To be 'Parseable' a type must have two distinct values, 'eos' (end-of-string) and 'eps' (epsilon). The user must ensure that these two values will never occur as part of the input string. == GLL Parsing === Recursive Descent GLL parsing is a generalisation of recursive descent parsing (RD parsing). A RD parser (RDP), for some grammar 'G' , consists of a set of parse functions 'f_X', one for every nonterminal 'X', and a main function which calls 'f_S', where 'S' is the start symbol. The parse function 'f_X' take an integer 'l' as an argument and produces an integer 'r', indicating that nonterminal 'X' derives 's_l_r', where 's_i_j' is the substring of the input string 's' ranging from 'i' to 'j'. We call 'l' and 'r' the right- and left-extent, respectively. The parse function 'f_X' has a branch for every production X ::= s_1 ... s_k in 'G', guarded by a look-ahead test, and every branch has 'k' code fragments, one for every symbol 's_i', with 1 <= i <= k. A RDP matches grammar positions, represented by /slots/ of the form X ::= a . b, with (input) string positions. The dot in a slot tells us how much of the production's symbols have been matched (the symbols before the dot) and which symbols still need to be matched (the symbols after the dot). The symbol immediately after the dot is the next symbol to be match and is either: * A terminal token, matched directly with the token at the current string position. * A nonterminal 'Y', for which 'f_Y' is called. In the case of LL(1) deterministic parsing, only one (or none) of the productions of 'Y' passes the lookahead-test, say "Y ::= c", and its branch will be executed: the next grammar position is "Y ::= .c". * No further symbol, represented by "X ::= d." (all symbols have been processed). In this case a return call is made to the caller of 'f_X' (relying on a function call stack). === Handling function/return calls GLL handles its own function calls and return calls, instead of relying on an underlying mechanism. This form of low-level control allows GLL to avoid much duplicate work, not only for function calls (as in classical memoisation) but also for return calls. The underlying observation is that both return calls and function calls continue matching grammar slots. In non-deterministic RDP, every function call leads to a slot of the form "X ::= . a" being processed, while every return call leads to a slot of the form "X ::= aY.b" being processed, where 'Y' is some nonterminal. GLL uses /descriptors/, containing a slot of one of these forms, to uniquely identify the computation that processes the slot. The descriptor therefore also needs to contain the initial values of the local variables used in that computation. A generated GLL parser (Scott and Johnstone 2013) has a code fragment for every nonterminal 'X' (labelled 'L_X') and slot (labelled "L_{X ::= a.b}"). This Haskell implementation abstracts over the grammar and has a function for executing 'L_X', for a given 'X', and a function for executing "L_{X ::= a.b}", for a given "X ::= a.b". === Generalisation GLL parsing generalises RD parsing by allowing non-determinism: when processing "X ::= a.Yb", all productions of 'Y', that pass the lookahead test, are considered. A slot is considered, by adding a descriptor for it to the /worklist/ 'R'. Duplicates in the worklist are avoided by maintaining a separate descriptor-set 'U' containing all descriptors added to the worklist before. The result of a parse function 'f_X' is no longer a single right extent 'r'. Instead, it is a list of right extents 'rs', indicating that 'X' derives 's_l_r' for all 'r' in 'rs' and integer input 'l' (left extent). Every discovered right extent is stored in the /pop-set/ 'P'. When a descriptors for a function call is a duplicate, it is not added to the worklist, but we have to make sure that the corresponding return call is still made. Note that a function call to 'f_Y', with the same parameters, can be made from multiple right-hand side occurrences of 'Y'. It might be the case that: * The original descriptors is still being processed. Once finished, a descriptor must be added for all return calls corresponding to function calls that lead to duplicates of this descriptor. GLL uses a Graph-Structured Stack (GSS) to efficiently maintain multiple such continuations. * The original descriptors has already been processed. In this case, one or more right extents 'rs' are stored in 'P' for the corresponding function call. A descriptor for the return call must be added for all 'r' in 'rs'. The descriptor for the return call must be added to the GSS in this case as well, as other right extents might be found in the future. == Usage This module provides generalised parsing to other applications that work with BNF grammars. The user should provide a 'Grammar' and an input string as arguments to top-level functions 'parse' or 'parseWithOptions'. === Example This example shows simple character level parsing. First we must make 'Char' and instance of 'Parseable'. @ instance Parseable Char where eos = \'$\' eps = '#' @ This instance mandates that \'$\' and '#' are 'reserved tokens' and not part of the input string. This instance is available as an import: "GLL.Parseable.Char". "GLL.Parser" exports smart constructors for constructing 'Grammar's. @ grammar1 = (start \"X\" , [prod \"X\" [nterm \"A\", nterm \"A\"] , prod \"A\" [term \'a\'] , prod \"A\" [term \'a\', term \'a\'] ] ) fail1 = "a" success1 = "aa" success2 = "aaa" fail2 = "aaaaa" @ Note that there are two possible derivations of 'success2'. The parser can be accessed through 'parse' or 'parseWithOptions'. @ run1 = parse grammar1 success1 run2 = parseWithOptions [fullSPPF, strictBinarisation] grammar1 success2 @ The options 'fullSPPF', 'allNodes', 'packedNodesOnly', decide whether all SPPF nodes and edges are inserted into the resulting value of the 'SPPF' type. Packed nodes are enough to fully represent an SPPF, as the parent and children of a packed node can be computed from the packed nodes' information. For efficiency the 'SPPF' is not strictly binarised by default: a packed node might have a symbol node as a left child. In a strictly binarised 'SPPF' a packed node has an intermediate node as a left child, or no left child at all. To create a strictly binarised 'SPPF' (necessary for "GLL.Combinators") the option 'strictBinarisation' is available. === Combinator interface Module "GLL.Combinators.Interface" provides a combinator interface to access "GLL.Parser". Applicative-like combinators are used to specify a 'Grammar' and call 'parse'. The 'SPPF' is then used to produce semantic results. -} module GLL.Parser ( -- * Grammar Grammar(..), Prods(..), Prod(..), Symbols(..), Symbol(..), Slot(..), -- ** Smart constructors for creating 'Grammar's start, prod, nterm, term, -- ** Parseable tokens Parseable(..), Input, mkInput, -- * Run the GLL parser parse, parseArray, -- ** Run the GLL parser with options parseWithOptions, parseWithOptionsArray, -- *** ParseOptions ParseOptions, ParseOption, strictBinarisation, fullSPPF, allNodes, packedNodesOnly, maximumErrors, noSelectTest, -- ** Result ParseResult(..), SPPF(..), SPPFNode(..), SymbMap, ImdMap, PackMap, EdgeMap, showSPPF, ) where import Data.Foldable hiding (forM_, toList, sum) import Prelude hiding (lookup, foldr, fmap, foldl, elem, any, concatMap) import Control.Applicative import Control.Monad import qualified Data.IntMap as IM import qualified Data.Map as M import qualified Data.Array as A import qualified Data.Set as S import qualified Data.IntSet as IS import Data.Text (pack) import Text.PrettyPrint.HughesPJ as PP import GLL.Types.Grammar import GLL.Types.Derivations import GLL.Flags -- | Create an 'Nt' (nonterminal) from a String. string2nt :: String -> Nt string2nt = pack -- | A smart constructor for creating a start 'Nt' (nonterminal). start :: String -> Nt start = string2nt -- | A smart constructor for creating a 'Prod' (production). prod :: String -> Symbols t -> Prod t prod x = Prod (string2nt x) -- | A smart constructor for creating a nonterminal 'Symbol'. nterm :: String -> Symbol t nterm = Nt . string2nt -- | A smart constructor for creating a terminal 'Symbol'. term :: t -> Symbol t term = Term -- | Representation of the input string type Input t = A.Array Int t mkInput :: (Parseable t) => [t] -> Input t mkInput input = A.listArray (0,m) (input++[eos]) where m = length input -- | Types for type LhsParams t = (Nt, Int) type RhsParams t = (Slot t, Int, Int) -- | The worklist and descriptor set type Rcal t = [(RhsParams t, SPPFNode t)] type Ucal t = IM.IntMap (IM.IntMap (S.Set (Slot t))) -- | GSS representation type GSS t = IM.IntMap (M.Map Nt [GSSEdge t]) type GSSEdge t = (Slot t, Int, SPPFNode t) -- return position, left extent type GSSNode t = (Nt, Int) type MisMatches t = IM.IntMap (S.Set t) -- | Pop-set type Pcal t = IM.IntMap (M.Map Nt [Int]) -- | Connecting it all data Mutable t = Mutable { mut_sppf :: SPPF t , mut_worklist :: Rcal t , mut_descriptors :: Ucal t , mut_gss :: GSS t , mut_popset :: Pcal t , mut_mismatches :: MisMatches t , mut_counters :: Counters } data Counters = Counters { count_successes :: Int , count_pnodes :: Int } -- | Monad for implicitly passing around 'context' data GLL t a = GLL (Flags -> Mutable t -> (a, Mutable t)) runGLL :: GLL t a -> Flags -> Mutable t -> Mutable t runGLL (GLL f) o p = snd $ f o p addSPPFEdge f t = GLL $ \flags mut -> let sppf' = (if symbol_nodes flags then sNodeInsert f t else id) $ (if intermediate_nodes flags then iNodeInsert f t else id) $ (if edges flags then eMapInsert f t else id) $ pMapInsert f t (mut_sppf mut) in ((),mut{mut_sppf = sppf'}) addDescr sppf alt@(slot,i,l) = GLL $ \_ mut -> let new = maybe True inner $ IM.lookup i (mut_descriptors mut) where inner m = maybe True (not . (slot `S.member`)) $ IM.lookup l m newU = IM.alter inner i (mut_descriptors mut) where inner mm = case mm of Nothing -> Just $ IM.singleton l single Just m -> Just $ IM.insertWith (S.union) l single m single = S.singleton slot in if new then ((), mut{mut_worklist = (alt,sppf):(mut_worklist mut) ,mut_descriptors = newU}) else ((), mut) getDescr = GLL $ \_ mut -> case mut_worklist mut of [] -> (Nothing, mut) (next@(alt,sppf):rest) -> (Just next, mut{mut_worklist = rest}) addPop (gs,l) i = GLL $ \_ mut -> let newP = IM.alter inner l (mut_popset mut) where inner mm = case mm of Nothing -> Just $ M.singleton gs [i] Just m -> Just $ M.insertWith (++) gs [i] m in ((), mut{mut_popset = newP}) getChildren (gs,l) = GLL $ \_ mut -> let res = maybe [] inner $ IM.lookup l (mut_gss mut) where inner m = maybe [] id $ M.lookup gs m in (res, mut) addGSSEdge f@(gs,i) t = GLL $ \_ mut -> let newGSS = IM.alter inner i (mut_gss mut) where inner mm = case mm of Nothing -> Just $ M.singleton gs [t] Just m -> Just $ M.insertWith (++) gs [t] m in ((), mut{mut_gss = newGSS}) getPops (gs,l) = GLL $ \_ mut -> let res = maybe [] inner $ IM.lookup l (mut_popset mut) where inner = maybe [] id . M.lookup gs in (res, mut) addSuccess = GLL $ \_ mut -> let mut' = mut { mut_counters = counters { count_successes = 1 + count_successes counters } } counters = mut_counters mut in ((),mut') getFlags = GLL $ \fs ctx -> (fs, ctx) addMisMatch :: (Ord t) => Int -> S.Set t -> GLL t () addMisMatch k ts = GLL $ \flags mut -> let newM = IM.insertWith S.union k ts (mut_mismatches mut) newM' | length (IM.keys newM) > max_errors flags = IM.deleteMin newM | otherwise = newM in ((), mut{mut_mismatches = newM'}) instance (Show t) => Show (SPPFNode t) where show (SNode (s, l, r)) = "(s: " ++ show s ++ ", " ++ show l ++ ", " ++ show r ++ ")" show (INode (s, l, r)) = "(i: " ++ show s ++ ", " ++ show l ++ ", " ++ show r ++ ")" show (PNode (p, l, k, r)) = "(p: " ++ show p ++ ", " ++ show l ++ ", " ++ show k ++ ", " ++ show r ++ ")" show Dummy = "$" instance Applicative (GLL t) where (<*>) = ap pure = return instance Functor (GLL t) where fmap = liftM instance Monad (GLL t) where return a = GLL $ \_ p -> (a, p) (GLL m) >>= f = GLL $ \o p -> let (a, p') = m o p (GLL m') = f a in m' o p' -- | -- Run the GLL parser given a 'Grammar' 't' and a list of 't's, -- where 't' is an arbitrary token-type. -- All token-types must be 'Parseable'. parse :: (Parseable t) => Grammar t -> [t] -> ParseResult t parse = parseWithOptions [] -- | -- Run the GLL parser given a 'Grammar' 't' and an 'Array' of 't's, -- where 't' is an arbitrary token-type. -- All token-types must be 'Parseable'. parseArray :: (Parseable t) => Grammar t -> Input t -> ParseResult t parseArray = parseWithOptionsArray [] -- | -- Variant of 'parseWithOptionsArray' where the input is a list of 'Parseable's rather than an 'Array' parseWithOptions :: Parseable t => ParseOptions -> Grammar t -> [t] -> ParseResult t parseWithOptions opts gram = parseWithOptionsArray opts gram . mkInput -- | -- Run the GLL parser given some options, a 'Grammar' 't' and a list of 't's. -- -- If no options are given a minimal 'SPPF' will be created: -- -- * only packed nodes are created -- * the resulting 'SPPF' is not strictly binarised parseWithOptionsArray :: Parseable t => ParseOptions -> Grammar t -> Input t -> ParseResult t parseWithOptionsArray opts grammar@(start,_) input = let flags = runOptions opts (mutable,_,_) = gll flags m False grammar input (_, m) = A.bounds input in resultFromMutable input flags mutable (Nt start, 0, m) gll :: Parseable t => Flags -> Int -> Bool -> Grammar t -> Input t -> (Mutable t, SelectMap t, FollowMap t) gll flags m debug (start, prods) input = (runGLL (pLhs (start, 0)) flags context, selects, follows) where context = Mutable emptySPPF [] IM.empty IM.empty IM.empty IM.empty counters counters = Counters 0 0 dispatch = do mnext <- getDescr case mnext of Nothing -> return () -- no continuation Just (next,sppf) -> pRhs next sppf pLhs (bigx, i) = do let alts = [ ((Slot bigx [] beta, i, i), first_ts) | Prod bigx beta <- altsOf bigx , let first_ts = select beta bigx ] first_ts = S.unions (map snd alts) cands = [ descr | (descr, first_ts) <- alts , select_test (input A.! i) first_ts ] if null cands then addMisMatch i first_ts else forM_ cands (addDescr Dummy) dispatch pRhs (Slot bigx alpha ((Term tau):beta), i, l) sppf = if (input A.! i `matches` tau) then do -- token test successful root <- joinSPPFs slot sppf l i (i+1) pRhs (slot, i+1, l) root else do addMisMatch i (S.singleton tau) dispatch where slot = Slot bigx (alpha++[Term tau]) beta pRhs (Slot bigx alpha ((Nt bigy):beta), i, l) sppf = if select_test (input A.! i) first_ts then do addGSSEdge ret (slot,l,sppf) rs <- getPops ret -- has ret been popped? forM_ rs $ \r -> do -- yes, use given extents root <- joinSPPFs slot sppf l i r addDescr root (slot, r, l) pLhs (bigy, i) else do addMisMatch i first_ts dispatch where ret = (bigy, i) slot = Slot bigx (alpha++[Nt bigy]) beta first_ts = select ((Nt bigy):beta) bigx pRhs (Slot bigy alpha [], i, l) sppf | bigy == start && l == 0 = if i == m then addSuccess >> dispatch else addMisMatch i (S.singleton eos) >> dispatch pRhs (Slot bigx alpha [], i, l) Dummy = do root <- joinSPPFs slot Dummy l i i pRhs (slot, i, l) root where slot = Slot bigx [] [] pRhs (Slot bigy alpha [], i, l) ynode = do addPop (bigy,l) i returns <- getChildren (bigy,l) forM_ returns $ \(gs',l',sppf) -> do root <- joinSPPFs gs' sppf l' l i -- create SPPF for lhs addDescr root (gs', i, l') -- add new descriptors dispatch (prodMap,_,_,follows,selects) | do_select_test flags = fixedMaps start prods | otherwise = (pmap, undefined, undefined, undefined, error "select-tests are switched off") where pmap = M.fromListWith (++) [ (x,[pr]) | pr@(Prod x _) <- prods ] follow x = follows M.! x do_test = do_select_test flags select rhs x | do_test = selects M.! (x,rhs) | otherwise = S.empty where select_test t set | do_test = any (matches t) set | otherwise = True altsOf x = prodMap M.! x merge m1 m2 = IM.unionWith inner m1 m2 where inner = IM.unionWith S.union count_pnode :: GLL t () count_pnode = GLL $ \flags mut -> let mut' = mut { mut_counters = mut_counters' (mut_counters mut) } where mut_counters' counters = counters { count_pnodes = count_pnodes counters + 1 } in ((), mut') joinSPPFs (Slot bigx alpha beta) sppf l k r = do flags <- getFlags case (flexible_binarisation flags, sppf, beta) of (True,Dummy, _:_) -> return snode (_,Dummy, []) -> do addSPPFEdge xnode pnode addSPPFEdge pnode snode count_pnode return xnode (_,_, []) -> do addSPPFEdge xnode pnode addSPPFEdge pnode sppf addSPPFEdge pnode snode count_pnode return xnode _ -> do addSPPFEdge inode pnode addSPPFEdge pnode sppf addSPPFEdge pnode snode count_pnode return inode where x = last alpha -- symbol before the dot snode = SNode (x, k, r) xnode = SNode (Nt bigx, l, r) inode = INode ((Slot bigx alpha beta), l, r) pnode = PNode ((Slot bigx alpha beta), l, k, r) -- | -- The "ParseResult" datatype contains the "SPPF" and some other -- information about the parse: -- -- * 'SPPF' -- * Whether the parse was successful -- * The number of descriptors that have been processed -- * The number of symbol nodes (nonterminal and terminal) -- * The number of intermediate noes -- * The number of packed nodes -- * The number of GSS nodes -- * The number of GSS edges data ParseResult t = ParseResult{ sppf_result :: SPPF t , res_success :: Bool , res_successes :: Int , nr_descriptors :: Int , nr_nterm_nodes :: Int , nr_term_nodes :: Int , nr_intermediate_nodes :: Int , nr_packed_nodes :: Int , nr_packed_node_attempts :: Int , nr_sppf_edges :: Int , nr_gss_nodes :: Int , nr_gss_edges :: Int , error_message :: String } resultFromMutable :: Parseable t => Input t -> Flags -> Mutable t -> SNode t -> ParseResult t resultFromMutable inp flags mutable s_node@(s, l, m) = let u = mut_descriptors mutable gss = mut_gss mutable usize = sum [ S.size s | (l, r2s) <- IM.assocs u , (r,s) <- IM.assocs r2s ] s_nodes = sum [ S.size s | (l, r2s) <- IM.assocs sMap , (r, s) <- IM.assocs r2s ] i_nodes = sum [ S.size s | (l, r2s) <- IM.assocs iMap , (r, s) <- IM.assocs r2s ] p_nodes = sum [ IS.size ks | (l, r2j) <- IM.assocs pMap , (r, j2s) <- IM.assocs r2j , (j, s2k) <- IM.assocs j2s , (s, ks) <- M.assocs s2k ] sppf_edges = sum [ S.size ts | (_, ts) <- M.assocs eMap ] gss_nodes = 1 + sum [ length $ M.keys x2s| (l,x2s) <- IM.assocs gss] gss_edges = 1 + sum [ length s | (l,x2s) <- IM.assocs gss , (x,s) <- M.assocs x2s ] sppf@(sMap, iMap, pMap, eMap) = mut_sppf mutable successes = count_successes (mut_counters mutable) in ParseResult sppf (successes > 0) successes usize s_nodes m i_nodes p_nodes (count_pnodes (mut_counters mutable)) sppf_edges gss_nodes gss_edges (renderErrors inp flags (mut_mismatches mutable)) renderErrors :: Parseable t => Input t -> Flags -> MisMatches t -> String renderErrors inp flags mm = render doc where n = max_errors flags locs = reverse (IM.assocs mm) doc = text ("Unsuccessful parse, showing "++ show n ++ " furthest matches") $+$ foldr (\loc -> (ppLoc loc $+$)) PP.empty locs ppLoc (k, ts) = text ("did not match at position " ++ show k ++ ", where we find " ++ lexeme) $+$ nest 4 (text "Found" <+> ppExp token) $+$ nest 4 (text "expected:") $+$ nest 8 (vcat (map ppExp (S.toList ts))) where token = inp A.! k lexeme = concatMap unlex (take 5 (drop k (A.elems inp))) ppExp t = text (unlex t) <+> text "AKA" <+> text (show t) instance Show (ParseResult t) where show res | res_success res = result_string | otherwise = result_string ++ "\n" ++ error_message res where result_string = unlines $ [ "Success " ++ show (res_success res) , "#Success " ++ show (res_successes res) , "Descriptors: " ++ show (nr_descriptors res) , "Nonterminal nodes: " ++ show (nr_nterm_nodes res) , "Terminal nodes: " ++ show (nr_term_nodes res) , "Intermediate nodes: " ++ show (nr_intermediate_nodes res) , "Packed nodes: " ++ show (nr_packed_nodes res) , "SPPF edges: " ++ show (nr_sppf_edges res) , "GSS nodes: " ++ show (nr_gss_nodes res) , "GSS edges: " ++ show (nr_gss_edges res) ]