Safe Haskell | Safe |
---|---|
Language | Haskell2010 |
Implementation of the GLL parsing algorithm [Scott and Johnstone 2010,2013,2016] with the grammar as an explicit parameter.
Function parse
receives a Grammar
as input together with a
list of tokens (the input string).
The type of token is chosen arbitrarily, but the type should be Parseable
and Ord
erable.
To be Parseable
a type must have two distinct values, eos
(end-of-string)
and eps
(epsilon). The user must ensure that these two values will never occur
as part of the input string.
GLL Parsing
Recursive Descent
GLL parsing is a generalisation of recursive descent parsing (RD parsing).
A RD parser (RDP), for some grammar G
, consists of a set of parse
functions f_X
, one for every nonterminal X
, and a main function which
calls f_S
, where S
is the start symbol.
The parse function f_X
take an integer l
as an argument and produces an
integer r
, indicating that nonterminal X
derives s_l_r
,
where s_i_j
is the substring of the input string s
ranging from
i
to j
. We call l
and r
the right- and left-extent, respectively.
The parse function f_X
has a branch for every production X ::= s_1 ... s_k in G
, guarded
by a look-ahead test, and every
branch has k
code fragments, one for every symbol s_i
,
with 1 <= i <= k.
A RDP matches grammar positions, represented by slots of the form
X ::= a . b, with (input) string positions.
The dot in a slot tells us how much of the production's symbols have been
matched (the symbols before the dot) and which symbols still need to
be matched (the symbols after the dot). The symbol immediately after the dot
is the next symbol to be match and is either:
- A terminal token, matched directly with the token at the current string position.
- A nonterminal
Y
, for whichf_Y
is called. In the case of LL(1) deterministic parsing, only one (or none) of the productions ofY
passes the lookahead-test, say "Y ::= c", and its branch will be executed: the next grammar position is "Y ::= .c". - No further symbol, represented by "X ::= d." (all
symbols have been processed). In this case a return call is made
to the caller of
f_X
(relying on a function call stack).
Handling function/return calls
GLL handles its own function calls and return calls, instead of relying on an
underlying mechanism. This form of low-level control allows
GLL to avoid much duplicate work, not only for function calls (as in classical
memoisation) but also for return calls. The underlying observation is that
both return calls and function calls continue matching grammar slots.
In non-deterministic RDP, every function call leads to a slot of the
form "X ::= . a" being processed, while every return call
leads to a slot of the form "X ::= aY.b" being processed,
where Y
is some nonterminal. GLL uses descriptors, containing
a slot of one of these forms, to uniquely identify the computation that
processes the slot. The descriptor therefore also needs to contain
the initial values of the local variables used in that computation.
A generated GLL parser (Scott and Johnstone 2013) has a code fragment for
every nonterminal X
(labelled L_X
) and slot (labelled "L_{X ::= a.b}").
This Haskell implementation abstracts over the grammar and has a function for
executing L_X
, for a given X
, and a function for executing
"L_{X ::= a.b}", for a given "X ::= a.b".
Generalisation
GLL parsing generalises RD parsing by allowing non-determinism:
when processing "X ::= a.Yb", all productions of Y
, that pass
the lookahead test, are considered. A slot is considered, by adding a
descriptor for it to the worklist R
.
Duplicates in the worklist are avoided by maintaining a separate descriptor-set
U
containing all descriptors added to the worklist before.
The result of a parse function f_X
is no longer a single right extent r
.
Instead, it is a list of right extents rs
, indicating that X
derives
s_l_r
for all r
in rs
and integer input l
(left extent).
Every discovered right extent is stored in the pop-set P
.
When a descriptors for a function call is a duplicate, it is not added to the
worklist, but we have to make sure that the corresponding
return call is still made. Note that a function call to f_Y
, with
the same parameters, can be made from multiple right-hand side occurrences
of Y
. It might be the case that:
- The original descriptors is still being processed. Once finished, a descriptor must be added for all return calls corresponding to function calls that lead to duplicates of this descriptor. GLL uses a Graph-Structured Stack (GSS) to efficiently maintain multiple such continuations.
- The original descriptors has already been processed. In this
case, one or more right extents
rs
are stored inP
for the corresponding function call. A descriptor for the return call must be added for allr
inrs
. The descriptor for the return call must be added to the GSS in this case as well, as other right extents might be found in the future.
Usage
This module provides generalised parsing to other applications that work with BNF grammars.
The user should provide a Grammar
and an input string as arguments
to top-level functions parse
or parseWithOptions
.
Example
This example shows simple character level parsing.
First we must make Char
and instance of Parseable
.
instance Parseable Char where
eos = '$'
eps = #
This instance mandates that '$' and #
are 'reserved tokens'
and not part of the input string. This instance is available as an import:
GLL.Parseable.Char.
GLL.Parser exports smart constructors for constructing Grammar
s.
grammar1 = (start "X" , [prod "X" [nterm "A", nterm "A"] , prod "A" [term 'a'] , prod "A" [term 'a', term 'a'] ] ) fail1 = "a" success1 = "aa" success2 = "aaa" fail2 = "aaaaa"
Note that there are two possible derivations of success2
.
The parser can be accessed through parse
or parseWithOptions
.
run1 = parse grammar1 success1 run2 = parseWithOptions [fullSPPF, strictBinarisation] grammar1 success2
The options fullSPPF
, allNodes
, packedNodesOnly
, decide whether all SPPF nodes and
edges are inserted into the resulting value of the SPPF
type.
Packed nodes are enough to fully represent an SPPF, as the parent and children
of a packed node can be computed from the packed nodes' information.
For efficiency the SPPF
is not strictly binarised by default: a packed
node might have a symbol node as a left child. In a strictly binarised SPPF
a packed node has an intermediate node as a left child, or no left child at all.
To create a strictly binarised SPPF
(necessary for GLL.Combinators) the option
strictBinarisation
is available.
Combinator interface
Module GLL.Combinators.Interface provides a combinator interface to access
GLL.Parser. Applicative-like combinators are used to specify a Grammar
and
call parse
. The SPPF
is then used to produce semantic results.
Synopsis
- type Grammar t = (Nt, Prods t)
- type Prods t = [Prod t]
- data Prod t = Prod Nt (Symbols t)
- type Symbols t = [Symbol t]
- data Symbol t
- data Slot t = Slot Nt [Symbol t] [Symbol t]
- start :: String -> Nt
- prod :: String -> Symbols t -> Prod t
- nterm :: String -> Symbol t
- term :: t -> Symbol t
- class (Ord a, Eq a, Show a) => Parseable a where
- type Input t = Array Int t
- mkInput :: Parseable t => [t] -> Input t
- parse :: Parseable t => Grammar t -> [t] -> ParseResult t
- parseArray :: Parseable t => Grammar t -> Input t -> ParseResult t
- parseWithOptions :: Parseable t => ParseOptions -> Grammar t -> [t] -> ParseResult t
- parseWithOptionsArray :: Parseable t => ParseOptions -> Grammar t -> Input t -> ParseResult t
- type ParseOptions = [ParseOption]
- type ParseOption = Flags -> Flags
- strictBinarisation :: ParseOption
- fullSPPF :: ParseOption
- allNodes :: ParseOption
- packedNodesOnly :: ParseOption
- maximumErrors :: Int -> ParseOption
- noSelectTest :: ParseOption
- data ParseResult t = ParseResult {}
- type SPPF t = (SymbMap t, ImdMap t, PackMap t, EdgeMap t)
- data SPPFNode t
- type SymbMap t = IntMap (IntMap (Set (Symbol t)))
- type ImdMap t = IntMap (IntMap (Set (Slot t)))
- type PackMap t = IntMap (IntMap (IntMap (Map (Prod t) IntSet)))
- type EdgeMap t = Map (SPPFNode t) (Set (SPPFNode t))
- showSPPF :: Show t => SPPF t -> String
Grammar
A production binds a nonterminal identifier (left-hand side) to a list of symbols (the right-hand side of the production).
A Symbol
is either a nonterminal or a terminal,
where a terminal contains some arbitrary token.
A grammar slot acts as a label to identify progress of matching a production. As such, a slot is a Prod with its right-hand side split in two: a part before and a part after 'the dot'. The dot indicates which part of the right-hand side has been processed thus far.
Smart constructors for creating Grammar
s
Parseable tokens
class (Ord a, Eq a, Show a) => Parseable a where Source #
Class that captures elements of an input string (tokens).
Both eos
and eps
must be distinct from eachother and from all
tokens in the input string.
The show instance is required to throw error messages.
matches :: a -> a -> Bool Source #
This function is used for matching grammar tokens and input tokens. Override this method if, for example, your input tokens store lexemes while the grammar tokens do not
This function pretty-prints the Parseable type by displaying its lexeme.
Default implementation is show
, which should be replaced for prettier error messages.
Run the GLL parser
parseArray :: Parseable t => Grammar t -> Input t -> ParseResult t Source #
Run the GLL parser with options
parseWithOptions :: Parseable t => ParseOptions -> Grammar t -> [t] -> ParseResult t Source #
Variant of parseWithOptionsArray
where the input is a list of Parseable
s rather than an Array
parseWithOptionsArray :: Parseable t => ParseOptions -> Grammar t -> Input t -> ParseResult t Source #
ParseOptions
type ParseOptions = [ParseOption] Source #
A list of ParserOption
s
strictBinarisation :: ParseOption Source #
Fully binarise the SPPF, resulting in a larger SPPF
and possibly slower runtimes.
When this flag is on, packed nodes can only have a single symbol node child
or one intermediate node child and one symbol node child.
With the flag disabled a packed node can have two symbol node children.
fullSPPF :: ParseOption Source #
Create the SPPF
with all nodes and edges, not necessarily strictly binarised.
allNodes :: ParseOption Source #
Create all nodes, but no edges between nodes.
packedNodesOnly :: ParseOption Source #
Create packed-nodes only.
maximumErrors :: Int -> ParseOption Source #
Set the maximum number of errors shown in case of an unsuccessful parse.
noSelectTest :: ParseOption Source #
Turn of select tests. Disables lookahead.
Result
data ParseResult t Source #
The ParseResult datatype contains the SPPF and some other information about the parse:
SPPF
- Whether the parse was successful
- The number of descriptors that have been processed
- The number of symbol nodes (nonterminal and terminal)
- The number of intermediate noes
- The number of packed nodes
- The number of GSS nodes
- The number of GSS edges
ParseResult | |
|
Instances
Show (ParseResult t) Source # | |
Defined in GLL.Parser showsPrec :: Int -> ParseResult t -> ShowS # show :: ParseResult t -> String # showList :: [ParseResult t] -> ShowS # |
An SPPFNode is either a symbol node, an intermediate node, a packed node or a dummy.
Instances
Eq t => Eq (SPPFNode t) Source # | |
Ord t => Ord (SPPFNode t) Source # | |
Show t => Show (SPPFNode t) # | |
type SymbMap t = IntMap (IntMap (Set (Symbol t))) Source #
Stores symbol nodes using nested Data.IntMaps, nesting is as follows:
- left extent
- right extent
- set of symbols
type ImdMap t = IntMap (IntMap (Set (Slot t))) Source #
Stores intermediate nodes using nested Data.IntMaps, nesting is as follows:
- left extent
- right extent
- set of slots
type PackMap t = IntMap (IntMap (IntMap (Map (Prod t) IntSet))) Source #
Stores packed nodes using nested Data.IntMaps, nesting is as follows:
- left extent
- right extent
- dot position (from left to right)
- mapping from productions to set of pivots