gll-0.4.0.5: GLL parser with simple combinator interface

GLL.Parser

Description

Implementation of the GLL parsing algorithm [Scott and Johnstone 2010,2013,2016] with the grammar as an explicit parameter.

Function parse receives a Grammar as input together with a list of tokens (the input string).

The type of token is chosen arbitrarily, but the type should be Parseable and Orderable. To be Parseable a type must have two distinct values, eos (end-of-string) and eps (epsilon). The user must ensure that these two values will never occur as part of the input string.

## GLL Parsing

### Recursive Descent

GLL parsing is a generalisation of recursive descent parsing (RD parsing). A RD parser (RDP), for some grammar G , consists of a set of parse functions f_X, one for every nonterminal X, and a main function which calls f_S, where S is the start symbol. The parse function f_X take an integer l as an argument and produces an integer r, indicating that nonterminal X derives s_l_r, where s_i_j is the substring of the input string s ranging from i to j. We call l and r the right- and left-extent, respectively.

The parse function f_X has a branch for every production X ::= s_1 ... s_k in G, guarded by a look-ahead test, and every branch has k code fragments, one for every symbol s_i, with 1 <= i <= k. A RDP matches grammar positions, represented by slots of the form X ::= a . b, with (input) string positions. The dot in a slot tells us how much of the production's symbols have been matched (the symbols before the dot) and which symbols still need to be matched (the symbols after the dot). The symbol immediately after the dot is the next symbol to be match and is either:

• A terminal token, matched directly with the token at the current string position.
• A nonterminal Y, for which f_Y is called. In the case of LL(1) deterministic parsing, only one (or none) of the productions of Y passes the lookahead-test, say "Y ::= c", and its branch will be executed: the next grammar position is "Y ::= .c".
• No further symbol, represented by "X ::= d." (all symbols have been processed). In this case a return call is made to the caller of f_X (relying on a function call stack).

### Handling function/return calls

GLL handles its own function calls and return calls, instead of relying on an underlying mechanism. This form of low-level control allows GLL to avoid much duplicate work, not only for function calls (as in classical memoisation) but also for return calls. The underlying observation is that both return calls and function calls continue matching grammar slots. In non-deterministic RDP, every function call leads to a slot of the form "X ::= . a" being processed, while every return call leads to a slot of the form "X ::= aY.b" being processed, where Y is some nonterminal. GLL uses descriptors, containing a slot of one of these forms, to uniquely identify the computation that processes the slot. The descriptor therefore also needs to contain the initial values of the local variables used in that computation.

A generated GLL parser (Scott and Johnstone 2013) has a code fragment for every nonterminal X (labelled L_X) and slot (labelled "L_{X ::= a.b}"). This Haskell implementation abstracts over the grammar and has a function for executing L_X, for a given X, and a function for executing "L_{X ::= a.b}", for a given "X ::= a.b".

### Generalisation

GLL parsing generalises RD parsing by allowing non-determinism: when processing "X ::= a.Yb", all productions of Y, that pass the lookahead test, are considered. A slot is considered, by adding a descriptor for it to the worklist R. Duplicates in the worklist are avoided by maintaining a separate descriptor-set U containing all descriptors added to the worklist before.

The result of a parse function f_X is no longer a single right extent r. Instead, it is a list of right extents rs, indicating that X derives s_l_r for all r in rs and integer input l (left extent). Every discovered right extent is stored in the pop-set P.

When a descriptors for a function call is a duplicate, it is not added to the worklist, but we have to make sure that the corresponding return call is still made. Note that a function call to f_Y, with the same parameters, can be made from multiple right-hand side occurrences of Y. It might be the case that:

• The original descriptors is still being processed. Once finished, a descriptor must be added for all return calls corresponding to function calls that lead to duplicates of this descriptor. GLL uses a Graph-Structured Stack (GSS) to efficiently maintain multiple such continuations.
• The original descriptors has already been processed. In this case, one or more right extents rs are stored in P for the corresponding function call. A descriptor for the return call must be added for all r in rs. The descriptor for the return call must be added to the GSS in this case as well, as other right extents might be found in the future.

## Usage

This module provides generalised parsing to other applications that work with BNF grammars.

The user should provide a Grammar and an input string as arguments to top-level functions parse or parseWithOptions.

### Example

This example shows simple character level parsing. First we must make Char and instance of Parseable.

instance Parseable Char where
eos = '$' eps = #  This instance mandates that '$' and # are 'reserved tokens' and not part of the input string. This instance is available as an import: GLL.Parseable.Char.

GLL.Parser exports smart constructors for constructing Grammars.

grammar1 = (start "X" , [prod "X" [nterm "A", nterm "A"]
, prod "A" [term 'a']
, prod "A" [term 'a', term 'a']
] )

fail1       = "a"
success1    = "aa"
success2    = "aaa"
fail2       = "aaaaa"


Note that there are two possible derivations of success2.

The parser can be accessed through parse or parseWithOptions.

run1 = parse grammar1 success1
run2 = parseWithOptions [fullSPPF, strictBinarisation] grammar1 success2


The options fullSPPF, allNodes, packedNodesOnly, decide whether all SPPF nodes and edges are inserted into the resulting value of the SPPF type. Packed nodes are enough to fully represent an SPPF, as the parent and children of a packed node can be computed from the packed nodes' information. For efficiency the SPPF is not strictly binarised by default: a packed node might have a symbol node as a left child. In a strictly binarised SPPF a packed node has an intermediate node as a left child, or no left child at all. To create a strictly binarised SPPF (necessary for GLL.Combinators) the option strictBinarisation is available.

### Combinator interface

Module GLL.Combinators.Interface provides a combinator interface to access GLL.Parser. Applicative-like combinators are used to specify a Grammar and call parse. The SPPF is then used to produce semantic results.

Synopsis

# Grammar

type Grammar t = (Nt, Prods t) Source #

A grammar is a start symbol and a list of productions.

type Prods t = [Prod t] Source #

A list of Prods.

data Prod t Source #

A production binds a nonterminal identifier (left-hand side) to a list of symbols (the right-hand side of the production).

Constructors

 Prod Nt (Symbols t)

Instances

 Eq t => Eq (Prod t) Source # Methods(==) :: Prod t -> Prod t -> Bool #(/=) :: Prod t -> Prod t -> Bool # Ord t => Ord (Prod t) Source # Methodscompare :: Prod t -> Prod t -> Ordering #(<) :: Prod t -> Prod t -> Bool #(<=) :: Prod t -> Prod t -> Bool #(>) :: Prod t -> Prod t -> Bool #(>=) :: Prod t -> Prod t -> Bool #max :: Prod t -> Prod t -> Prod t #min :: Prod t -> Prod t -> Prod t # Show t => Show (Prod t) Source # MethodsshowsPrec :: Int -> Prod t -> ShowS #show :: Prod t -> String #showList :: [Prod t] -> ShowS #

type Symbols t = [Symbol t] Source #

A list of Symbols

data Symbol t Source #

A Symbol is either a nonterminal or a terminal, where a terminal contains some arbitrary token.

Constructors

 Nt Nt Term t

Instances

 Eq t => Eq (Symbol t) Source # Methods(==) :: Symbol t -> Symbol t -> Bool #(/=) :: Symbol t -> Symbol t -> Bool # Ord t => Ord (Symbol t) Source # Methodscompare :: Symbol t -> Symbol t -> Ordering #(<) :: Symbol t -> Symbol t -> Bool #(<=) :: Symbol t -> Symbol t -> Bool #(>) :: Symbol t -> Symbol t -> Bool #(>=) :: Symbol t -> Symbol t -> Bool #max :: Symbol t -> Symbol t -> Symbol t #min :: Symbol t -> Symbol t -> Symbol t # Show t => Show (Symbol t) Source # MethodsshowsPrec :: Int -> Symbol t -> ShowS #show :: Symbol t -> String #showList :: [Symbol t] -> ShowS #

data Slot t Source #

A grammar slot acts as a label to identify progress of matching a production. As such, a slot is a Prod with its right-hand side split in two: a part before and a part after 'the dot'. The dot indicates which part of the right-hand side has been processed thus far.

Constructors

 Slot Nt [Symbol t] [Symbol t]

Instances

 Eq t => Eq (Slot t) Source # Methods(==) :: Slot t -> Slot t -> Bool #(/=) :: Slot t -> Slot t -> Bool # Ord t => Ord (Slot t) Source # Methodscompare :: Slot t -> Slot t -> Ordering #(<) :: Slot t -> Slot t -> Bool #(<=) :: Slot t -> Slot t -> Bool #(>) :: Slot t -> Slot t -> Bool #(>=) :: Slot t -> Slot t -> Bool #max :: Slot t -> Slot t -> Slot t #min :: Slot t -> Slot t -> Slot t # Show t => Show (Slot t) Source # MethodsshowsPrec :: Int -> Slot t -> ShowS #show :: Slot t -> String #showList :: [Slot t] -> ShowS #

## Smart constructors for creating Grammars

A smart constructor for creating a start Nt (nonterminal).

prod :: String -> Symbols t -> Prod t Source #

A smart constructor for creating a Prod (production).

A smart constructor for creating a nonterminal Symbol.

term :: t -> Symbol t Source #

A smart constructor for creating a terminal Symbol.

## Parseable tokens

class (Ord a, Eq a, Show a) => Parseable a where Source #

Class that captures elements of an input string (tokens).

• eos is the end-of-string symbol
• eps is the empty-string symbol

Both eos and eps must be distinct from eachother and from all tokens in the input string. The show instance is required to throw error messages.

Minimal complete definition

Methods

eos :: a Source #

eps :: a Source #

matches :: a -> a -> Bool Source #

This function is used for matching grammar tokens and input tokens. Override this method if, for example, your input tokens store lexemes while the grammar tokens do not

unlex :: a -> String Source #

This function pretty-prints the Parseable type by displaying its lexeme. Default implementation is show, which should be replaced for prettier error messages.

Instances

 Source # Methods

# Run the GLL parser

parse :: Parseable t => Grammar t -> [t] -> ParseResult t Source #

Run the GLL parser given a Grammar t and a list of ts, where t is an arbitrary token-type. All token-types must be Parseable.

## Run the GLL parser with options

parseWithOptions :: Parseable t => ParseOptions -> Grammar t -> [t] -> ParseResult t Source #

Run the GLL parser given some options, a Grammar t and a list of ts.

If no options are given a minimal SPPF will be created:

• only packed nodes are created
• the resulting SPPF is not strictly binarised

### ParseOptions

type ParseOptions = [ParseOption] Source #

A list of ParserOptions

type ParseOption = Flags -> Flags Source #

An option updates the current set of Flags.

Fully binarise the SPPF, resulting in a larger SPPF and possibly slower runtimes. When this flag is on, packed nodes can only have a single symbol node child or one intermediate node child and one symbol node child. With the flag disabled a packed node can have two symbol node children.

Create the SPPF with all nodes and edges, not necessarily strictly binarised.

Create all nodes, but no edges between nodes.

Create packed-nodes only.

Set the maximum number of errors shown in case of an unsuccessful parse.

Turn of select tests. Disables lookahead.

## Result

data ParseResult t Source #

The ParseResult datatype contains the SPPF and some other information about the parse:

• SPPF
• Whether the parse was successful
• The number of descriptors that have been processed
• The number of symbol nodes (nonterminal and terminal)
• The number of intermediate noes
• The number of packed nodes
• The number of GSS nodes
• The number of GSS edges

Constructors

 ParseResult Fields

Instances

 Source # MethodsshowsPrec :: Int -> ParseResult t -> ShowS #show :: ParseResult t -> String #showList :: [ParseResult t] -> ShowS #

type SPPF t = (SymbMap t, ImdMap t, PackMap t, EdgeMap t) Source #

An SPPF contains symbol nodes, intermediate nodes, packed nodes and edges between them. See Scott and Johnstone (2013) for an explanation of the SPPF.

data SPPFNode t Source #

An SPPFNode is either a symbol node, an intermediate node, a packed node or a dummy.

Constructors

 SNode (Symbol t, Int, Int) INode (Slot t, Int, Int) PNode (Slot t, Int, Int, Int) Dummy

Instances

 Eq t => Eq (SPPFNode t) Source # Methods(==) :: SPPFNode t -> SPPFNode t -> Bool #(/=) :: SPPFNode t -> SPPFNode t -> Bool # Ord t => Ord (SPPFNode t) Source # Methodscompare :: SPPFNode t -> SPPFNode t -> Ordering #(<) :: SPPFNode t -> SPPFNode t -> Bool #(<=) :: SPPFNode t -> SPPFNode t -> Bool #(>) :: SPPFNode t -> SPPFNode t -> Bool #(>=) :: SPPFNode t -> SPPFNode t -> Bool #max :: SPPFNode t -> SPPFNode t -> SPPFNode t #min :: SPPFNode t -> SPPFNode t -> SPPFNode t #

type SymbMap t = IntMap (IntMap (Set (Symbol t))) Source #

Stores symbol nodes using nested Data.IntMaps, nesting is as follows:

• left extent
• right extent
• set of symbols

type ImdMap t = IntMap (IntMap (Set (Slot t))) Source #

Stores intermediate nodes using nested Data.IntMaps, nesting is as follows:

• left extent
• right extent
• set of slots

type PackMap t = IntMap (IntMap (IntMap (Map (Prod t) IntSet))) Source #

Stores packed nodes using nested Data.IntMaps, nesting is as follows:

• left extent
• right extent
• dot position (from left to right)
• mapping from productions to set of pivots

type EdgeMap t = Map (SPPFNode t) (Set (SPPFNode t)) Source #

Stores edges, potentially costly.

# Orphan instances

 Show t => Show (SPPFNode t) Source # MethodsshowsPrec :: Int -> SPPFNode t -> ShowS #show :: SPPFNode t -> String #showList :: [SPPFNode t] -> ShowS #