Safe Haskell | None |
---|---|
Language | Haskell98 |
This module provides a combinator library in which combinators act as an interface to a back-end parser. This technique is inspired by Tom Ridge (2014) and Peter Ljunglöf (2002). The library is fully general: it enables writing parsers for arbitrary context-free grammars.
The user writes a combinator expression representing a grammar. The represented grammar is extracted and given, together with an input string, to a back-end parser. The derivations constructed by the parser act as a guide in the "semantic phase" in which the combinator expressions are evaluated to produce semantic results for all derivations. Infinitely many derivations would result in a loop. This problem is avoided by discarding the derivations that would arise from such a loop.
This library uses GLL.Parser as the back-end parser and provides
Control.Applicative-like parser combinators: <**>
for sequencing, <||>
for
choice, <$$>
for application, satisfy
instead of pure and derived
combinators <**
, **>
, <$$
, many
, some
and optional
.
The semantic phase might benefit from memoisation which is provided in a separate (impure) library GLL.Combinators.MemInterface.
Example usage
This library differs from parser combinator libraries in that combinator expressions are used to describe a grammar rather than a parser.
A rule is considered to be of the form X ::= a | .. | z, and represented by the combinator expression.
pX = X<::=>
altA<||>
...<||>
altZ
Alternates (a
... z
) start with the application of
a semantic action using <$$>
(or variants <$$
and satisfy
).
The alternate is extended with <**>
(or variants **>
, <**
).
altA = action1<$$>
keychar
'a'<**>
pX altZ = action2<$$>
keychar
'z'
Usability is improved by automatic lifting between expressions that represent symbols
and alternates. The main difference with Control.Applicative style parser combinator
libraries is that the user must use <:=>
(or <::=>
) to represent all recursive
nonterminals and must use <::=>
to represent all nonterminals that potentially lead
to an infinite number of derivations. It is, however, possible to represent left-recursive
nonterminals.
Example
In this example we define expressions for parsing (and evaluating) simple arithmetic expressions for single digit integers.
The library is capable of parsing arbitrary token types that are Parseable
, orderable
and have a Show
instance.
This example demonstrates the usage of the builtin Token
datatype and uses the
elementary parsers keychar
and int_lit
to create parsers for character- and integer-terminals.
We define a very simpler lexer first.
lexer :: String -> [Token] lexer [] = [] lexer (x:xs) | isDigit x =IntLit
(Just (read [x])) : lexer xs | otherwise =Char
x : lexer xs
Note that the Char constructor of the Token
type is used for character-level parsing.
Char contains no lexeme, unlike the Int constructor.
Consider the following (highly ambiguous and left-recursive) grammar:
Expr ::= Expr '-' Expr | Expr '+' Expr | Expr '*' Expr | Expr '/' Expr | INT | '(' Expr ')'
The grammar is translated to the following combinator expression, adding the expected evaluation functions as semantic actions.
pExpr :: Grammar Token Int pExpr = "Expr" ::= (-)<$$>
pExpr<**
keychar
'-'<**>
pExpr<||>
(+)<$$>
pExpr<**
keychar
'+'<**>
pExpr<||>
(*)<$$>
pExpr<**
keychar
'*'<**>
pExpr<||>
div<$$>
pExpr<**
keychar
'/'<**>
pExpr<||>
int_lit
<||>
parens pExpr
Note that <**
is used to ignore the parse result of the second argument and that **>
is used to ignore the parse result of the first argument. These combinators
help us to define the derived combinators within and parens.
within ::BNF
Token
a ->BNF
Token
b ->BNF
Token
a ->BNF
Token
b within l p r =mkRule
$ l**>
p<**
r parens ::BNF
Token
a ->BNF
Token
a parens p = within (keychar
'(') p (keychar
')')
All possible evaluations are obtained by invoking the parse
function.
run1 =parse
pExpr (lexer "1+2*2-5") -- [0,1,0,-5,-9] run2 =parse
pExpr (lexer "((1+(2*2))-3)-5") -- [-3]
With every introduction of an operator +
, -
, *
or /
the number of ambiguities is
multiplied. The number of ambiguities behaves like the sequence https://oeis.org/A000108.
Simple disambiguation
This library offers simple disambiguation strategies that are applied post-parse (the parser still faces the ambiguity, but the semantic evaluation only yields the results according to the strategy). The disambiguations strategies are still in the experimental phase.
We group the operators according to their priorities and use
<::=
to turn the choice operator <||>
into a left-biased operator locally
(use leftBiased
for the same effect globally).
pExpr1 :: Grammar Token Int pExpr1 = "Expr"<::=
( (-)<$$>
pExpr1<**
keychar
'-'<**>
pExpr1<||>
(+)<$$>
pExpr1<**
keychar
'+'<**>
pExpr1 )<||>
( (*)<$$>
pExpr1<**
keychar
'*'<**>
pExpr1<||>
div<$$>
pExpr1<**
keychar
'/'<**>
pExpr1 )<||>
(int_lit
<||>
parens
pExpr1 ) run3 =parseWithOptions
[maximumPivotAtNt
] pExpr1 (lexer "1+2*2-5") -- [0]
The option maximumPivotAtNt
enables the 'longest-match' disambiguation strategy
and makes the arithmetic operators left-associative.
Grammar rewrites
To deal with the particular ambiguities associated with operators we can rewrite the grammar to disambiguate pre-parse.
We define the chainl combinator for parsing chains of left-associative operators.
chainl ::BNF
Token
a ->BNF
Token
(a -> a -> a) ->BNF
Token
a chainl p s =mkRule
$ foldl (flip ($))<$$>
p<**>
many (flip<$$>
s<**>
p)
The expression parser is written with chainl as follows:
pExpr2 :: Grammar Token Int pExpr2 = pE1 where pE1 = chainl pE2 ("E1"<::=>
(+)<$$
keychar
'+'<||>
(-)<$$
keychar
'-') pE2 = chainl pE3 ("E2"<::=>
(*)<$$
keychar
'*'<||>
div<$$
keychar
'/') pE3 = "E3"<::=>
int_lit
<||>
parens pExpr2 run4 =parse
pExpr2
(lexer "1+2*2-5") -- [0]
Pre-parse disambiguation is desirable, as the parsing process could speed up dramatically. In general however, it is not always possible to find the appropriate grammar rewrite and implement it in a high-level combinator such as chainl, motivating the existence of this library.
More simple examples can be found in GLL.Combinators.Test.Interface.
- term_parser :: t -> (t -> a) -> SymbExpr t a
- satisfy :: (Show t, Ord t) => a -> AltExpr t a
- keychar :: SubsumesToken t => Char -> SymbExpr t Char
- keyword :: SubsumesToken t => String -> SymbExpr t String
- int_lit :: SubsumesToken t => SymbExpr t Int
- bool_lit :: SubsumesToken t => SymbExpr t Bool
- string_lit :: SubsumesToken t => SymbExpr t String
- alt_id_lit :: SubsumesToken t => SymbExpr t String
- id_lit :: SubsumesToken t => SymbExpr t String
- token :: SubsumesToken t => String -> SymbExpr t String
- char :: Char -> SymbExpr Char Char
- (<**>) :: (Show t, Ord t, IsAltExpr i, IsSymbExpr s) => i t (a -> b) -> s t a -> AltExpr t b
- (<||>) :: (Show t, Ord t, IsAltExpr i, HasAlts b) => i t a -> b t a -> AltExprs t a
- (<$$>) :: (Show t, Ord t, IsSymbExpr s) => (a -> b) -> s t a -> AltExpr t b
- (<:=>) :: (Show t, Ord t, HasAlts b) => String -> b t a -> SymbExpr t a
- (<::=>) :: (Show t, Ord t, HasAlts b) => String -> b t a -> SymbExpr t a
- type BNF t a = SymbExpr t a
- data SymbExpr t a
- data AltExpr t a
- type AltExprs = OO [] AltExpr
- data Token
- class (Ord a, Eq a, Show a) => Parseable a where
- class SubsumesToken a where
- parse :: (Show t, Parseable t, IsSymbExpr s) => s t a -> [t] -> [a]
- parseWithOptions :: (Show t, Parseable t, IsSymbExpr s) => CombinatorOptions -> s t a -> [t] -> [a]
- type CombinatorOptions = [CombinatorOption]
- type CombinatorOption = PCOptions -> PCOptions
- maximumErrors :: Int -> CombinatorOption
- throwErrors :: CombinatorOption
- fullSPPF :: ParseOption
- allNodes :: ParseOption
- packedNodesOnly :: ParseOption
- strictBinarisation :: ParseOption
- parseWithOptionsAndError :: (Show t, Parseable t, IsSymbExpr s) => CombinatorOptions -> s t a -> [t] -> Either String [a]
- parseResult :: (Show t, Parseable t, IsSymbExpr s) => s t a -> [t] -> ParseResult t
- parseResultWithOptions :: (Show t, Parseable t, IsSymbExpr s) => ParseOptions -> CombinatorOptions -> s t a -> [t] -> ParseResult t
- data ParseResult t = ParseResult {}
- default_lexer :: SubsumesToken t => String -> [t]
- lexer :: SubsumesToken t => LexerSettings -> String -> [t]
- data LexerSettings = LexerSettings {}
- emptyLanguage :: LexerSettings
- mkNt :: (Show t, Ord t, IsSymbExpr s) => s t a -> String -> String
- (<$$) :: (Show t, Ord t, IsSymbExpr s) => b -> s t a -> AltExpr t b
- (**>) :: (Show t, Ord t, IsAltExpr i, IsSymbExpr s) => i t a -> s t b -> AltExpr t b
- (<**) :: (Show t, Ord t, IsAltExpr i, IsSymbExpr s) => i t a -> s t b -> AltExpr t a
- optional :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t (Maybe a)
- multiple :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a]
- multiple1 :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a]
- multipleSepBy :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a]
- multipleSepBy1 :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a]
- (<::=) :: (Ord t, Show t, HasAlts b) => String -> b t a -> SymbExpr t a
- lassoc :: AltExpr t a -> AltExpr t a
- rassoc :: AltExpr t a -> AltExpr t a
- assoc :: AltExpr t a -> AltExpr t a
- many :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a]
- many1 :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a]
- some :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a]
- some1 :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a]
- manySepBy :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a]
- manySepBy1 :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a]
- someSepBy :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a]
- someSepBy1 :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a]
- class HasAlts a where
- class IsSymbExpr a where
- class IsAltExpr a where
- memo :: (Ord t, Show t, IsSymbExpr s) => MemoRef [a] -> s t a -> SymbExpr t a
- newMemoTable :: MemoRef a
- memClear :: MemoRef a -> IO ()
- type MemoTable a = IntMap (IntMap a)
- type MemoRef a = IORef (MemoTable a)
- useMemoisation :: CombinatorOption
Elementary parsers
term_parser :: t -> (t -> a) -> SymbExpr t a Source
satisfy :: (Show t, Ord t) => a -> AltExpr t a Source
The empty right-hand side that yields its first argument as a semantic result.
Elementary parsers using the Token
datatype
keychar :: SubsumesToken t => Char -> SymbExpr t Char Source
Parse a single character, using a SubsumesToken
type.
keyword :: SubsumesToken t => String -> SymbExpr t String Source
Parse a single character, using a SubsumesToken
type.
int_lit :: SubsumesToken t => SymbExpr t Int Source
Parse a single integer, using a SubsumesToken
type.
Returns the lexeme interpreted as an integer.
bool_lit :: SubsumesToken t => SymbExpr t Bool Source
Parse a single Boolean, using a SubsumesToken
type.
Returns the lexeme interpreter as a Boolean.
string_lit :: SubsumesToken t => SymbExpr t String Source
Parse a single String literal, using a SubsumesToken
type.
Returns the lexeme interpreted as a String literal.
alt_id_lit :: SubsumesToken t => SymbExpr t String Source
Parse a single alternative identifier, using a SubsumesToken
type.
Returns the lexeme as a String.
id_lit :: SubsumesToken t => SymbExpr t String Source
Parse a single identifier, using a SubsumesToken
type.
Returns the lexeme as a String.
token :: SubsumesToken t => String -> SymbExpr t String Source
Parse a single arbitrary token, using a SubsumesToken
type.
Returns the lexeme.
Elementary character-level parsers
char :: Char -> SymbExpr Char Char Source
Parse a single character.
char c = term_parser c id
Currently, this is the only character-level combinator exported by this module. Please use token-level combinators for practical parsing. Might change in the future.
Elementary combinators
Sequencing
(<**>) :: (Show t, Ord t, IsAltExpr i, IsSymbExpr s) => i t (a -> b) -> s t a -> AltExpr t b infixl 4 Source
Choice
Semantic actions
(<$$>) :: (Show t, Ord t, IsSymbExpr s) => (a -> b) -> s t a -> AltExpr t b infixl 4 Source
Form an AltExpr
by mapping some semantic action overy the result
of the second argument.
Nonterminal introduction
(<:=>) :: (Show t, Ord t, HasAlts b) => String -> b t a -> SymbExpr t a infixl 2 Source
Form a rule by giving the name of the left-hand side of the new rule. Use this combinator on recursive non-terminals.
(<::=>) :: (Show t, Ord t, HasAlts b) => String -> b t a -> SymbExpr t a infixl 2 Source
Variant of <:=>
for recursive non-terminals that have a potentially infinite
number of derivations for some input string.
A non-terminal yields infinitely many derivations if and only if it is left-recursive and would be left-recursive if all the right-hand sides of the productions of the grammar are reversed.
Types
Grammar (combinator expression) types
type BNF t a = SymbExpr t a Source
A combinator expression representing a BNF-grammar. The terminals of
the grammar are of type t
. When used to parse, the expression yields
semantic results of type a
.
A combinator expression representing an alternative: the right-hand side of a production.
type AltExprs = OO [] AltExpr Source
A list of alternatives represents the right-hand side of a rule.
Parseable token types
A datatype for representing tokens with some builtins and an aribitrary Token constructor. This datatype stores (optional) lexemes.
class (Ord a, Eq a, Show a) => Parseable a where Source
Class that captures elements of an input string (tokens).
Both eos
and eps
must be distinct from eachother and from all
tokens in the input string.
The show instance is required to throw error messages.
class SubsumesToken a where Source
Class whose members are super-types of Token
.
Running a parser
parse :: (Show t, Parseable t, IsSymbExpr s) => s t a -> [t] -> [a] Source
Runs a parser given a string of Parseable
s and returns a list of
semantic results, corresponding to all finitely many derivations.
Running a parser with options
parseWithOptions :: (Show t, Parseable t, IsSymbExpr s) => CombinatorOptions -> s t a -> [t] -> [a] Source
Run the parser with some CombinatorOptions
.
Possible options
type CombinatorOptions = [CombinatorOption] Source
A list of CombinatorOption
s for evaluating combinator expressions.
type CombinatorOption = PCOptions -> PCOptions Source
A single option.
maximumErrors :: Int -> CombinatorOption Source
Set the maximum number of errors shown in case of an unsuccessful parse.
throwErrors :: CombinatorOption Source
If there are no parse results, the default behaviour is to return an empty list. If this option is used, a runtime error will be reported, with debugging information.
Parser options
fullSPPF :: ParseOption Source
Create the SPPF
with all nodes and edges, not necessarily strictly binarised.
allNodes :: ParseOption Source
Create all nodes, but no edges between nodes.
packedNodesOnly :: ParseOption Source
Create packed-nodes only.
strictBinarisation :: ParseOption Source
Fully binarise the SPPF, resulting in a larger SPPF
and possibly slower runtimes.
When this flag is on, packed nodes can only have a single symbol node child
or one intermediate node child and one symbol node child.
With the flag disabled a packed node can have two symbol node children.
Running a parser with options and explicit failure
parseWithOptionsAndError :: (Show t, Parseable t, IsSymbExpr s) => CombinatorOptions -> s t a -> [t] -> Either String [a] Source
Run the parser with some CombinatorOptions
and return either an error or the results.
Any returned results will be a list of length greater than 0.
Runing a parser to obtain ParseResult
.
parseResult :: (Show t, Parseable t, IsSymbExpr s) => s t a -> [t] -> ParseResult t Source
Get the ParseResult
, containing an SPPF
,
produced by parsing the given input with the given parser.
parseResultWithOptions :: (Show t, Parseable t, IsSymbExpr s) => ParseOptions -> CombinatorOptions -> s t a -> [t] -> ParseResult t Source
Get the ParseResult
given some ParseOptions
and CombinatorOptions
.
data ParseResult t Source
The ParseResult datatype contains the SPPF and some other information about the parse:
SPPF
- Whether the parse was successful
- The number of descriptors that have been processed
- The number of symbol nodes (nonterminal and terminal)
- The number of intermediate noes
- The number of packed nodes
- The number of GSS nodes
- The number of GSS edges
ParseResult | |
|
Show (ParseResult t) Source |
Builtin lexers.
default_lexer :: SubsumesToken t => String -> [t] Source
A lexer using the default LexerSettings
.
Lexer settings
lexer :: SubsumesToken t => LexerSettings -> String -> [t] Source
A lexer parameterised by LexerSettings
.
data LexerSettings Source
Settings for changing the behaviour of the builtin lexer lexer
.
Lexers are built using Text.Regex.Applicative.
LexerSettings | |
|
emptyLanguage :: LexerSettings Source
The default LexerSettings
.
Derived combinators
Ignoring semantic results
(<$$) :: (Show t, Ord t, IsSymbExpr s) => b -> s t a -> AltExpr t b infixl 4 Source
Version of <$$>
that ignores the semantic result of its second argument.
(**>) :: (Show t, Ord t, IsAltExpr i, IsSymbExpr s) => i t a -> s t b -> AltExpr t b infixl 4 Source
Version of <**>
that ignores the semantic result of the first argument.
(<**) :: (Show t, Ord t, IsAltExpr i, IsSymbExpr s) => i t a -> s t b -> AltExpr t a infixl 4 Source
Version of <**>
that ignores the semantic result of the second argument.
EBNF patterns
optional :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t (Maybe a) Source
Try to apply a parser once, but proceed if unsuccessful
(yielding Nothing
as a result in that case).
multiple :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a] Source
Try to apply a parser multiple times (0 or more). The results are returned in a list. In the case of ambiguity the largest list is returned.
multiple1 :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a] Source
Try to apply a parser multiple times (1 or more). The results are returned in a list. In the case of ambiguity the largest list is returned.
multipleSepBy :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a] Source
Same as multiple
but with an additional separator.
multipleSepBy1 :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a] Source
Same as multiple1
but with an additional separator.
Disambiguation
(<::=) :: (Ord t, Show t, HasAlts b) => String -> b t a -> SymbExpr t a infixl 2 Source
Version of <::=
that prioritises productions from left-to-right (or top-to-bottom).
lassoc :: AltExpr t a -> AltExpr t a Source
Apply this combinator to an alternative to indicate that the alternative is for a left-associative operator. Used for top-down disambiguation.
rassoc :: AltExpr t a -> AltExpr t a Source
Apply this combinator to an alternative to indicate that the alternative is for a right-associative operator. Used for top-down disambiguation.
assoc :: AltExpr t a -> AltExpr t a Source
Apply this combinator to an alternative to indicate that the alternative is for an associative operator. Used for top-down disambiguation.
many :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a] Source
Try to apply a parser multiple times (0 or more). The results are returned in a list. In the case of ambiguity the largest list is returned.
many1 :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a] Source
Try to apply a parser multiple times (1 or more). The results are returned in a list. In the case of ambiguity the largest list is returned.
some :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a] Source
Try to apply a parser multiple times (0 or more). The results are returned in a list. In the case of ambiguity the largest list is returned.
some1 :: (Show t, Ord t, IsSymbExpr s) => s t a -> SymbExpr t [a] Source
Try to apply a parser multiple times (1 or more). The results are returned in a list. In the case of ambiguity the largest list is returned.
manySepBy :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a] Source
Same as many
but with an additional separator.
manySepBy1 :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a] Source
Same as many1
but with an additional separator.
someSepBy :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a] Source
Same as some1
but with an additional separator.
someSepBy1 :: (Show t, Ord t, IsSymbExpr s, IsSymbExpr s2, IsAltExpr s2) => s t a -> s2 t b -> SymbExpr t [a] Source
Same as some1
but with an additional separator.
Lifting
Class for lifting to AltExprs
.
class IsSymbExpr a where Source
Class for lifting to SymbExpr
.
Memoisation
memo :: (Ord t, Show t, IsSymbExpr s) => MemoRef [a] -> s t a -> SymbExpr t a Source
This function memoises a parser, given:
- A
MemoRef
pointing to a freshMemoTable
, created usingnewMemoTable
. - The
SymbExpr
to memoise.
Use memo
on all parsers that are expected to derive the same
substring multiple times. If the same combinator expression is used
to parse multiple times the MemoRef
needs to be cleared using memClear
.
memo
relies on unsafePerformIO
and is therefore potentially unsafe.
The option useMemoisation
enables memoisation.
It is off by default, even if memo
is used in a combinator expression.
newMemoTable :: MemoRef a Source
Create a reference to a fresh MemoTable
.
type MemoTable a = IntMap (IntMap a) Source
A MemoTable
maps left-extent l to right-extent r to some results a
indicating the the substring ranging from l to r is derived with parse result a.
useMemoisation :: CombinatorOption Source
Whether to use unsafe memoisation to speed up the enumeration of parse results.