Safe Haskell: Safe-Inferred
The usual way to use Parsec, at least as far as the tutorials are concerned, is to do scannerless parsing. That is, the parser operates directly on input characters without first grouping them into tokens. Scannerless parsing certainly has its advantages, particularly for casual parsing, but it comes with disadvantages too, especially as the complexity of the grammar increases.
Oftentimes, the complexity of a grammar can be isolated into contiguous portions of the input called lexemes (also known as tokens). In these cases, it is beneficial on many fronts to introduce a lexer (also called a tokenizer or scanner) before parsing. The lexer and parser then stand in a producer-consumer relationship.
This module implements the lexer-parser pattern on top of Parsec. Thus, you can take advantage of a familiar and battle-tested library to implement both your lexer and parser.
The benefits of isolating complexity into the lexer are:
- Improved reliability: because Parsec does not backtrack by default, it is easy to accidentally forget a try combinator and introduce bugs that are difficult to detect and diagnose. Isolating complexity into the lexemes drastically reduces the scope of debugging: from the entire grammar down to only the part of the lexer responsible for a single kind of token.
- Increased performance: backtracking can be confined to the lexer. This means that backtracking has limited scope, and the parser can take advantage of no-lookahead parsing algorithms.
- Enhanced error reports: in many places where a scannerless parser might report an unexpected character, a lexer-parser will naturally report an entire token. The added context is much easier to use for debugging.
The most important definitions for understanding this module are Lexeme, runLuthorT, lexeme, and satisfy. The general structure of a Luthor-based parser is: 1) create a data type for Lexeme payloads, 2) write your lexers and wrap them all in lexeme, 3) create parsers for each case of payload using unlex and satisfy, 4) write your parsers and connect them with your lexers using runLuthorT. Sketches of each step appear in the sections below.
- type LexT s u m t = ParsecT s u m (Lexeme t)
- type LuthorT t = ParsecT [Lexeme t]
- data Lexeme a
- runLuthorT :: (Monad m, Stream s m x, Stream [Lexeme t] m y, Show x) => LexT s u m t -> LuthorT t u m a -> u -> SourceName -> s -> m (Either ParseError a)
- type Lex s u a = LexT s u Identity a
- type Luthor s u = LuthorT s u Identity
- runLuthor :: (Stream s Identity x, Stream [Lexeme t] Identity y, Show x) => Lex s u t -> Luthor t u a -> u -> SourceName -> s -> Either ParseError a
- lexeme :: Monad m => ParsecT s u m a -> ParsecT s u m (Lexeme a)
- ignore :: Monad m => (t -> Bool) -> LuthorT t u m ()
- unlex :: (Show a, Stream [Lexeme a] m (Lexeme a)) => LuthorT a u m a
- unlexWith :: (Show a, Stream [Lexeme a] m (Lexeme a)) => (a -> Maybe b) -> LuthorT a u m b
- satisfy :: (Show a, Stream [Lexeme a] m (Lexeme a)) => (a -> Bool) -> LuthorT a u m a
- endOfLexemes :: (Show a, Stream [Lexeme a] m (Lexeme a)) => LuthorT a u m ()
- isAtEnd :: Monad m => LuthorT a u m Bool
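As a running illustration of the four steps above, consider a tiny arithmetic language. The following sketch is hypothetical (the Tok name and its cases are ours, not this module's) and assumes both this module and Text.Parsec are imported; it covers step 1, a data type for Lexeme payloads:

```haskell
-- Hypothetical payload type (step 1): one constructor per kind of token
-- in a tiny arithmetic language.
data Tok
    = TNum Integer   -- a numeric literal
    | TPlus          -- the '+' operator
    | TSpace         -- a run of whitespace
    deriving (Show, Eq)
```

Eq and Show are derived because the combinators below need them: predicates such as (== TSpace) compare payloads, and a Show constraint appears in this module's signatures for unlex and satisfy.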
Basic Concepts
runLuthorT
    :: (Monad m, Stream s m x, Stream [Lexeme t] m y, Show x)
    => LexT s u m t             -- lexer: transform raw input stream into many tokens
    -> LuthorT t u m a          -- parser: transform token stream into parsed data
    -> u                        -- initial user state
    -> SourceName               -- name of the input source, used in error messages
    -> s                        -- raw input stream
    -> m (Either ParseError a)
Connect and run a lexer and parser together.
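For example, continuing the hypothetical Tok sketch from above, the lexer and parser could be connected as follows. This is a sketch: oneToken and expr are our hypothetical names, defined under lexeme and satisfy below, and ParseError comes from Text.Parsec.

```haskell
-- Connect the lexer and parser (step 4). Whitespace tokens are dropped
-- before the grammar runs; "<example>" is just a source name used in
-- error messages.
parseArith :: Monad m => String -> m (Either ParseError Integer)
parseArith = runLuthorT oneToken (ignore (== TSpace) *> expr) () "<example>"
```

Under these assumptions, running parseArith on "1 + 2 + 3" in the Identity monad should yield Right 6.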
Shortcuts
runLuthor
    :: (Stream s Identity x, Stream [Lexeme t] Identity y, Show x)
    => Lex s u t                -- lexer: transform raw input stream into many tokens
    -> Luthor t u a             -- parser: transform token stream into parsed data
    -> u                        -- initial user state
    -> SourceName               -- name of the input source, used in error messages
    -> s                        -- raw input stream
    -> Either ParseError a
As runLuthorT in the Identity monad.
Produce Lexemes
lexeme :: Monad m => ParsecT s u m a -> ParsecT s u m (Lexeme a)
Wrap a normal parser into a parser for a Lexeme.
The passed parser determines the payload. Use this function
at the boundary of the lexer and parser: when a complete lexeme
has been recognized, but parsing has not been performed.
On failure, the resulting parser does not consume input, even if the passed parser would.
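Continuing the hypothetical Tok example, step 2 might look like the following sketch; choice, many1, digit, char, and oneOf come from Text.Parsec.

```haskell
-- A lexer for a single token (step 2): recognize one lexeme's worth of
-- characters and wrap the resulting payload with 'lexeme'.
oneToken :: Monad m => LexT String u m Tok
oneToken = lexeme $ choice
    [ TNum . read <$> many1 digit      -- 'read' is safe here: digits only
    , TPlus <$ char '+'
    , TSpace <$ many1 (oneOf " \t\n")
    ]
```

No try is needed inside the choice: each alternative fails without consuming input when its first character does not match.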
ignore :: Monad m => (t -> Bool) -> LuthorT t u m ()
Drop lexemes from the input stream whose payloads satisfy the passed predicate. This is useful, for example, in whitespace-insensitive languages: extraneous whitespace can be removed up front so it need not clutter the parser definition.
WARNING: use only as the first action in a parsing monad.
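For instance, with the hypothetical Tok payloads from above, a parser can begin by stripping whitespace tokens so that no later rule has to mention them. A minimal sketch, with a deliberately trivial body (many comes from Text.Parsec, unlex is documented below):

```haskell
-- Strip whitespace lexemes as the very first action, per the warning
-- above, then count the remaining tokens.
countTokens :: Monad m => LuthorT Tok u m Int
countTokens = do
    ignore (== TSpace)
    length <$> many unlex
```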
Consume Lexemes
unlex :: (Show a, Stream [Lexeme a] m (Lexeme a)) => LuthorT a u m a
Unpacks a payload from the Lexeme stream. Fails only at the end of the lexeme stream.
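For example, continuing the Tok sketch, a parser that reads the next two payloads regardless of what they are:

```haskell
-- Grab the next two payloads from the lexeme stream, whatever they are.
nextTwo :: Monad m => LuthorT Tok u m (Tok, Tok)
nextTwo = (,) <$> unlex <*> unlex
```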
satisfy :: (Show a, Stream [Lexeme a] m (Lexeme a)) => (a -> Bool) -> LuthorT a u m a
Obtain a lexeme from the stream only if it satisfies the passed predicate.
This is a very important function for building parsers. Generally, after defining your data type for Lexeme payloads, you will implement several combinators for recognizing each case of your data type.
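Continuing the hypothetical Tok example, step 3's per-case combinators, and a small grammar built from them, might look like the following sketch; unlexWith is the Maybe-based variant from this module, and sepBy1 comes from Text.Parsec.

```haskell
-- One combinator per case of the payload type (step 3).
plus :: Monad m => LuthorT Tok u m Tok
plus = satisfy (== TPlus)

number :: Monad m => LuthorT Tok u m Integer
number = unlexWith $ \t -> case t of
    TNum n -> Just n
    _      -> Nothing

-- A tiny grammar over the token stream: sums like 1 + 2 + 3
-- (assuming whitespace lexemes were already dropped via 'ignore').
expr :: Monad m => LuthorT Tok u m Integer
expr = sum <$> number `sepBy1` plus
```

Note the division of labor: satisfy merely recognizes a fixed token, while unlexWith both recognizes a case and extracts the data carried by its payload.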