luthor-0.0.2: Tools for lexing and utilizing lexemes that integrate with Parsec.

Safe Haskell: Safe
Language: Haskell98

Text.Luthor.Lex

Description

The usual way to use Parsec, at least as far as the tutorials are concerned, is to do scannerless parsing. That is, the parser operates directly on input characters without first grouping those characters into tokens. Scannerless parsing certainly has its advantages, particularly in casual parsing, but it comes with disadvantages too, especially as the complexity of the grammar increases.

Often, the complexity of a grammar can be isolated into contiguous portions of input called lexemes (also known as tokens). In these cases, it is beneficial on many fronts to introduce a lexer (also called a tokenizer or scanner) before the parser. The lexer and parser then stand in a producer-consumer relationship.

This module implements the lexer-parser pattern on top of Parsec. Thus, you can take advantage of a familiar and battle-tested library to implement both your lexer and parser.

The benefits of isolating complexity into the lexer are:

  • More reliability: because Parsec does not backtrack by default, it is easy to accidentally forget a try combinator and introduce bugs that are difficult to detect and diagnose. Isolating complexity into the lexemes drastically reduces the scope of debugging: from the entire grammar down to only the part of the lexer responsible for a single kind of token.
  • Increased performance: backtracking can be isolated to the lexer. This means that backtracking has limited scope, and the parser can take advantage of no-lookahead parsing algorithms.
  • Enhanced error reports: in many places where a scannerless parser would report a single unexpected character, a lexer-parser naturally reports an entire unexpected token. The added context makes errors much easier to diagnose.

The most important definitions for understanding this module are Lexeme, runLuthorT, lexeme, and satisfy. The general structure of a Luthor-based parser is (see the sketch after this list):

  • create a data type for Lexeme payloads,
  • write your lexers and wrap each one in lexeme,
  • create parsers for each case of payload using unlex, unlexWith, and satisfy,
  • write your parsers and connect them with your lexers using runLuthorT.
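
Below is a minimal sketch of that structure for a toy sum-of-integers language. All names here (Tok, parseSum, and so on) are illustrative, not part of this module, and the sketch assumes runLuthor applies the lexer repeatedly until the raw input is exhausted, per its description below.

    import Text.Luthor.Lex
    import Text.Parsec hiding (satisfy)

    -- 1) A data type for Lexeme payloads.
    data Tok = TNum Integer | TPlus
        deriving (Show, Eq)

    -- 2) One lexer per kind of token, wrapped in 'lexeme'.
    tok :: Lex String () Tok
    tok = lexeme $ choice
        [ TNum . read <$> many1 digit
        , TPlus <$ char '+'
        ]

    -- 3) One parser per case of the payload, via 'unlexWith' and 'satisfy'.
    number :: Luthor Tok () Integer
    number = unlexWith $ \t -> case t of
        TNum n -> Just n
        _      -> Nothing

    plus :: Luthor Tok () ()
    plus = () <$ satisfy (== TPlus)

    -- 4) The grammar proper, connected to the lexer by 'runLuthor'.
    sumExpr :: Luthor Tok () Integer
    sumExpr = foldl (+) <$> number <*> many (plus *> number) <* endOfLexemes

    parseSum :: String -> Either ParseError Integer
    parseSum = runLuthor tok sumExpr () "<example>"
    -- parseSum "1+2+3"  ==>  Right 6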

Basic Concepts

type LexT s u m t = ParsecT s u m (Lexeme t)

A lexer: producer of Lexemes

type LuthorT t = ParsecT [Lexeme t]

A parser: consumer of Lexemes

data Lexeme a

The lexer and parser communicate using the Lexeme type.

Lexemes carry position information as well as a payload. Generally, you will want to define a data type for the payload and use satisfy to extract particular cases of payloads for parsing.

runLuthorT
    :: (Monad m, Stream s m x, Stream [Lexeme t] m y, Show x)
    => LexT s u m t             -- ^ lexer: transform raw input stream into many tokens
    -> LuthorT t u m a          -- ^ parser: transform token stream into parsed data
    -> u
    -> SourceName
    -> s
    -> m (Either ParseError a)

Connect and run a lexer and parser together.
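
For example, a sketch in IO with the simplest possible payload, one Char per lexeme (charTok and countTokens are illustrative names; this assumes the lexer is applied repeatedly until the raw input is exhausted, as described above):

    import Text.Luthor.Lex
    import Text.Parsec hiding (satisfy)

    -- Each lexeme carries a single character as its payload.
    charTok :: LexT String () IO Char
    charTok = lexeme anyChar

    -- Lex the input into lexemes, then count them in the parser.
    countTokens :: String -> IO (Either ParseError Int)
    countTokens = runLuthorT charTok (length <$> many unlex) () "<demo>"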

Shortcuts

type Lex s u a = LexT s u Identity a

Synonym for lexers over the Identity monad.

type Luthor s u = LuthorT s u Identity

Synonym for parsers over the Identity monad.

runLuthor
    :: (Stream s Identity x, Stream [Lexeme t] Identity y, Show x)
    => Lex s u t                -- ^ lexer: transform raw input stream into many tokens
    -> Luthor t u a             -- ^ parser: transform token stream into parsed data
    -> u
    -> SourceName
    -> s
    -> Either ParseError a

As runLuthorT in the Identity monad.

Produce Lexemes

lexeme :: Monad m => ParsecT s u m a -> ParsecT s u m (Lexeme a)

Wrap a normal parser into a parser for a Lexeme. The passed parser determines the payload. Use this function at the boundary of the lexer and parser: when a complete lexeme has been recognized, but parsing has not been performed.

On failure, the resulting parser does not consume input, even if the passed parser would.
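
For instance, a sketch of a lexer for integer literals (intTok is an illustrative name):

    import Text.Luthor.Lex
    import Text.Parsec hiding (satisfy)

    -- Recognize a whole integer literal, then wrap it into a Lexeme.
    intTok :: Monad m => LexT String u m Integer
    intTok = lexeme $ read <$> many1 digit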

ignore :: Monad m => (t -> Bool) -> LuthorT t u m ()

Drop lexemes from the input stream whose payloads satisfy the passed predicate. This is useful, for example, in whitespace-insensitive languages: extraneous whitespace can be removed up front so it need not clutter the parser definition.

WARNING: use only as the first action in a parsing monad.
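
A sketch of that whitespace use case (the Tok type and all names below are illustrative):

    import Text.Luthor.Lex
    import Text.Parsec hiding (satisfy)

    data Tok = TWord String | TSpace
        deriving (Show, Eq)

    isSpaceTok :: Tok -> Bool
    isSpaceTok TSpace = True
    isSpaceTok _      = False

    word :: Monad m => LuthorT Tok u m String
    word = unlexWith $ \t -> case t of
        TWord s -> Just s
        _       -> Nothing

    -- 'ignore' runs first, per the warning above; the grammar below
    -- then never has to mention whitespace.
    sentence :: Monad m => LuthorT Tok u m [String]
    sentence = ignore isSpaceTok >> many word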

Consume Lexemes

unlex :: (Show a, Stream [Lexeme a] m (Lexeme a)) => LuthorT a u m a

Unpack the next payload from the Lexeme stream. Fails only at the end of the lexeme stream.

unlexWith :: (Show a, Stream [Lexeme a] m (Lexeme a)) => (a -> Maybe b) -> LuthorT a u m b

Unpack the next payload from the Lexeme stream and attempt to transform it. If the transformation fails (evaluates to Nothing), this parser fails. The number parser in the sketch in the module description uses this pattern.

satisfy :: (Show a, Stream [Lexeme a] m (Lexeme a)) => (a -> Bool) -> LuthorT a u m a

Obtain a lexeme from the stream only if it satisfies the passed predicate.

This is a very important function for building parsers. Generally, after defining your data type for Lexeme payloads, you will implement one small combinator recognizing each case of that data type, as sketched below.
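
A sketch of that pattern (Punct and the three combinators are illustrative):

    import Text.Luthor.Lex

    data Punct = Comma | Semi | Dot
        deriving (Show, Eq)

    -- One recognizer per constructor.
    comma, semi, dot :: Monad m => LuthorT Punct u m Punct
    comma = satisfy (== Comma)
    semi  = satisfy (== Semi)
    dot   = satisfy (== Dot)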

endOfLexemes :: (Show a, Stream [Lexeme a] m (Lexeme a)) => LuthorT a u m ()

Succeed only at the end of the lexeme stream.

isAtEnd :: Monad m => LuthorT a u m Bool

Detect whether the parser is at the end of the lexeme stream without consuming input.
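
For example, a sketch of an explicit loop that repeats a parser until the lexeme stream is exhausted (untilEnd is an illustrative helper, not part of this module):

    import Text.Luthor.Lex

    -- Like 'many', but the termination condition is an explicit
    -- end-of-stream check rather than parse failure.
    untilEnd :: Monad m => LuthorT t u m a -> LuthorT t u m [a]
    untilEnd p = do
        done <- isAtEnd
        if done
            then return []
            else (:) <$> p <*> untilEnd p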