Safe Haskell	None
Language	Haskell2010

Language.Lexer.Applicative

Contents

Building a Lexer
- Building Recognizers
Running a Lexer
- Working with a token stream

Description

For some background, see https://ro-che.info/articles/2015-01-02-lexical-analysis

Synopsis

Building a Lexer

data Lexer tok Source

A Lexer specification consists of two recognizers: one for meaningful tokens and one for whitespace and comments.

Although you can construct Lexers directly, it is more convenient to build them with token, whitespace, and the Monoid instance like this:

 myLexer :: Lexer MyToken
 myLexer = mconcat
   [ token      (longest myToken)
   , whitespace (longest myWhiteSpace)
   , whitespace (longestShortest myComment)
   ]

Constructors

Lexer
Fields lexerTokenRE :: Recognizer tok lexerWhitespaceRE :: Recognizer ()

Instances

Functor Lexer
Monoid (Lexer tok)

token :: Recognizer tok -> Lexer tok Source

Build a lexer with the given token recognizer and no (i.e. mempty) whitespace recognizer.

token is a monoid homomorphism:

token a <> token b = token (a <> b)

whitespace :: Recognizer a -> Lexer tok Source

Build a lexer with the given whitespace recognizer and no (i.e. mempty) token recognizer.

whitespace is a monoid homomorphism:

whitespace a <> whitespace b = whitespace (a <> b)

Building Recognizers

data Recognizer tok Source

A token recognizer

Recognizer values are constructed by functions like longest and longestShortest, combined with mappend, and used by token and whitespace.

When a recognizer returns without consuming any characters, a lexical error is signaled.

Instances

Functor Recognizer
Monoid (Recognizer tok)

longest :: RE Char tok -> Recognizer tok Source

When scanning a next token, the regular expression will compete with the other Recognizers of its Lexer. If it wins, its result will become the next token.

longest has the following properties:

longest (r1 <|> r2) = longest r1 <> longest r2

longest r = longestShortest const r (const $ pure ())

longestShortest Source

Arguments

:: (pref -> suff -> tok)
-> RE Char pref	regex for the longest prefix
-> (pref -> RE Char suff)	regex for the shortest suffix
-> Recognizer tok

This is a more sophisticated recognizer than longest.

It recognizes a token consisting of a prefix and a suffix, where prefix is chosen longest, and suffix is chosen shortest.

An example would be a C block comment

/* comment text */

The naive

longest (string "/*" *> many anySym *> string "*/")

doesn't work because it consumes too much: in

/* xxx */ yyy /* zzz */

it will treat the whole line as a comment.

This is where longestShortest comes in handy:

longestShortest
   (\_ _ -> ()) -- don't care about the comment text
   (string "/*")
   (\_ -> many anySym *> string "*/")

Operationally, the prefix regex first competes with other Recognizers for the longest match. If it wins, then the shortest match for the suffix regex is found, and the two results are combined with the given function to produce a token.

The two regular expressions combined must consume some input, or else LexicalError is thrown. However, any one of them may return without consuming input.

* * *

Once the prefix regex wins, the choice is committed; the suffix regex must match or else a LexicalError is thrown. Therefore,

longestShortest f pref suff1
         <>
longestShortest f pref suff2
         =
longestShortest f pref suff1

and is not the same as

longestShortest f pref (suff1 <|> suff2)

The following holds, however:

longestShortest f pref1 suff
         <>
longestShortest f pref2 suff
         =
longestShortest f (pref1 <|> pref2) suff

* * *

Passing the result of prefix into both suffix and combining function may seem superfluous; indeed we could get away with

RE Char pref -> (pref -> RE Char tok) -> Recognizer tok

or even

RE Char (RE Char tok) -> Recognizer tok

This is done purely for convenience and readability; the intention is that pref passed into suffix is used to customize the regular expression which would still return only its part of the token, and then the function will combine the two parts. Of course, you don't need to follow this recommendation. Thanks to parametricity, all three versions are equivalent.

Running a Lexer

runLexer Source

Arguments

:: Lexer tok	lexer specification
-> String	source file name (used in locations)
-> String	source text
-> TokenStream (L tok)

Run a lexer on a string and produce a lazy stream of tokens

Working with a token stream

data TokenStream tok Source

A stream of tokens

Constructors

TsToken tok (TokenStream tok)
TsEof
TsError LexicalError

Instances

Functor TokenStream
IsList (TokenStream tok)
Eq tok => Eq (TokenStream tok)
Show tok => Show (TokenStream tok)
type Item (TokenStream tok) = tok

streamToList :: TokenStream tok -> [tok] Source

Convert a TokenStream to a list of tokens. Turn TsError into a runtime LexicalError exception.

streamToEitherList :: TokenStream tok -> Either LexicalError [tok] Source

Convert a TokenStream into either a token list or a LexicalError. This function may be occasionally useful, but in general its use is discouraged because it needs to force the whole stream before returning a result.

data LexicalError Source

The lexical error exception

Constructors

LexicalError !Pos

Instances

Eq LexicalError
Show LexicalError
Exception LexicalError
Typeable * LexicalError