lexer-applicative-2.1.0.2: Simple lexer based on applicative regular expressions

Safe Haskell: None
Language: Haskell2010

Language.Lexer.Applicative

Building a Lexer

data Lexer tok Source #

A Lexer specification consists of two recognizers: one for meaningful tokens and one for whitespace and comments.

Although you can construct Lexers directly, it is more convenient to build them with token, whitespace, and the Monoid instance like this:

 myLexer :: Lexer MyToken
 myLexer = mconcat
   [ token      (longest myToken)
   , whitespace (longest myWhiteSpace)
   , whitespace (longestShortest myCommentPrefix myCommentSuffix)
   ]
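
The regexes themselves are ordinary regex-applicative values. Here is a hedged sketch of definitions that would fit the example above (MyToken and every definition below are illustrative placeholders, not part of this package):

 import Control.Applicative (many, some, (<|>))
 import Data.Char (isAlpha, isDigit, isSpace)
 import Text.Regex.Applicative

 data MyToken = Name String | Number Int
   deriving (Eq, Show)

 myToken :: RE Char MyToken
 myToken =
       Name <$> some (psym isAlpha)
   <|> Number . read <$> some (psym isDigit)

 myWhiteSpace :: RE Char String
 myWhiteSpace = some (psym isSpace)

 -- Block comments: the suffix regex may depend on the
 -- prefix's result (here it does not).
 myCommentPrefix :: RE Char String
 myCommentPrefix = string "/*"

 myCommentSuffix :: String -> RE Char String
 myCommentSuffix _ = many anySym *> string "*/"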

Constructors

Lexer 

Instances

Functor Lexer Source # 

Methods

fmap :: (a -> b) -> Lexer a -> Lexer b #

(<$) :: a -> Lexer b -> Lexer a #

Semigroup (Lexer tok) Source # 

Methods

(<>) :: Lexer tok -> Lexer tok -> Lexer tok #

sconcat :: NonEmpty (Lexer tok) -> Lexer tok #

stimes :: Integral b => b -> Lexer tok -> Lexer tok #

Monoid (Lexer tok) Source # 

Methods

mempty :: Lexer tok #

mappend :: Lexer tok -> Lexer tok -> Lexer tok #

mconcat :: [Lexer tok] -> Lexer tok #

token :: Recognizer tok -> Lexer tok Source #

Build a lexer with the given token recognizer and no (i.e. mempty) whitespace recognizer.

token is a monoid homomorphism:

token a <> token b = token (a <> b)

whitespace :: Recognizer a -> Lexer tok Source #

Build a lexer with the given whitespace recognizer and no (i.e. mempty) token recognizer.

whitespace is a monoid homomorphism:

whitespace a <> whitespace b = whitespace (a <> b)
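
In practical terms these laws mean that a lexer built from per-token-class fragments can always be fused into one with a single combined recognizer. A hedged sketch (reKeyword and reIdent are hypothetical regexes):

 lexerSplit, lexerFused :: Lexer MyToken
 lexerSplit = token (longest reKeyword) <> token (longest reIdent)
 lexerFused = token (longest reKeyword <> longest reIdent)
 -- by the homomorphism law, lexerSplit and lexerFused coincide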

Building Recognizers

data Recognizer tok Source #

A token recognizer.

Recognizer values are constructed by functions like longest and longestShortest, combined with mappend, and used by token and whitespace.

When a recognizer returns without consuming any characters, a lexical error is signaled.

Instances

Functor Recognizer Source # 

Methods

fmap :: (a -> b) -> Recognizer a -> Recognizer b #

(<$) :: a -> Recognizer b -> Recognizer a #

Semigroup (Recognizer tok) Source # 

Methods

(<>) :: Recognizer tok -> Recognizer tok -> Recognizer tok #

sconcat :: NonEmpty (Recognizer tok) -> Recognizer tok #

stimes :: Integral b => b -> Recognizer tok -> Recognizer tok #

Monoid (Recognizer tok) Source # 

Methods

mempty :: Recognizer tok #

mappend :: Recognizer tok -> Recognizer tok -> Recognizer tok #

mconcat :: [Recognizer tok] -> Recognizer tok #

longest :: RE Char tok -> Recognizer tok Source #

When scanning the next token, the regular expression competes with the other Recognizers of its Lexer. If it wins, its result becomes the next token.

longest has the following properties:

longest (r1 <|> r2) = longest r1 <> longest r2
longest empty = mempty
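
For example, here is a hedged sketch of the competition between a keyword recognizer and an identifier recognizer (the Tok type and the lexer are invented for illustration):

 import Control.Applicative (some)
 import Data.Char (isAlpha, isSpace)
 import Language.Lexer.Applicative
 import Text.Regex.Applicative

 data Tok = If | Id String deriving (Eq, Show)

 tokLexer :: Lexer Tok
 tokLexer = mconcat
   [ token      (longest (If <$ string "if"))
   , token      (longest (Id <$> some (psym isAlpha)))
   , whitespace (longest (psym isSpace))
   ]

On the input "iffy", both regexes match a prefix, but the identifier match (4 characters) is longer than the keyword match (2 characters), so the next token is Id "iffy".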

longestShortest Source #

Arguments

:: RE Char pref

regex for the longest prefix

-> (pref -> RE Char tok)

regex for the shortest suffix

-> Recognizer tok 

This is a more sophisticated recognizer than longest.

It recognizes a token consisting of a prefix and a suffix, where prefix is chosen longest, and suffix is chosen shortest.

An example would be a C block comment:

/* comment text */

The naive

longest (string "/*" *> many anySym *> string "*/")

doesn't work because it consumes too much: in

/* xxx */ yyy /* zzz */

it will treat the whole line as a comment.

This is where longestShortest comes in handy:

longestShortest
   (string "/*")
   (\_ -> many anySym *> string "*/")

Operationally, the prefix regex first competes with the other Recognizers for the longest match. If it wins, the given function is applied to the prefix's result to obtain the suffix regex, the shortest match for that suffix regex is found, and its result becomes the token.

The two regular expressions combined must consume some input, or else a LexicalError is thrown. However, either one of them individually may match without consuming input.

* * *

Once the prefix regex wins, the choice is committed; the suffix regex must match or else a LexicalError is thrown. Therefore,

longestShortest pref suff1
         <>
longestShortest pref suff2
         =
longestShortest pref suff1

and is not the same as

longestShortest pref (suff1 <|> suff2)

The following holds, however:

longestShortest pref1 suff
         <>
longestShortest pref2 suff
         =
longestShortest (pref1 <|> pref2) suff
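
Putting the pieces together, here is a hedged sketch of a complete lexer for the comment example (the Ident token type is invented for illustration):

 import Control.Applicative (many, some)
 import Data.Char (isAlpha, isSpace)
 import Language.Lexer.Applicative
 import Text.Regex.Applicative

 newtype Ident = Ident String deriving (Eq, Show)

 commentLexer :: Lexer Ident
 commentLexer = mconcat
   [ token      (longest (Ident <$> some (psym isAlpha)))
   , whitespace (longest (some (psym isSpace)))
   , whitespace (longestShortest (string "/*") (\_ -> many anySym *> string "*/"))
   ]

Run (see runLexer below) on /* xxx */ yyy /* zzz */, this yields only the token Ident "yyy": because the suffix is matched shortest, each comment ends at the first */.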

Running a Lexer

runLexer Source #

Arguments

:: Lexer tok

lexer specification

-> String

source file name (used in locations)

-> String

source text

-> TokenStream (L tok) 

Run a lexer on a string and produce a lazy stream of tokens.
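
A hedged usage sketch, reusing the commentLexer defined above (L and unLoc come from Data.Loc in the srcloc package; each token in the stream carries its source location):

 import Data.Loc (L, unLoc)

 located :: TokenStream (L Ident)
 located = runLexer commentLexer "example.src" "/* xxx */ yyy /* zzz */"

 -- strip the locations with the Functor instance, then force the list
 plain :: [Ident]
 plain = streamToList (fmap unLoc located)  -- [Ident "yyy"]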

Working with a token stream

data TokenStream tok Source #

A stream of tokens.

Constructors

TsToken tok (TokenStream tok) 
TsEof 
TsError LexicalError 
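
Because the constructors are exposed, a stream can be consumed incrementally by pattern matching. A hedged sketch of a traversal that collects tokens up to the first error, if any:

 takeUntilError :: TokenStream tok -> ([tok], Maybe LexicalError)
 takeUntilError (TsToken t rest) =
   let (ts, e) = takeUntilError rest in (t : ts, e)
 takeUntilError TsEof       = ([], Nothing)
 takeUntilError (TsError e) = ([], Just e)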

Instances

Functor TokenStream Source # 

Methods

fmap :: (a -> b) -> TokenStream a -> TokenStream b #

(<$) :: a -> TokenStream b -> TokenStream a #

Eq tok => Eq (TokenStream tok) Source # 

Methods

(==) :: TokenStream tok -> TokenStream tok -> Bool #

(/=) :: TokenStream tok -> TokenStream tok -> Bool #

Show tok => Show (TokenStream tok) Source # 

Methods

showsPrec :: Int -> TokenStream tok -> ShowS #

show :: TokenStream tok -> String #

showList :: [TokenStream tok] -> ShowS #

streamToList :: TokenStream tok -> [tok] Source #

Convert a TokenStream to a list of tokens. A TsError in the stream is turned into a runtime LexicalError exception.

streamToEitherList :: TokenStream tok -> Either LexicalError [tok] Source #

Convert a TokenStream into either a token list or a LexicalError. This function may occasionally be useful, but in general its use is discouraged, because it needs to force the whole stream before returning a result.
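
A hedged usage sketch, again reusing commentLexer and unLoc from above (LexicalError has a Show instance, so it can be printed directly):

 lexOrReport :: String -> IO ()
 lexOrReport src =
   case streamToEitherList (fmap unLoc (runLexer commentLexer "example.src" src)) of
     Left err   -> putStrLn ("lexical error: " ++ show err)
     Right toks -> mapM_ print toks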