| Safe Haskell | None |
|---|---|
| Language | Haskell2010 |
Language.Lexer.Applicative
Description
For some background, see https://ro-che.info/articles/2015-01-02-lexical-analysis
- data Lexer tok = Lexer {
- lexerTokenRE :: Recognizer tok
- lexerWhitespaceRE :: Recognizer ()
- }
- token :: Recognizer tok -> Lexer tok
- whitespace :: Recognizer a -> Lexer tok
- data Recognizer tok
- longest :: RE Char tok -> Recognizer tok
- longestShortest :: (pref -> suff -> tok) -> RE Char pref -> (pref -> RE Char suff) -> Recognizer tok
- runLexer :: forall tok. Lexer tok -> String -> String -> TokenStream (L tok)
- data TokenStream tok
- = TsToken tok (TokenStream tok)
- | TsEof
- | TsError LexicalError
- streamToList :: TokenStream tok -> [tok]
- streamToEitherList :: TokenStream tok -> Either LexicalError [tok]
- data LexicalError = LexicalError !Pos
Building a Lexer
A Lexer specification consists of two recognizers: one for
meaningful tokens and one for whitespace and comments.
Although you can construct Lexers directly, it is more convenient to
build them with token, whitespace, and the Monoid instance like this:
myLexer :: Lexer MyToken
myLexer = mconcat
  [ token      (longest myToken)
  , whitespace (longest myWhiteSpace)
  , whitespace (longestShortest myComment)
  ]
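To make this concrete, here is a minimal self-contained sketch along the same lines; the token type and regexes (MyToken, myToken, myWhiteSpace) are hypothetical stand-ins for the names used above, and the comment recognizer is omitted:

```haskell
import Control.Applicative (some, (<|>))
import Data.Char (isAlpha, isDigit, isSpace)
import Language.Lexer.Applicative
import Text.Regex.Applicative

-- Hypothetical token type, not part of this library
data MyToken = Ident String | Number Int
  deriving (Eq, Show)

myToken :: RE Char MyToken
myToken = Ident <$> some (psym isAlpha)
      <|> Number . read <$> some (psym isDigit)

myWhiteSpace :: RE Char String
myWhiteSpace = some (psym isSpace)

myLexer :: Lexer MyToken
myLexer = mconcat
  [ token      (longest myToken)
  , whitespace (longest myWhiteSpace)
  ]
```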
Constructors
| Lexer | |
Fields
| lexerTokenRE :: Recognizer tok | |
| lexerWhitespaceRE :: Recognizer () | |
token :: Recognizer tok -> Lexer tok Source
Build a lexer with the given token recognizer and no (i.e. mempty)
whitespace recognizer.
whitespace :: Recognizer a -> Lexer tok Source
Build a lexer with the given whitespace recognizer and no (i.e. mempty)
token recognizer.
whitespace is a monoid homomorphism:
whitespace a <> whitespace b = whitespace (a <> b)
Building Recognizers
data Recognizer tok Source
A token recognizer
Recognizer values are constructed by functions like longest and
longestShortest, combined with mappend, and used by token and
whitespace.
When a recognizer returns without consuming any characters, a lexical error is signaled.
Instances
| Functor Recognizer | |
| Monoid (Recognizer tok) | |
longest :: RE Char tok -> Recognizer tok Source
Build a longest-match recognizer from the given regular expression.
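As a sketch of how longest-match competition plays out (the Tok type and both regexes here are hypothetical), consider a keyword recognizer competing with an identifier recognizer:

```haskell
import Control.Applicative (some)
import Data.Char (isAlpha)
import Language.Lexer.Applicative
import Text.Regex.Applicative

-- Hypothetical token type for illustration
data Tok = KwIf | Ident String
  deriving (Eq, Show)

-- The two recognizers compete for the longest match, so the input
-- "ifx" lexes as Ident "ifx" rather than KwIf followed by Ident "x".
tokLexer :: Lexer Tok
tokLexer = token $
     longest (KwIf  <$  string "if")
  <> longest (Ident <$> some (psym isAlpha))
```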
longestShortest Source
Arguments
| :: (pref -> suff -> tok) | how to combine the prefix and suffix into a token |
| -> RE Char pref | regex for the longest prefix |
| -> (pref -> RE Char suff) | regex for the shortest suffix |
| -> Recognizer tok | |
This is a more sophisticated recognizer than longest.
It recognizes a token consisting of a prefix and a suffix, where prefix is chosen longest, and suffix is chosen shortest.
An example would be a C block comment
/* comment text */
The naive
longest (string "/*" *> many anySym *> string "*/")
doesn't work because it consumes too much: in
/* xxx */ yyy /* zzz */
it will treat the whole line as a comment.
This is where longestShortest comes in handy:
longestShortest
  (\_ _ -> ()) -- don't care about the comment text
  (string "/*")
  (\_ -> many anySym *> string "*/")
Operationally, the prefix regex first competes with other Recognizers
for the longest match. If it wins, then the shortest match for the
suffix regex is found, and the two results are combined with the given
function to produce a token.
The two regular expressions combined must consume some input, or else a
LexicalError is thrown. However, either of them individually may match
without consuming any input.
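Putting the pieces together, a minimal sketch of a complete comment recognizer might look as follows (blockComment is a hypothetical name, assuming the three-argument longestShortest documented here):

```haskell
import Control.Applicative (many)
import Language.Lexer.Applicative
import Text.Regex.Applicative

-- Recognize a C block comment and discard its text. On the input
-- "/* xxx */ yyy /* zzz */" the shortest-suffix rule stops at the
-- first "*/", leaving "yyy" for the other recognizers.
blockComment :: Recognizer ()
blockComment = longestShortest
  (\_ _ -> ())                        -- combine: ignore both parts
  (string "/*")                       -- longest prefix: the opener
  (\_ -> many anySym *> string "*/")  -- shortest suffix: up to "*/"
```

A whitespace blockComment entry in a Lexer then makes such comments skippable.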
* * *
Once the prefix regex wins, the choice is committed; the suffix regex
must match or else a LexicalError is thrown. Therefore,
longestShortest f pref suff1 <> longestShortest f pref suff2 = longestShortest f pref suff1
and is not the same as
longestShortest f pref (suff1 <|> suff2)
The following holds, however:
longestShortest f pref1 suff <> longestShortest f pref2 suff = longestShortest f (pref1 <|> pref2) suff
* * *
Passing the result of the prefix regex into both the suffix regex and the
combining function may seem superfluous; indeed, we could get away with
RE Char pref -> (pref -> RE Char tok) -> Recognizer tok
or even
RE Char (RE Char tok) -> Recognizer tok
This is done purely for convenience and readability; the intention is
that pref passed into suffix is used to customize the regular
expression which would still return only its part of the token, and then
the function will combine the two parts. Of course, you don't need to
follow this recommendation. Thanks to parametricity, all three versions
are equivalent.
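To illustrate the intended pattern, here is a hedged sketch in which the prefix match customizes the suffix regex; the Lua-style long-string syntax and the name longString are illustrative assumptions, not part of this library:

```haskell
import Control.Applicative (many)
import Language.Lexer.Applicative
import Text.Regex.Applicative

-- A Lua-style long string such as [==[ ... ]==]: the closing bracket
-- must carry exactly as many '=' signs as the opening one.
longString :: Recognizer String
longString = longestShortest
  (\_ body -> body)  -- combine: the token is just the body text
  opener
  closerFor
  where
    -- longest prefix: matches the run of '=' signs between brackets
    opener = string "[" *> many (sym '=') <* string "["
    -- shortest suffix: body text up to a closer built from the prefix
    closerFor eqs = many anySym <* string ("]" ++ eqs ++ "]")
```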
Running a Lexer
runLexer Source
Arguments
| :: Lexer tok | lexer specification |
| -> String | source file name (used in locations) |
| -> String | source text |
| -> TokenStream (L tok) | |
Run a lexer on a string and produce a lazy stream of tokens
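A usage sketch, reusing the hypothetical myLexer from "Building a Lexer" above (unLoc is from the srcloc package, which provides the L type):

```haskell
import Data.Loc (unLoc)
import Language.Lexer.Applicative

main :: IO ()
main = do
  -- runLexer wraps each token in L, annotating it with its source
  -- location; unLoc strips the annotation.
  let stream = runLexer myLexer "<stdin>" "foo 42 bar"
  print (map unLoc (streamToList stream))
  -- prints [Ident "foo",Number 42,Ident "bar"]
```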
Working with a token stream
data TokenStream tok Source
A stream of tokens
Constructors
| TsToken tok (TokenStream tok) | |
| TsEof | |
| TsError LexicalError | |
Instances
| Functor TokenStream | |
| IsList (TokenStream tok) | |
| Eq tok => Eq (TokenStream tok) | |
| Show tok => Show (TokenStream tok) | |
| type Item (TokenStream tok) = tok | |
streamToList :: TokenStream tok -> [tok] Source
Convert a TokenStream to a list of tokens. A TsError in the stream is
turned into a LexicalError exception thrown at runtime.
streamToEitherList :: TokenStream tok -> Either LexicalError [tok] Source
Convert a TokenStream into either a token list or a LexicalError.
This function may occasionally be useful, but in general its use is
discouraged because it needs to force the whole stream before returning
a result.
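For example, a sketch of exception-free error handling, again assuming the hypothetical myLexer and MyToken from above:

```haskell
import Data.Loc (unLoc)
import Language.Lexer.Applicative

-- Forces the whole stream up front, returning either the token list
-- or the first lexical error as a value instead of an exception.
lexSafely :: String -> Either LexicalError [MyToken]
lexSafely input =
  map unLoc <$> streamToEitherList (runLexer myLexer "<input>" input)
```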