| Safe Haskell | None |
|---|---|
| Language | Haskell2010 |
Language.Lexer.Applicative
Description
For some background, see https://ro-che.info/articles/2015-01-02-lexical-analysis
- data Lexer tok = Lexer {
- lexerTokenRE :: Recognizer tok
- lexerWhitespaceRE :: Recognizer ()
- token :: Recognizer tok -> Lexer tok
- whitespace :: Recognizer a -> Lexer tok
- data Recognizer tok
- longest :: RE Char tok -> Recognizer tok
- longestShortest :: RE Char pref -> (pref -> RE Char tok) -> Recognizer tok
- runLexer :: forall tok. Lexer tok -> String -> String -> TokenStream (L tok)
- data TokenStream tok
- = TsToken tok (TokenStream tok)
- | TsEof
- | TsError LexicalError
- streamToList :: TokenStream tok -> [tok]
- streamToEitherList :: TokenStream tok -> Either LexicalError [tok]
- data LexicalError = LexicalError !Pos
Building a Lexer
A Lexer specification consists of two recognizers: one for
meaningful tokens and one for whitespace and comments.
Although you can construct Lexers directly, it is more convenient to
build them with token, whitespace, and the Monoid instance like this:
myLexer ::LexerMyToken myLexer =mconcat[token(longestmyToken) ,whitespace(longestmyWhiteSpace) ,whitespace(longestShortestmyCommentPrefix myCommentSuffix) ]
Constructors
| Lexer | |
Fields
| |
token :: Recognizer tok -> Lexer tok Source #
whitespace :: Recognizer a -> Lexer tok Source #
Build a lexer with the given whitespace recognizer and no (i.e. mempty)
token recognizer.
whitespace is a monoid homomorphism:
whitespacea<>whitespaceb =whitespace(a<>b)
Building Recognizers
data Recognizer tok Source #
A token recognizer
Recognizer values are constructed by functions like longest and
longestShortest, combined with mappend, and used by token and
whitespace.
When a recognizer returns without consuming any characters, a lexical error is signaled.
Instances
| Functor Recognizer Source # | |
| Semigroup (Recognizer tok) Source # | |
| Monoid (Recognizer tok) Source # | |
Arguments
| :: RE Char pref | regex for the longest prefix |
| -> (pref -> RE Char tok) | regex for the shortest suffix |
| -> Recognizer tok |
This is a more sophisticated recognizer than longest.
It recognizes a token consisting of a prefix and a suffix, where prefix is chosen longest, and suffix is chosen shortest.
An example would be a C block comment
/* comment text */
The naive
longest(string"/*"*>manyanySym*>string"*/")
doesn't work because it consumes too much: in
/* xxx */ yyy /* zzz */
it will treat the whole line as a comment.
This is where longestShortest comes in handy:
longestShortest(string"/*") (\_ ->manyanySym*>string"*/")
Operationally, the prefix regex first competes with other Recognizers
for the longest match. If it wins, then the shortest match for the
suffix regex is found, and the two results are combined with the given
function to produce a token.
The two regular expressions combined must consume some input, or else
LexicalError is thrown. However, any one of them may return without
consuming input.
* * *
Once the prefix regex wins, the choice is committed; the suffix regex
must match or else a LexicalError is thrown. Therefore,
longestShortestpref suff1<>longestShortestpref suff2 =longestShortestpref suff1
and is not the same as
longestShortestpref (suff1<|>suff2)
The following holds, however:
longestShortestpref1 suff<>longestShortestpref2 suff =longestShortest(pref1<|>pref2) suff
Running a Lexer
Arguments
| :: Lexer tok | lexer specification |
| -> String | source file name (used in locations) |
| -> String | source text |
| -> TokenStream (L tok) |
Run a lexer on a string and produce a lazy stream of tokens
Working with a token stream
data TokenStream tok Source #
A stream of tokens
Constructors
| TsToken tok (TokenStream tok) | |
| TsEof | |
| TsError LexicalError |
Instances
| Functor TokenStream Source # | |
| Eq tok => Eq (TokenStream tok) Source # | |
| Show tok => Show (TokenStream tok) Source # | |
streamToList :: TokenStream tok -> [tok] Source #
Convert a TokenStream to a list of tokens. Turn TsError into
a runtime LexicalError exception.
streamToEitherList :: TokenStream tok -> Either LexicalError [tok] Source #
Convert a TokenStream into either a token list or a LexicalError.
This function may be occasionally useful, but in general its use is
discouraged because it needs to force the whole stream before returning
a result.