Safe Haskell | None |
---|---|
Language | Haskell2010 |
For some background, see https://ro-che.info/articles/2015-01-02-lexical-analysis
- data Lexer tok = Lexer {
- lexerTokenRE :: Recognizer tok
- lexerWhitespaceRE :: Recognizer ()
- token :: Recognizer tok -> Lexer tok
- whitespace :: Recognizer a -> Lexer tok
- data Recognizer tok
- longest :: RE Char tok -> Recognizer tok
- longestShortest :: (pref -> suff -> tok) -> RE Char pref -> (pref -> RE Char suff) -> Recognizer tok
- runLexer :: forall tok. Lexer tok -> String -> String -> TokenStream (L tok)
- data TokenStream tok
- = TsToken tok (TokenStream tok)
- | TsEof
- | TsError LexicalError
- streamToList :: TokenStream tok -> [tok]
- streamToEitherList :: TokenStream tok -> Either LexicalError [tok]
- data LexicalError = LexicalError !Pos
Building a Lexer
A Lexer
specification consists of two recognizers: one for
meaningful tokens and one for whitespace and comments.
Although you can construct Lexer
s directly, it is more convenient to
build them with token
, whitespace
, and the Monoid
instance like this:
myLexer ::Lexer
MyToken myLexer =mconcat
[token
(longest
myToken) ,whitespace
(longest
myWhiteSpace) ,whitespace
(longestShortest
myComment) ]
Lexer | |
|
token :: Recognizer tok -> Lexer tok Source
whitespace :: Recognizer a -> Lexer tok Source
Build a lexer with the given whitespace recognizer and no (i.e. mempty
)
token recognizer.
whitespace
is a monoid homomorphism:
whitespace
a<>
whitespace
b =whitespace
(a<>
b)
Building Recognizers
data Recognizer tok Source
A token recognizer
Recognizer
values are constructed by functions like longest
and
longestShortest
, combined with mappend
, and used by token
and
whitespace
.
When a recognizer returns without consuming any characters, a lexical error is signaled.
Functor Recognizer | |
Monoid (Recognizer tok) |
longest :: RE Char tok -> Recognizer tok Source
:: (pref -> suff -> tok) | |
-> RE Char pref | regex for the longest prefix |
-> (pref -> RE Char suff) | regex for the shortest suffix |
-> Recognizer tok |
This is a more sophisticated recognizer than longest
.
It recognizes a token consisting of a prefix and a suffix, where prefix is chosen longest, and suffix is chosen shortest.
An example would be a C block comment
/* comment text */
The naive
longest
(string
"/*"*>
many
anySym
*>
string
"*/")
doesn't work because it consumes too much: in
/* xxx */ yyy /* zzz */
it will treat the whole line as a comment.
This is where longestShortest
comes in handy:
longestShortest
(\_ _ -> ()) -- don't care about the comment text (string
"/*") (\_ ->many
anySym
*>
string
"*/")
Operationally, the prefix regex first competes with other Recognizer
s
for the longest match. If it wins, then the shortest match for the
suffix regex is found, and the two results are combined with the given
function to produce a token.
The two regular expressions combined must consume some input, or else
LexicalError
is thrown. However, any one of them may return without
consuming input.
* * *
Once the prefix regex wins, the choice is committed; the suffix regex
must match or else a LexicalError
is thrown. Therefore,
longestShortest
f pref suff1<>
longestShortest
f pref suff2 =longestShortest
f pref suff1
and is not the same as
longestShortest
f pref (suff1<|>
suff2)
The following holds, however:
longestShortest
f pref1 suff<>
longestShortest
f pref2 suff =longestShortest
f (pref1<|>
pref2) suff
* * *
Passing the result of prefix into both suffix and combining function may seem superfluous; indeed we could get away with
RE
Char
pref -> (pref ->RE
Char
tok) ->Recognizer
tok
or even
RE
Char
(RE
Char
tok) ->Recognizer
tok
This is done purely for convenience and readability; the intention is
that pref
passed into suffix is used to customize the regular
expression which would still return only its part of the token, and then
the function will combine the two parts. Of course, you don't need to
follow this recommendation. Thanks to parametricity, all three versions
are equivalent.
Running a Lexer
:: Lexer tok | lexer specification |
-> String | source file name (used in locations) |
-> String | source text |
-> TokenStream (L tok) |
Run a lexer on a string and produce a lazy stream of tokens
Working with a token stream
data TokenStream tok Source
A stream of tokens
TsToken tok (TokenStream tok) | |
TsEof | |
TsError LexicalError |
Functor TokenStream | |
IsList (TokenStream tok) | |
Eq tok => Eq (TokenStream tok) | |
Show tok => Show (TokenStream tok) | |
type Item (TokenStream tok) = tok |
streamToList :: TokenStream tok -> [tok] Source
Convert a TokenStream
to a list of tokens. Turn TsError
into
a runtime LexicalError
exception.
streamToEitherList :: TokenStream tok -> Either LexicalError [tok] Source
Convert a TokenStream
into either a token list or a LexicalError
.
This function may be occasionally useful, but in general its use is
discouraged because it needs to force the whole stream before returning
a result.
data LexicalError Source
The lexical error exception