Safe Haskell | None |
---|---|
Language | Haskell2010 |
For some background, see https://ro-che.info/articles/2015-01-02-lexical-analysis
- data Lexer tok = Lexer {
- lexerTokenRE :: Recognizer tok
- lexerWhitespaceRE :: Recognizer ()
- token :: Recognizer tok -> Lexer tok
- whitespace :: Recognizer a -> Lexer tok
- data Recognizer tok
- longest :: RE Char tok -> Recognizer tok
- longestShortest :: RE Char pref -> (pref -> RE Char tok) -> Recognizer tok
- runLexer :: forall tok. Lexer tok -> String -> String -> TokenStream (L tok)
- data TokenStream tok
- = TsToken tok (TokenStream tok)
- | TsEof
- | TsError LexicalError
- streamToList :: TokenStream tok -> [tok]
- streamToEitherList :: TokenStream tok -> Either LexicalError [tok]
- data LexicalError = LexicalError !Pos
Building a Lexer
A Lexer
specification consists of two recognizers: one for
meaningful tokens and one for whitespace and comments.
Although you can construct Lexer
s directly, it is more convenient to
build them with token
, whitespace
, and the Monoid
instance like this:
myLexer ::Lexer
MyToken myLexer =mconcat
[token
(longest
myToken) ,whitespace
(longest
myWhiteSpace) ,whitespace
(longestShortest
myCommentPrefix myCommentSuffix) ]
Lexer | |
|
token :: Recognizer tok -> Lexer tok Source #
whitespace :: Recognizer a -> Lexer tok Source #
Build a lexer with the given whitespace recognizer and no (i.e. mempty
)
token recognizer.
whitespace
is a monoid homomorphism:
whitespace
a<>
whitespace
b =whitespace
(a<>
b)
Building Recognizers
data Recognizer tok Source #
A token recognizer
Recognizer
values are constructed by functions like longest
and
longestShortest
, combined with mappend
, and used by token
and
whitespace
.
When a recognizer returns without consuming any characters, a lexical error is signaled.
Functor Recognizer Source # | |
Semigroup (Recognizer tok) Source # | |
Monoid (Recognizer tok) Source # | |
:: RE Char pref | regex for the longest prefix |
-> (pref -> RE Char tok) | regex for the shortest suffix |
-> Recognizer tok |
This is a more sophisticated recognizer than longest
.
It recognizes a token consisting of a prefix and a suffix, where prefix is chosen longest, and suffix is chosen shortest.
An example would be a C block comment
/* comment text */
The naive
longest
(string
"/*"*>
many
anySym
*>
string
"*/")
doesn't work because it consumes too much: in
/* xxx */ yyy /* zzz */
it will treat the whole line as a comment.
This is where longestShortest
comes in handy:
longestShortest
(string
"/*") (\_ ->many
anySym
*>
string
"*/")
Operationally, the prefix regex first competes with other Recognizer
s
for the longest match. If it wins, then the shortest match for the
suffix regex is found, and the two results are combined with the given
function to produce a token.
The two regular expressions combined must consume some input, or else
LexicalError
is thrown. However, any one of them may return without
consuming input.
* * *
Once the prefix regex wins, the choice is committed; the suffix regex
must match or else a LexicalError
is thrown. Therefore,
longestShortest
pref suff1<>
longestShortest
pref suff2 =longestShortest
pref suff1
and is not the same as
longestShortest
pref (suff1<|>
suff2)
The following holds, however:
longestShortest
pref1 suff<>
longestShortest
pref2 suff =longestShortest
(pref1<|>
pref2) suff
Running a Lexer
:: Lexer tok | lexer specification |
-> String | source file name (used in locations) |
-> String | source text |
-> TokenStream (L tok) |
Run a lexer on a string and produce a lazy stream of tokens
Working with a token stream
data TokenStream tok Source #
A stream of tokens
TsToken tok (TokenStream tok) | |
TsEof | |
TsError LexicalError |
Functor TokenStream Source # | |
Eq tok => Eq (TokenStream tok) Source # | |
Show tok => Show (TokenStream tok) Source # | |
streamToList :: TokenStream tok -> [tok] Source #
Convert a TokenStream
to a list of tokens. Turn TsError
into
a runtime LexicalError
exception.
streamToEitherList :: TokenStream tok -> Either LexicalError [tok] Source #
Convert a TokenStream
into either a token list or a LexicalError
.
This function may be occasionally useful, but in general its use is
discouraged because it needs to force the whole stream before returning
a result.
data LexicalError Source #
The lexical error exception