Description

A monad for writing pure tokenizers in an imperative-looking way.

Main idea: you walk through the input string like a turtle, and every time you find a token boundary, you call emit. If certain kinds of tokens should be suppressed, you can discard them instead (or filter them out afterwards).
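The walk/emit idea can be sketched as a pure function over ordinary strings (a minimal sketch with a hypothetical helper name, not this library's implementation): the tokenizer accumulates characters as it walks, emits the accumulator at a boundary, and discards the boundary character itself.

```haskell
-- Minimal sketch of the walk/emit idea on plain String (hypothetical
-- helper, not part of this library): walk forward, accumulating
-- characters; at a space, emit the accumulated token and discard the
-- space itself.
tokenizeSimple :: String -> [String]
tokenizeSimple = go ""
  where
    -- end of text: emit whatever is left, if non-empty
    go acc [] = [reverse acc | not (null acc)]
    go acc (c:cs)
      | c == ' '  = [reverse acc | not (null acc)] ++ go "" cs  -- boundary: emit, discard ' '
      | otherwise = go (c:acc) cs                               -- walk: accumulate
```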

This module is specialized for lazy text. The module Control.Monad.Tokenizer provides more general types.

Example of a simple tokenizer that splits words on whitespace and discards stop symbols:

tokenizeWords :: LT.Text -> [LT.Text]
tokenizeWords = runTokenizer $ untilEOT $ do
  c <- pop
  if isStopSym c
    then discard
    else if c `elem` (" \t\r\n" :: [Char])
      then discard
      else do
        walkWhile (\c -> (c=='_') || not (isSpace c || isPunctuation' c))
        emit
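For comparison, the same splitting logic can be written as a self-contained pure function on String, without the library. Note that the example above assumes helpers like isStopSym and isPunctuation' are defined elsewhere; here Data.Char.isPunctuation stands in for them, so this is only an approximation:

```haskell
import Data.Char (isPunctuation, isSpace)

-- Self-contained approximation of tokenizeWords (a sketch, not using
-- the library): a character belongs to a word if it is '_' or neither
-- whitespace nor punctuation; everything else is discarded.
tokenizeWordsApprox :: String -> [String]
tokenizeWordsApprox [] = []
tokenizeWordsApprox s@(c:cs)
  | keepChar c = let (tok, rest) = span keepChar s  -- walkWhile
                 in tok : tokenizeWordsApprox rest  -- emit
  | otherwise  = tokenizeWordsApprox cs             -- discard
  where
    keepChar ch = ch == '_' || not (isSpace ch || isPunctuation ch)
```

The '_' escape hatch matters because Data.Char.isPunctuation counts '_' (Unicode connector punctuation) as punctuation, which is why the original example tests for it explicitly.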
Synopsis

The Tokenizer monad. Use runTokenizer or runTokenizerCS to run it.

runTokenizer :: Tokenizer () -> Text -> [Text] Source #

Split a string into tokens using the given tokenizer

runTokenizerCS :: Tokenizer () -> Text -> [Text] Source #

Split a string into tokens using the given tokenizer; case-sensitive version

untilEOT :: Tokenizer () -> Tokenizer () Source #

Repeat a given tokenizer as long as the end of text is not reached
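untilEOT is essentially a loop guarded by an end-of-input test. The pattern can be sketched generically for any monad (hypothetical untilDone, not the library's definition):

```haskell
-- Generic sketch of the untilEOT pattern (hypothetical, not the
-- library's code): repeat an action until a monadic predicate holds.
untilDone :: Monad m => m Bool -> m () -> m ()
untilDone isDone step = loop
  where
    loop = do
      done <- isDone
      if done then pure () else step >> loop
```

With isDone instantiated to an end-of-text test and step to one tokenization pass, this yields the behaviour described above.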

# Tests

peek :: Tokenizer Char Source #

Peek the current character

isEOT :: Tokenizer Bool Source #

Have I reached the end of the input text?

lookAhead :: [Char] -> Tokenizer Bool Source #

Check if the next input chars agree with the given string

# Movement

walk :: Tokenizer () Source #

Proceed to the next character

walkBack :: Tokenizer () Source #

Walk back to the previous character, unless it was discarded/emitted.

pop :: Tokenizer Char Source #

Peek the current character and proceed

walkWhile :: (Char -> Bool) -> Tokenizer () Source #

Proceed as long as a given function succeeds

walkFold :: a -> (Char -> a -> Maybe a) -> Tokenizer () Source #

Proceed as long as a given fold returns Just (generalization of walkWhile)
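The relationship between walkWhile and walkFold can be illustrated on plain strings (hypothetical pure analogues spanFold and spanWhile, not the library's functions): the fold threads a state and stops at the first Nothing, and walkWhile is the special case that ignores the state.

```haskell
-- Pure analogue of walkFold (a sketch): consume characters while the
-- step function returns Just a new state; return (consumed, rest).
spanFold :: a -> (Char -> a -> Maybe a) -> String -> (String, String)
spanFold z step = go z []
  where
    go _ acc []     = (reverse acc, [])
    go s acc (c:cs) = case step c s of
      Just s' -> go s' (c:acc) cs
      Nothing -> (reverse acc, c:cs)

-- Pure analogue of walkWhile: the fold state is ignored.
spanWhile :: (Char -> Bool) -> String -> (String, String)
spanWhile p = spanFold () (\c _ -> if p c then Just () else Nothing)
```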

# Transactions

emit :: Tokenizer () Source #

Break at the current position and emit the scanned token

discard :: Tokenizer () Source #

Break at the current position and discard the scanned token

restore :: Tokenizer () Source #

Restore the state after the last emit/discard.
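The transaction behaviour can be modelled with a pure sketch (hypothetical Tx type, not the library's internal representation): the state records what has been scanned since the last emit/discard, and restore pushes those characters back onto the input.

```haskell
-- Pure sketch of the transaction state (hypothetical, not the
-- library's representation): characters scanned since the last
-- emit/discard, plus the remaining input.
data Tx = Tx { scannedTx :: String, restTx :: String }
  deriving (Eq, Show)

-- walk: move one character from the input into the scanned buffer
walkTx :: Tx -> Tx
walkTx (Tx s (c:cs)) = Tx (s ++ [c]) cs
walkTx t             = t

-- restore: undo the walk since the last emit/discard by pushing the
-- scanned characters back onto the input
restoreTx :: Tx -> Tx
restoreTx (Tx s r) = Tx "" (s ++ r)
```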