# tokenizer-monad

**Motivation**: Before working with tokenizer-monad, I often implemented tokenizers by recursively deconstructing Char lists. The resulting code was purely functional, but hardly readable, and even less so when deconstructing Text instead of Char lists. I usually picture tokenization algorithms as flow charts, so I wanted to write them in a similar manner.

**Main idea**: You `walk` through the input string like a turtle, and every time you find a token boundary, you call `emit`. If certain kinds of tokens should be suppressed, you can `discard` them instead (or filter them out afterwards).

This package supports Strings, strict and lazy Text, as well as strict and lazy ASCII ByteStrings. The generic `Tokenizer` type from the module `Control.Monad.Tokenizer` takes the text type as its first argument. In the other modules, `Tokenizer` is already specialized to a specific text type. It is advisable not to import more than one module from this package; if you need several text types, switch to the more general module instead.

## Provided functions

The functions provided by this package can be divided into three categories:

- *tests* peek characters from the input text or check a condition; they affect neither the turtle position nor the emissions. Examples: `peek`, `isEOT`, `lookAhead`
- *walkers* modify the turtle position, but have no effect on the emissions. Examples: `walk`, `walkBack`, `pop`, `restore`
- *commits* cut off the input text at the current position, and emit or discard the visited part. Examples: `emit`, `discard`

## Examples

The examples below assume `import Data.Char (isSpace, isPunctuation)` and, for the Text example, `import Data.Text (Text)`.

This tokenizer is equivalent to `words` from the Prelude:

    words' :: String -> [String]
    words' = runTokenizerCS $ untilEOT $ do
      c <- pop
      if c `elem` " \t\n\r"
        then discard
        else do
          walkWhile (not . isSpace)
          emit

    ...> words' "Dieses Haus ist blau."
    ["Dieses","Haus","ist","blau."]

This tokenizer is similar to `lines` from the Prelude, but discards empty lines:

    lines' :: String -> [String]
    lines' = runTokenizerCS $ untilEOT $ do
      c <- pop
      if c `elem` "\n\r"
        then discard
        else do
          walkWhile (\c -> not (c `elem` "\r\n"))
          emit

    ...> lines' "Dieses Haus ist\n\nblau.\n"
    ["Dieses Haus ist","blau."]

A more advanced tokenizer that can handle punctuation and HTTP(S) URIs in text:

    t1Tokenize' :: Tokenizer Text ()
    t1Tokenize' = do
      http <- lookAhead "http://"
      https <- lookAhead "https://"
      if http || https
        then walkWhile (not . isSpace) >> discard
        else do
          c <- peek
          walk
          if isStopSym c
            then emit
            else if c `elem` (" \t\r\n" :: [Char])
              then discard
              else do
                walkWhile (\c -> (c == '_') || not (isSpace c || isPunctuation c))
                emit
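The last example leaves `isStopSym` undefined and does not show how the `Tokenizer Text` action is run. The following is only a minimal sketch, not part of the package: it assumes a hypothetical `isStopSym` that just checks for common sentence punctuation, and that `runTokenizerCS` and `untilEOT` from `Control.Monad.Tokenizer` work for `Text` the same way they work for `String` in the examples above.

    {-# LANGUAGE OverloadedStrings #-}

    import Control.Monad.Tokenizer
    import Data.Char (isPunctuation, isSpace)
    import Data.Text (Text)

    -- Hypothetical helper assumed by t1Tokenize' above: common sentence
    -- punctuation becomes a token of its own.
    isStopSym :: Char -> Bool
    isStopSym c = c `elem` (".,;:!?" :: [Char])

    -- Run the URI-aware tokenizer (t1Tokenize' as defined above) over Text.
    t1Tokenize :: Text -> [Text]
    t1Tokenize = runTokenizerCS (untilEOT t1Tokenize')

Under these assumptions, a session might look like this (note that the URI branch discards the whole whitespace-delimited chunk instead of emitting it):

    ...> t1Tokenize "Hello, world!"
    ["Hello",",","world","!"]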
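To make the three categories from the *Provided functions* section concrete, here is one more small sketch (again not part of the package) that splits a String on commas and drops empty fields, using only functions that already appear in this README:

    import Control.Monad.Tokenizer

    -- Split a String on commas; empty fields are dropped.
    csvFields :: String -> [String]
    csvFields = runTokenizerCS $ untilEOT $ do
      c <- pop                 -- walker: consume one character
      if c == ','
        then discard           -- commit: drop the separator
        else do
          walkWhile (/= ',')   -- walker: advance to the next comma (or EOT)
          emit                 -- commit: emit the visited field

With the same assumptions as above, this should behave as follows:

    ...> csvFields "red,green,,blue"
    ["red","green","blue"]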