tokenizer-monad: An efficient and easy-to-use tokenizer monad.

[ gpl, library, text ]

This monad can be used for writing efficient and readable tokenizers in an imperative way.


Versions 0.1.0.0, 0.2.0.0, 0.2.1.0, 0.2.2.0
Change log ChangeLog.md
Dependencies base (>=4.9 && <5), bytestring, text (>=1.2)
License GPL-3.0-only
Copyright (c) 2017-2019 Enum Cohrs
Author Enum Cohrs
Maintainer darcs@enumeration.eu
Revised Revision 1 made by implementation at Sun Jan 20 20:37:49 UTC 2019
Category Text
Uploaded by implementation at Sun Jan 20 19:55:59 UTC 2019
Distributions NixOS:0.2.2.0
Downloads 407 total (73 in the last 30 days)
Docs not available
All reported builds failed as of 2019-01-20 (2 reports)

Modules

  • Control
    • Monad
      • Control.Monad.Tokenizer
        • Char8
          • Control.Monad.Tokenizer.Char8.Lazy
          • Control.Monad.Tokenizer.Char8.Strict
        • Control.Monad.Tokenizer.String
        • Text
          • Control.Monad.Tokenizer.Text.Lazy
          • Control.Monad.Tokenizer.Text.Strict

Downloads

Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.


Readme for tokenizer-monad-0.2.0.0


tokenizer-monad

Motivation: Before working with tokenizer-monad, I often implemented tokenizers by recursively deconstructing Char lists. The resulting code was purely functional but hardly readable, even more so when deconstructing Text instead of Char lists. I usually picture tokenization algorithms as flow charts, so I wanted to code them in a similar manner.

Main idea: You walk through the input string like a turtle, and every time you find a token boundary, you call emit. If specific kinds of tokens should be suppressed, you can discard them instead (or filter them out afterwards).
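
To make the walk/emit/discard idea concrete, here is a minimal sketch that models it with base only. It is not the package's implementation: the Cursor type and the names walkC, walkWhileC, emitC, discardC and wordsModel are hypothetical and exist only for this illustration. The real library exposes the same idea as a monad.

```haskell
import Data.Char (isSpace)

-- Hypothetical cursor over a String (not the package's internal representation):
-- characters walked over since the last token boundary (reversed),
-- the unread input, and the tokens emitted so far (reversed).
data Cursor = Cursor String String [String]

-- Step over one character, adding it to the pending token.
walkC :: Cursor -> Cursor
walkC (Cursor p (c:cs) ts) = Cursor (c:p) cs ts
walkC cur                  = cur

-- Step over characters for as long as the predicate holds.
walkWhileC :: (Char -> Bool) -> Cursor -> Cursor
walkWhileC f (Cursor p r ts) =
  let (taken, left) = span f r
  in  Cursor (reverse taken ++ p) left ts

-- Token boundary: keep the pending characters as a token.
emitC :: Cursor -> Cursor
emitC (Cursor p r ts) = Cursor [] r (reverse p : ts)

-- Token boundary: throw the pending characters away.
discardC :: Cursor -> Cursor
discardC (Cursor _ r ts) = Cursor [] r ts

-- Whitespace-separated words, written as a walk over the input.
wordsModel :: String -> [String]
wordsModel s = go (Cursor [] s [])
  where
    go (Cursor _ [] ts) = reverse ts
    go cur@(Cursor _ (c:_) _)
      | isSpace c = go (discardC (walkC cur))
      | otherwise = go (emitC (walkWhileC (not . isSpace) (walkC cur)))
```

In this model, wordsModel "Dieses Haus ist blau." evaluates to ["Dieses","Haus","ist","blau."]. Every branch walks over at least one character before reaching a boundary, which is what guarantees that the loop terminates.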

This package supports Strings, strict and lazy Text, as well as strict and lazy ASCII ByteStrings.

Examples:

This tokenizer is equivalent to words from Prelude:

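 import Control.Monad.Tokenizer  -- or a specialised module such as Control.Monad.Tokenizer.String
 import Data.Char (isSpace)

 words' :: String -> [String]
 words' = runTokenizerCS $ untilEOT $ do
   c <- pop
   if isSpace c          -- the same whitespace test that words uses
     then discard        -- drop the separator
     else do
       walkWhile (not . isSpace)  -- extend the token up to the next whitespace
       emit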

...> words' "Dieses Haus ist blau."
["Dieses","Haus","ist","blau."]

This tokenizer is similar to lines from Prelude, but it discards empty lines and treats \r as a line break as well:

 lines' :: String -> [String]
 lines' = runTokenizerCS $ untilEOT $ do
   c <- pop
   if c `elem` "\r\n"
     then discard  -- a lone line break yields no token, so empty lines vanish
     else do
       walkWhile (`notElem` "\r\n")  -- extend the token to the end of the line
       emit

...> lines' "Dieses Haus ist\n\nblau.\n"
["Dieses Haus ist","blau."]
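
For comparison, the following base-only sketch spells out the behaviour described above without the tokenizer monad. The name linesRef is hypothetical and only serves as a reference point: it splits on both \r and \n and drops empty pieces, whereas lines from Prelude splits on \n only and keeps empty lines.

```haskell
-- Hypothetical reference function (base only, not part of the package):
-- split on '\r' and '\n' and drop empty pieces.
linesRef :: String -> [String]
linesRef s = case break (`elem` "\r\n") s of
  (l, [])       -> [l | not (null l)]
  (l, _ : rest) -> [l | not (null l)] ++ linesRef rest
```

On the input "Dieses Haus ist\n\nblau.\n", linesRef returns ["Dieses Haus ist","blau."], while lines from Prelude returns ["Dieses Haus ist","","blau."].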

A more advanced tokenizer that handles punctuation and skips HTTP URIs in text:

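 t1Tokenize' :: Tokenizer Text ()
 t1Tokenize' = do
   -- Check whether a URI scheme starts at the current position
   http <- lookAhead "http://"
   https <- lookAhead "https://"
   if http || https
      -- Skip the whole URI, up to the next whitespace
      then walkWhile (not . isSpace) >> discard
      else do
        c <- peek
        walk
        -- isStopSym is assumed to be a Char -> Bool predicate defined elsewhere
        if isStopSym c
          then emit  -- a stop symbol is a token of its own
          else if c `elem` (" \t\r\n" :: [Char])
               then discard
               else do
                 -- isPunctuation is True for '_', so the underscore is allowed explicitly
                 walkWhile (\x -> x == '_' || not (isSpace x || isPunctuation x))
                 emit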