tokenizer-monad: An efficient and easy-to-use tokenizer monad.

[ gpl, library, text ]

This monad can be used for writing efficient and readable tokenizers in an imperative way.

Modules

Control
- Monad
  - Control.Monad.Tokenizer

Versions	0.1.0.0, 0.2.0.0, 0.2.1.0, 0.2.2.0
Change log	ChangeLog.md
Dependencies	base (>=4.9 && <5), bytestring, text (>=1.2)
License	GPL-3.0-only
Copyright	(c) 2017-2019 Enum Cohrs
Author	Enum Cohrs
Maintainer	darcs@enumeration.eu
Category	Text
Source repo	head: darcs get https://hub.darcs.net/enum/tokenizer-monad
Uploaded	by implementation at 2019-01-20T21:41:11Z
Distributions	NixOS:0.2.2.0
Reverse Dependencies	2 direct, 0 indirect
Downloads	2053 total (15 in the last 30 days)
Status	Docs available; last success reported on 2019-01-20

Readme for tokenizer-monad-0.2.1.0

tokenizer-monad

Motivation: Before working with tokenizer-monad, I often implemented tokenizers by recursively deconstructing Char lists. The resulting code was purely functional but hard to read, and even harder when deconstructing Text instead of Char lists. I usually picture tokenization algorithms as flow charts, so I wanted to code them in a similar manner.

Main idea: You walk through the input string like a turtle, and every time you find a token boundary, you call emit. If certain kinds of tokens should be suppressed, you can discard them instead (or filter them out afterwards).

This package supports Strings, strict and lazy Text, and strict and lazy ASCII ByteStrings. The generic Tokenizer type from module Control.Monad.Tokenizer takes the text type as its first argument. In the other modules, Tokenizer is already specialized to a specific text type. It is advisable to import only one module from this package; if a specialized module is not enough, switch to a more general one.

Provided functions

The functions provided by this package can be divided into three categories:

- tests: peek at characters from the input text or check a condition; they affect neither the turtle position nor the emissions. Examples: peek, isEOT, lookAhead
- walkers: modify the turtle position, but have no effect on the emissions. Examples: walk, walkBack, pop, restore
- commits: cut off the input text at the current position, and emit or discard the visited part. Examples: emit, discard
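
The three categories can be pictured with a small toy model. The sketch below is not the package's implementation and does not use its API: the Turtle record, the Tok monad and every function with a T suffix (peekT, walkT, emitT and so on) are hypothetical names made up for this illustration, and it depends only on base. It assumes a turtle that remembers the characters visited since the last commit, the remaining input, and the tokens emitted so far.

```haskell
module Main where

import Data.Char (isSpace)

-- Toy model only: NOT the library's implementation or API.
-- visited: characters walked over since the last commit (reversed)
-- remaining: input still ahead of the turtle
-- emitted: tokens emitted so far (reversed)
data Turtle = Turtle
  { visited   :: String
  , remaining :: String
  , emitted   :: [String]
  }

-- A hand-rolled state monad, so the sketch needs nothing beyond base.
newtype Tok a = Tok { runTok :: Turtle -> (a, Turtle) }

instance Functor Tok where
  fmap f (Tok g) = Tok $ \s -> let (a, s') = g s in (f a, s')

instance Applicative Tok where
  pure a = Tok $ \s -> (a, s)
  Tok f <*> Tok g = Tok $ \s ->
    let (h, s')  = f s
        (a, s'') = g s'
    in (h a, s'')

instance Monad Tok where
  Tok g >>= f = Tok $ \s -> let (a, s') = g s in runTok (f a) s'

-- Tests: inspect the input, change nothing.
-- Returning '\0' at the end of the text is an assumption of this sketch.
peekT :: Tok Char
peekT = Tok $ \s -> (case remaining s of { (c:_) -> c; [] -> '\0' }, s)

isEOTT :: Tok Bool
isEOTT = Tok $ \s -> (null (remaining s), s)

-- Walkers: move the turtle, leave the emissions alone.
walkT :: Tok ()
walkT = Tok $ \s -> case remaining s of
  (c:cs) -> ((), s { visited = c : visited s, remaining = cs })
  []     -> ((), s)

walkWhileT :: (Char -> Bool) -> Tok ()
walkWhileT p = do
  eot <- isEOTT
  c   <- peekT
  if not eot && p c then walkT >> walkWhileT p else pure ()

-- Commits: cut off the visited part, then emit or discard it.
emitT :: Tok ()
emitT = Tok $ \s ->
  ((), s { visited = "", emitted = reverse (visited s) : emitted s })

discardT :: Tok ()
discardT = Tok $ \s -> ((), s { visited = "" })

-- Repeat a tokenizer step until the input is exhausted.
untilEOTT :: Tok () -> Tok ()
untilEOTT body = do
  eot <- isEOTT
  if eot then pure () else body >> untilEOTT body

runToy :: Tok () -> String -> [String]
runToy t input = reverse (emitted (snd (runTok t (Turtle "" input []))))

-- A whitespace splitter written against the toy model.
toyWords :: String -> [String]
toyWords = runToy $ untilEOTT $ do
  c <- peekT
  walkT
  if isSpace c
    then discardT
    else walkWhileT (not . isSpace) >> emitT

main :: IO ()
main = print (toyWords "Dieses Haus ist blau.")
-- prints ["Dieses","Haus","ist","blau."]
```

In this model a test never touches the state, a walker only moves characters from remaining to visited, and a commit is the only operation that clears visited. That separation is what lets a tokenizer read like a flow chart.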

Examples

This tokenizer is equivalent to words from Prelude:

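 import Control.Monad.Tokenizer
 import Data.Char (isSpace)

 words' :: String -> [String]
 words' = runTokenizerCS $ untilEOT $ do
   c <- pop
   if isSpace c                   -- the same whitespace test that words uses
     then discard                 -- drop the whitespace character
     else do
       walkWhile (not . isSpace)  -- extend the token up to the next whitespace
       emit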

...> words' "Dieses Haus ist blau."
["Dieses","Haus","ist","blau."]

This tokenizer is similar to lines from Prelude, but discards empty lines:

 lines' :: String -> [String]
 lines' = runTokenizerCS $ untilEOT $ do
   c <- pop
   if c `elem` "\n\r"
     then discard
     else do
       walkWhile (`notElem` "\r\n")
       emit

...> lines' "Dieses Haus ist\n\nblau.\n"
["Dieses Haus ist","blau."]

A more advanced tokenizer that handles punctuation and HTTP URIs in text. URIs are discarded, and isStopSym is a predicate that is not defined in this README:

t1Tokenize' :: Tokenizer Text ()
t1Tokenize' = do
  http <- lookAhead "http://"
  https <- lookAhead "https://"
  if (http || https)
     then (walkWhile (not . isSpace) >> discard)
     else do
       c <- peek
       walk
       if isStopSym c
         then emit
         else if c `elem` (" \t\r\n" :: [Char])
              then discard
              else do
                walkWhile (\c' -> c' == '_' || not (isSpace c' || isPunctuation c'))
                emit