tokenizer-streaming: A variant of tokenizer-monad that supports streaming.

[ gpl, library, text ]

This monad transformer is a modification of tokenizer-monad that can work on streams of text/string chunks or even on (Unicode) bytestring streams.


Versions 0.1.0.0, 0.1.0.1
Change log CHANGELOG.md
Dependencies base (>=4.9 && <5.0), bytestring, mtl, streaming, streaming-bytestring (>=0.1.6), streaming-commons (>=0.2.1.0 && <0.3), text, tokenizer-monad (>=0.2.2.0 && <0.3) [details]
License GPL-3.0-only
Copyright (c) 2019 Enum Cohrs
Author Enum Cohrs
Maintainer darcs@enumeration.eu
Category Text
Source repo head: darcs get https://hub.darcs.net/enum/tokenizer-streaming
Uploaded by implementation at Tue Jan 22 21:41:50 UTC 2019
Distributions NixOS:0.1.0.1
Downloads 155 total (11 in the last 30 days)
Status Hackage Matrix CI
Docs available [build log]
Last success reported on 2019-01-22 [all 1 reports]



Readme for tokenizer-streaming-0.1.0.1


tokenizer-streaming

Motivation: You might have stumbled upon the package tokenizer-monad, another project of mine for writing tokenizers that act on pure text/strings. However, there are situations in which you cannot keep all the text in memory: for example, you might want to tokenize text from network streams or from large corpus files.

Main idea: A monad transformer called TokenizerT implements exactly the same methods as Tokenizer from tokenizer-monad, so all existing tokenizers can be ported without code changes (provided you used MonadTokenizer in the type signatures).
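For instance, a whitespace tokenizer written only against the MonadTokenizer class stays agnostic about whether it runs on in-memory text or on a stream. A minimal sketch, assuming the `untilEOT`, `peek`, `walk`, `walkWhile`, `emit` and `discard` combinators of tokenizer-monad (check the Haddock docs for the exact interface):

```haskell
import Control.Monad.Tokenizer (MonadTokenizer (..), untilEOT)
import Data.Char (isSpace)

-- Split the input at whitespace, emitting each run of non-space
-- characters as one token. Because the type only demands
-- MonadTokenizer, the same code runs in the pure Tokenizer monad
-- and in the streaming TokenizerT transformer.
wordTokenizer :: MonadTokenizer m => m ()
wordTokenizer = untilEOT $ do
  c <- peek
  if isSpace c
    then walk >> discard                    -- skip the separator
    else walkWhile (not . isSpace) >> emit  -- emit the word
```

Running it purely (via tokenizer-monad's runner) or over a stream (via tokenizer-streaming's runner) then only differs in the function you hand `wordTokenizer` to.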

Supported text types

  • streams of Char lists can be tokenized into streams of Char lists
  • streams of strict Text can be tokenized into streams of strict Text
  • streams of lazy Text can be tokenized into streams of lazy Text
  • streams of strict ASCII ByteStrings can be tokenized into streams of strict ASCII ByteStrings
  • streams of lazy ASCII ByteStrings can be tokenized into streams of lazy ASCII ByteStrings
  • bytestring streams (from streaming-bytestring) with Unicode encodings (UTF-8, UTF-16 LE & BE, UTF-32 LE & BE) can be tokenized into streams of strict Text
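The last case is what makes the package attractive for large corpus files: bytes are decoded on the fly and tokens come out as strict Text, in constant memory. A rough sketch of how this might look; the runner name `runTokenizerUtf8` is a hypothetical stand-in for whatever UTF-8 entry point the package actually exposes, so consult the module documentation for the real name and signature:

```haskell
import Control.Monad.Tokenizer (MonadTokenizer (..), untilEOT)
import qualified Data.ByteString.Streaming as Q
import qualified Streaming.Prelude as S
import qualified Data.Text.IO as T

-- A trivial tokenizer, one character per token, written only
-- against the MonadTokenizer class so it is stream-agnostic.
charTokens :: MonadTokenizer m => m ()
charTokens = untilEOT (walk >> emit)

-- Hypothetical: `runTokenizerUtf8` stands in for the package's
-- decoder/runner that turns a streaming-bytestring byte stream
-- into a Stream (Of Text) of tokens.
main :: IO ()
main =
  S.mapM_ T.putStrLn $          -- print each strict Text token
    runTokenizerUtf8 charTokens -- decode UTF-8, tokenize
      Q.stdin                   -- bytes read in constant memory
```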