tokenizer-streaming: A variant of tokenizer-monad that supports streaming.

[ gpl, library, text ]

This monad transformer is a modification of tokenizer-monad that can work on streams of text/string chunks or even on (Unicode) bytestring streams. Thus, it can tokenize texts that do not fit into memory.



Versions 0.1.0.0, 0.1.0.1
Change log CHANGELOG.md
Dependencies base (>=4.9 && <5.0), bytestring, mtl, streaming, streaming-bytestring (>=0.1.6), streaming-commons (>=0.2.1.0 && <0.3), text, tokenizer-monad (>=0.2.2.0 && <0.3) [details]
License GPL-3.0-only
Copyright (c) 2019 Enum Cohrs
Author Enum Cohrs
Maintainer darcs@enumeration.eu
Revised Revision 1 made by implementation at 2019-01-22T21:35:56Z
Category Text
Source repo head: darcs get https://hub.darcs.net/enum/tokenizer-streaming
Uploaded by implementation at 2019-01-22T21:35:09Z

Readme for tokenizer-streaming-0.1.0.0


tokenizer-streaming

Motivation: You might have stumbled upon the package tokenizer-monad, another project of mine for writing tokenizers that act on pure text/strings. However, there are situations where you cannot keep all the text in memory: you might want to tokenize text from network streams or from large corpus files.

Main idea: A monad transformer called TokenizerT implements exactly the same methods as Tokenizer from tokenizer-monad, so all existing tokenizers can be ported without code changes (provided you used MonadTokenizer in their type signatures).
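For illustration, here is a minimal sketch of a whitespace-splitting tokenizer written only against the MonadTokenizer class, assuming the Char-based class and helpers (`pop`, `walkWhile`, `emit`, `discard`, `untilEOT`) exported by tokenizer-monad's Control.Monad.Tokenizer; verify the exact names against the installed version:

```haskell
import Control.Monad.Tokenizer  -- from tokenizer-monad

-- A word tokenizer written polymorphically against MonadTokenizer,
-- so the same definition runs in the pure Tokenizer monad or in
-- tokenizer-streaming's TokenizerT transformer.
wordTokenizer :: MonadTokenizer m => m ()
wordTokenizer = untilEOT $ do
  c <- pop                               -- consume one character
  if c `elem` " \t\r\n"
    then discard                         -- drop buffered whitespace
    else do
      walkWhile (`notElem` " \t\r\n")    -- consume the rest of the word
      emit                               -- emit the buffer as one token
```

Because the definition mentions only the class, switching from in-memory to streaming tokenization is a matter of choosing a different runner function, not rewriting the tokenizer itself.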

Supported text types

  • streams of Char lists can be tokenized into streams of Char lists
  • streams of strict Text can be tokenized into streams of strict Text
  • streams of lazy Text can be tokenized into streams of lazy Text
  • streams of strict ASCII ByteStrings can be tokenized into streams of strict ASCII ByteStrings
  • streams of lazy ASCII ByteStrings can be tokenized into streams of lazy ASCII ByteStrings
  • bytestring streams (from streaming-bytestring) with Unicode encodings (UTF-8, UTF-16 LE & BE, UTF-32 LE & BE) can be tokenized into streams of strict Text
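A sketch of the last case, tokenizing a UTF-8 byte stream into strict Text tokens without loading the file into memory. The module path `Control.Monad.Tokenizer.Streaming` and the names `runTokenizerCT` and `UTF8` are placeholders for whatever tokenizer-streaming actually exports; consult the package's Haddock documentation before use:

```haskell
import qualified Data.ByteString.Streaming as Q   -- streaming-bytestring
import qualified Streaming.Prelude as S           -- streaming
import Control.Monad.Tokenizer                    -- tokenizer-monad
import Control.Monad.Tokenizer.Streaming          -- module path assumed

-- Split on ASCII spaces; written against MonadTokenizer only.
spaceTok :: MonadTokenizer m => m ()
spaceTok = untilEOT $ do
  c <- pop
  if c == ' '
    then discard
    else walkWhile (/= ' ') >> emit

main :: IO ()
main =
  -- 'runTokenizerCT' and 'UTF8' stand in for the package's real
  -- decode-and-run entry point; only the shape of the pipeline
  -- (byte stream in, Text token stream out) is the point here.
  S.mapM_ print (runTokenizerCT UTF8 spaceTok (Q.readFile "corpus.txt"))
```

The byte stream is decoded chunk by chunk, so only the current chunk and the tokenizer's buffer are ever held in memory.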