Safe Haskell	None
Language	Haskell2010

Pipes.Text.Encoding

Contents

The Lens or Codec type
'Viewing' the Text in a byte stream
Non-lens decoding functions
Re-encoding functions
Functions for latin and ascii text

Description

This module uses the stream decoding functions from Michael Snoyman's new text-stream-decode package to define decoding functions and lenses. The exported names conflict with names in Data.Text.Encoding but the module can otherwise be imported unqualified.

Synopsis

The Lens or Codec type

The Codec type is a simple specializion of the Lens' type synonymn used by the standard lens libraries, lens and lens-family. That type,

  type Lens' a b = forall f . Functor f => (b -> f b) -> (a -> f a)

is just an alias for a Prelude type. Thus you use any particular codec with the view / (^.) , zoom and over functions from either of those libraries; we presuppose neither since we already have access to the types they require.

type Codec = forall m r. Monad m => Lens' (Producer ByteString m r) (Producer Text m (Producer ByteString m r))

decode :: ((b -> Constant b b) -> a -> Constant b a) -> a -> b

decode is just the ordinary view or (^.) of the lens libraries; exported here under a name appropriate to the material. All of these are the same:

   decode utf8 p = decodeUtf8 p = view utf8 p = p ^. utf8

'Viewing' the Text in a byte stream

Each Codec-lens looks into a byte stream that is supposed to contain text. The particular 'Codec' lenses are named in accordance with the expected encoding, utf8, utf16LE etc. To turn a Codec into an ordinary function, use view / (^.) -- here also called decode:

  view utf8 :: Producer ByteString m r -> Producer Text m (Producer ByteString m r)
  decode utf8 Byte.stdin :: Producer Text IO (Producer ByteString IO r)
  Bytes.stdin ^. utf8 ::  Producer Text IO (Producer ByteString IO r)

Uses of a codec with view or (^.) or decode can always be replaced by the specialized decoding functions exported here, e.g.

  decodeUtf8 ::  Producer ByteString m r -> Producer Text m (Producer ByteString m r)
  decodeUtf8 Byte.stdin :: Producer Text IO (Producer ByteString IO r)

The stream of text that a Codec 'sees' in the stream of bytes begins at its head. At any point of decoding failure, the stream of text ends and reverts to (returns) the original byte stream. Thus if the first bytes are already un-decodable, the whole ByteString producer will be returned, i.e.

  view utf8 bytestream

will just come to the same as

  return bytestream

Where there is no decoding failure, the return value of the text stream will be an empty byte stream followed by its own return value. In all cases you must deal with the fact that it is a ByteString producer that is returned, even if it can be thrown away with Control.Monad.void

  void (Bytes.stdin ^. utf8) :: Producer Text IO ()

zoom converts a Text parser into a ByteString parser:

  zoom utf8 drawChar :: Monad m => StateT (Producer ByteString m r) m (Maybe Char)

or, using the type synonymn from Pipes.Parse:

  zoom utf8 drawChar :: Monad m => Parser ByteString m (Maybe Char)

Thus we can define a ByteString parser like this:

  withNextByte :: Parser ByteString m (Maybe Char, Maybe Word8))) 
  withNextByte = do char_ <- zoom utf8 Text.drawChar
                    byte_ <- Bytes.peekByte
                    return (char_, byte_)

Though withNextByte is partly defined with a Text parser drawChar; but it is a ByteString parser; it will return the first valid utf8-encoded Char in a ByteString, whatever its length, and the first byte of the next character, if they exist. Because we 'draw' one and 'peek' at the other, the parser as a whole only advances one Char's length along the bytestring, whatever that length may be. See the slightly more complex example 'decode.hs' in the haskellforall discussion of this type of byte stream parsing.

utf8 :: Codec

utf8Pure :: Codec

utf16LE :: Codec

utf16BE :: Codec

utf32LE :: Codec

utf32BE :: Codec

Non-lens decoding functions

These are functions with the simple type:

  decodeUtf8 :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)

Thus in general

    decodeUtf8 = view utf8
    decodeUtf16LE = view utf16LE

and so forth, but these forms may be more convenient (and give better type errors!) where lenses are not desired.

decodeUtf8 :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)

decodeUtf8Pure :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)

decodeUtf16LE :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)

decodeUtf16BE :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)

decodeUtf32LE :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)

decodeUtf32BE :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)

Re-encoding functions

These are simply defined

     encodeUtf8 = yield . TE.encodeUtf8

They are intended for use with for

     for Text.stdin encodeUtf8 :: Producer ByteString IO ()

which would have the effect of

     Text.stdin >-> Pipes.Prelude.map (TE.encodeUtf8)

using the encoding functions from Data.Text.Encoding

encodeUtf8 :: Monad m => Text -> Producer' ByteString m ()

encodeUtf16LE :: Monad m => Text -> Producer' ByteString m ()

encodeUtf16BE :: Monad m => Text -> Producer' ByteString m ()

encodeUtf32LE :: Monad m => Text -> Producer' ByteString m ()

encodeUtf32BE :: Monad m => Text -> Producer' ByteString m ()

Functions for latin and ascii text

ascii and latin encodings only use a small number of the characters Text recognizes; thus we cannot use the pipes Lens style to work with them. Rather we simply define functions each way.

encodeAscii :: Monad m => Producer Text m r -> Producer ByteString m (Producer Text m r)

encodeAscii reduces as much of your stream of Text actually is ascii to a byte stream, returning the rest of the Text at the first non-ascii Char

decodeAscii :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)

Reduce a byte stream to a corresponding stream of ascii chars, returning the unused ByteString upon hitting an un-ascii byte.

encodeIso8859_1 :: Monad m => Producer Text m r -> Producer ByteString m (Producer Text m r)

Reduce as much of your stream of Text actually is iso8859 or latin1 to a byte stream, returning the rest of the Text upon hitting any non-latin Char

decodeIso8859_1 :: Monad m => Producer ByteString m r -> Producer Text m (Producer ByteString m r)

Reduce a byte stream to a corresponding stream of ascii chars, returning the unused ByteString upon hitting the rare un-latinizable byte.