monoids-0.1.17: Monoids, specialized containers and a general map/reduce framework
Portability: non-portable (MPTCs)

UTF8-encoded Unicode characters can be parsed both forwards and backwards, since the start of each Char is clearly marked. This Monoid accumulates information about the characters it represents and reduces that information with a CharReducer: a Reducer monoid that also knows what to do with an invalidChar, a string of Word8 values that does not form a valid UTF8 character.

As this monoid parses Chars it feeds them upstream to the underlying CharReducer. Efficient left-to-right and right-to-left traversals are supplied so that a lazy ByteString can be parsed efficiently: it is split into its strict chunks, each chunk is traversed in one batch, and the edges between chunks are stitched together.
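The chunk-batching strategy is the monoid homomorphism law at work: reduce each strict chunk independently, then combine the per-chunk results with mappend. A minimal sketch of that shape, using the Sum monoid as a stand-in for UTF8 m (only the bytestring boot library is assumed; reduceChunks and countHighBytes are hypothetical names, not part of this package):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Monoid (Sum (..))

-- Reduce each strict chunk with one tight left fold, then combine the
-- per-chunk results monoidally ("stitching the edges together").
reduceChunks :: Monoid m => (B.ByteString -> m) -> BL.ByteString -> m
reduceChunks f = mconcat . map f . BL.toChunks

-- A toy reduction standing in for UTF8 m: count bytes >= 0x80.
countHighBytes :: BL.ByteString -> Int
countHighBytes = getSum . reduceChunks (B.foldl' step mempty)
  where
    step acc w = acc <> Sum (if w >= 0x80 then 1 else 0)

main :: IO ()
main = print (countHighBytes (BL.pack [0x41, 0xC3, 0xA9]))  -- 'A' then U+00E9; prints 2
```

Because mappend here is associative, the answer is independent of where the lazy ByteString happens to be chunked; the real UTF8 monoid additionally has to carry partial characters across chunk boundaries.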

Because this needs to be a Monoid and must return exactly the same result whether it parses forwards or backwards, it parses only canonical UTF8, unlike most Haskell UTF8 parsers, which blissfully accept illegal overlong encodings of a character.

Rejecting overlong encodings also closes a class of security issues in some scenarios: a byte-level filter that scans for, say, 0x2F ('/') can be bypassed when a downstream decoder accepts the same character spelled as the overlong sequence 0xC0 0xAF.
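To make the overlong-encoding hazard concrete, here is a self-contained sketch of canonical single-character decoding. decodeCanonical is a hypothetical helper, not the package's parser; it rejects truncated input, bad continuation bytes, and any multi-byte form whose code point could have been written in fewer bytes:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decode exactly one UTF8 character, canonically. A form is overlong
-- (non-canonical) when its code point falls below the minimum for its length.
decodeCanonical :: [Word8] -> Maybe Char
decodeCanonical [b]
  | b < 0x80 = Just (chr (fromIntegral b))
decodeCanonical (b : bs)
  | b .&. 0xE0 == 0xC0 = finish 1 0x80    (b .&. 0x1F) bs
  | b .&. 0xF0 == 0xE0 = finish 2 0x800   (b .&. 0x0F) bs
  | b .&. 0xF8 == 0xF0 = finish 3 0x10000 (b .&. 0x07) bs
  where
    finish n lo hd tl
      | length tl == n
      , all (\t -> t .&. 0xC0 == 0x80) tl   -- every tail byte is 10xxxxxx
      , let cp = foldl (\acc t -> acc `shiftL` 6 .|. fromIntegral (t .&. 0x3F))
                       (fromIntegral hd :: Int) tl
      , cp >= lo && cp <= 0x10FFFF          -- reject overlong/out-of-range forms
      , cp < 0xD800 || cp > 0xDFFF          -- reject UTF-16 surrogates
      = Just (chr cp)
      | otherwise = Nothing
decodeCanonical _ = Nothing

main :: IO ()
main = do
  print (decodeCanonical [0x2F])        -- canonical '/'    : Just '/'
  print (decodeCanonical [0xC0, 0xAF])  -- overlong '/'     : Nothing
  print (decodeCanonical [0xC3, 0xA9])  -- canonical U+00E9 : Just '\233'
```

A decoder with this shape yields one canonical byte sequence per Char, which is what lets forward and backward parses of the same bytes agree.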

NB: Due to the naive use of a list to track the tail of an unfinished character, this may exhibit O(n^2) behavior when parsing backwards through an invalid sequence of many bytes that all claim to be in the tail of a character.

module Data.Monoid.Reducer.Char
data UTF8 m
runUTF8 :: CharReducer m => UTF8 m -> m
Produced by Haddock version 2.4.1