Copyright	(c) 2020 Sam May
License	MPL-2.0
Maintainer	ag.eitilt@gmail.com
Stability	experimental
Portability	portable
Safe Haskell	Safe-Inferred
Language	Haskell98

Web.Willow.Common.Encoding

Contents

Types
Initialization
- Decoding
- Encoding
Transformations
Internal

Description

This module and the internal branch it heads implement the Encoding specification for translating text to and from UTF-8 and a selection of less-favoured but grandfathered encoding schemes. Because the standard's authors prioritized security, followed closely by compatibility with existing web pages, the algorithms described and the names associated with them do not perfectly match those given by the original specifications of the individual encodings.

Synopsis

data Encoding
- = Utf8
- | Utf16be
- | Utf16le
- | Big5
- | EucJp
- | EucKr
- | Gb18030
- | Gbk
- | Ibm866
- | Iso2022Jp
- | Iso8859_2
- | Iso8859_3
- | Iso8859_4
- | Iso8859_5
- | Iso8859_6
- | Iso8859_7
- | Iso8859_8
- | Iso8859_8i
- | Iso8859_10
- | Iso8859_13
- | Iso8859_14
- | Iso8859_15
- | Iso8859_16
- | Koi8R
- | Koi8U
- | Macintosh
- | MacintoshCyrillic
- | ShiftJis
- | Windows874
- | Windows1250
- | Windows1251
- | Windows1252
- | Windows1253
- | Windows1254
- | Windows1255
- | Windows1256
- | Windows1257
- | Windows1258
- | Replacement
- | UserDefined
data DecoderState
decoderEncoding :: DecoderState -> Encoding
decoderRemainder :: DecoderState -> ShortByteString
data ReparseData
data EncoderState
initialDecoderState :: Encoding -> DecoderState
setEncodingCertain :: Encoding -> DecoderState -> DecoderState
setRemainder :: ShortByteString -> DecoderState -> DecoderState
initialEncoderState :: Encoding -> EncoderState
decode :: DecoderState -> ByteString -> ([Either ShortByteString String], DecoderState)
decode' :: DecoderState -> ByteString -> (Text, DecoderState)
byteOrderMark :: ByteString -> (Maybe Encoding, ByteString)
finalizeDecode :: DecoderState -> [Either ShortByteString String]
finalizeDecode' :: DecoderState -> Text
decodeUtf8 :: ByteString -> ([Either ShortByteString String], DecoderState)
decodeUtf8NoBom :: ByteString -> ([Either ShortByteString String], DecoderState)
decodeUtf8' :: ByteString -> (Text, DecoderState)
decodeUtf8NoBom' :: ByteString -> (Text, DecoderState)
encode :: EncoderState -> Text -> ([Either Char ShortByteString], EncoderState)
encode' :: EncoderState -> Text -> (ByteString, EncoderState)
encodeUtf8 :: Text -> (ByteString, EncoderState)
decodeStep :: DecoderState -> ByteString -> (Maybe (Either ShortByteString String), DecoderState, ByteString)
encodeStep :: EncoderState -> Text -> Maybe (Either Char ShortByteString, EncoderState, Text)
decodeStep' :: DecoderState -> ByteString -> (Maybe String, DecoderState, ByteString)
encodeStep' :: EncoderState -> Text -> Maybe (ShortByteString, EncoderState, Text)
data InnerDecoderState
data InnerEncoderState

Types

data Encoding Source #

Encoding: encoding

All character encoding schemes supported by the HTML standard, defined as a bidirectional map between characters and binary sequences. Utf8 is strongly encouraged for new content (including all encoding purposes), but the others are retained for compatibility with existing pages.

Note that none of these mappings is a complete function: each leaves some characters or byte sequences unmapped, to one degree or another. No guarantee is made that a mapping round-trips.

Constructors

Utf8	The UTF-8 encoding for Unicode.
Utf16be	The UTF-16 encoding for Unicode, in big endian order. No encoder is provided for this scheme.
Utf16le	The UTF-16 encoding for Unicode, in little endian order. No encoder is provided for this scheme.
Big5	Big5, primarily covering traditional Chinese characters.
EucJp	EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212.
EucKr	EUC-KR, primarily covering Hangul.
Gb18030	The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters. Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization.
Gbk	GBK, primarily covering simplified Chinese characters. In practice, this is just `Gb18030` with a restricted set of encodable characters; the decoder is identical.
Ibm866	DOS and OS/2 code page for Cyrillic characters.
Iso2022Jp	A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana.
Iso8859_2	Latin-2 (Central European).
Iso8859_3	Latin-3 (South European and Esperanto).
Iso8859_4	Latin-4 (North European).
Iso8859_5	Latin/Cyrillic.
Iso8859_6	Latin/Arabic.
Iso8859_7	Latin/Greek (modern monotonic).
Iso8859_8	Latin/Hebrew (visual order).
Iso8859_8i	Latin/Hebrew (logical order).
Iso8859_10	Latin-6 (Nordic).
Iso8859_13	Latin-7 (Baltic Rim).
Iso8859_14	Latin-8 (Celtic).
Iso8859_15	Latin-9 (revision of ISO 8859-1 Latin-1, Western European).
Iso8859_16	Latin-10 (South-Eastern European).
Koi8R	KOI-8 specialized for Russian Cyrillic.
Koi8U	KOI-8 specialized for Ukrainian Cyrillic.
Macintosh	Mac OS Roman.
MacintoshCyrillic	Mac OS Cyrillic (as of Mac OS 9.0).
ShiftJis	The Windows variant (code page 932) of Shift JIS.
Windows874	ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots. Note that this encoding is always used instead of pure Latin/Thai.
Windows1250	The Windows extension and rearrangement of ISO 8859-2 Latin-2.
Windows1251	Windows Cyrillic.
Windows1252	The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-1.
Windows1253	Windows Greek (modern monotonic).
Windows1254	The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-5.
Windows1255	The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew.
Windows1256	Windows Arabic.
Windows1257	Windows Baltic.
Windows1258	Windows Vietnamese.
Replacement	The input is reduced to a single `\xFFFD` replacement character. No encoder is provided for this scheme.
UserDefined	Non-ASCII bytes (`\x80` through `\xFF`) are mapped to a portion of the Unicode Private Use Area (`\xF780` through `\xF7FF`).
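
The UserDefined mapping in the last row is simple enough to write out in full. The following is a minimal sketch using only base, not the module's own implementation; `userDefinedChar` is a hypothetical name, and the pass-through of ASCII bytes is taken from the Encoding standard rather than from the table above.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- | Hypothetical helper (not part of this module): map a single byte the
-- way the UserDefined row describes.  ASCII bytes map to themselves, and
-- bytes \x80 through \xFF land on \xF780 through \xF7FF.
userDefinedChar :: Word8 -> Char
userDefinedChar b
    | b < 0x80  = chr (fromIntegral b)
    | otherwise = chr (0xF780 + fromIntegral b - 0x80)
```

For instance, the byte `\x80` maps to `\xF780` and `\xFF` maps to `\xF7FF`, the two ends of the range named in the table.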

Instances

Bounded Encoding Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods minBound :: Encoding # maxBound :: Encoding #
Enum Encoding Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods succ :: Encoding -> Encoding # pred :: Encoding -> Encoding # toEnum :: Int -> Encoding # fromEnum :: Encoding -> Int # enumFrom :: Encoding -> [Encoding] # enumFromThen :: Encoding -> Encoding -> [Encoding] # enumFromTo :: Encoding -> Encoding -> [Encoding] # enumFromThenTo :: Encoding -> Encoding -> Encoding -> [Encoding] #
Eq Encoding Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods (==) :: Encoding -> Encoding -> Bool # (/=) :: Encoding -> Encoding -> Bool #
Ord Encoding Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods compare :: Encoding -> Encoding -> Ordering # (<) :: Encoding -> Encoding -> Bool # (<=) :: Encoding -> Encoding -> Bool # (>) :: Encoding -> Encoding -> Bool # (>=) :: Encoding -> Encoding -> Bool # max :: Encoding -> Encoding -> Encoding # min :: Encoding -> Encoding -> Encoding #
Read Encoding Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods readsPrec :: Int -> ReadS Encoding # readList :: ReadS [Encoding] # readPrec :: ReadPrec Encoding # readListPrec :: ReadPrec [Encoding] #
Show Encoding Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods showsPrec :: Int -> Encoding -> ShowS # show :: Encoding -> String # showList :: [Encoding] -> ShowS #
Hashable Encoding Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods hashWithSalt :: Int -> Encoding -> Int # hash :: Encoding -> Int #

data DecoderState Source #

All the data which needs to be tracked for correct behaviour in decoding a binary stream into readable text.

Instances

Eq DecoderState Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods (==) :: DecoderState -> DecoderState -> Bool # (/=) :: DecoderState -> DecoderState -> Bool #
Read DecoderState Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods readsPrec :: Int -> ReadS DecoderState # readList :: ReadS [DecoderState] # readPrec :: ReadPrec DecoderState # readListPrec :: ReadPrec [DecoderState] #
Show DecoderState Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods showsPrec :: Int -> DecoderState -> ShowS # show :: DecoderState -> String # showList :: [DecoderState] -> ShowS #

decoderEncoding :: DecoderState -> Encoding Source #

Retrieve the encoding scheme currently used by the decoder to decode the binary document stream.

decoderRemainder :: DecoderState -> ShortByteString Source #

Any leftover bytes at the end of the binary stream, which require further input to be processed in order to correctly map to a character or error value.

data ReparseData Source #

HTML: change the encoding

The data required to determine whether a new encoding would produce output identical to what the current one has already produced, and to restart the parsing with the new one if the two are incompatible. Values may be easily initialized via emptyReparseData.

Instances

Eq ReparseData Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods (==) :: ReparseData -> ReparseData -> Bool # (/=) :: ReparseData -> ReparseData -> Bool #
Read ReparseData Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods readsPrec :: Int -> ReadS ReparseData # readList :: ReadS [ReparseData] # readPrec :: ReadPrec ReparseData # readListPrec :: ReadPrec [ReparseData] #
Show ReparseData Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods showsPrec :: Int -> ReparseData -> ShowS # show :: ReparseData -> String # showList :: [ReparseData] -> ShowS #

data EncoderState Source #

All the data which needs to be tracked for correct behaviour in encoding readable text into a binary stream.

Instances

Eq EncoderState Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods (==) :: EncoderState -> EncoderState -> Bool # (/=) :: EncoderState -> EncoderState -> Bool #
Read EncoderState Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods readsPrec :: Int -> ReadS EncoderState # readList :: ReadS [EncoderState] # readPrec :: ReadPrec EncoderState # readListPrec :: ReadPrec [EncoderState] #
Show EncoderState Source #
Instance details Defined in Web.Willow.Common.Encoding.Common Methods showsPrec :: Int -> EncoderState -> ShowS # show :: EncoderState -> String # showList :: [EncoderState] -> ShowS #

Initialization

Decoding

initialDecoderState :: Encoding -> DecoderState Source #

The collection of data which, for any given encoding scheme, results in behaviour according to the vanilla decoder before any bytes have been read.

setEncodingCertain :: Encoding -> DecoderState -> DecoderState Source #

Instruct the decoder that the binary document stream is known to be in the given encoding.

setRemainder :: ShortByteString -> DecoderState -> DecoderState Source #

Store the given binary sequence as unparsable without further input, to be prepended to the beginning of the stream on the next decode or decode' call.

Encoding

initialEncoderState :: Encoding -> EncoderState Source #

The collection of data which, for any given encoding scheme, results in behaviour according to the vanilla encoder before any characters have been read.

Transformations

Decoding

The standard decode and decode' functions (and therefore the similar but higher-level functions which build on them) defer to a byte-order mark over the argument encoding. If this behaviour isn't desired (e.g., you want to force the parser to use the given encoding, even if it's not appropriate), strip the mark explicitly with byteOrderMark first:

(_, input') = byteOrderMark input
(output, state') = decode (initialDecoderState enc) input'

decode :: DecoderState -> ByteString -> ([Either ShortByteString String], DecoderState) Source #

Encoding: run an encoding's decoder with error mode fatal

Given a character encoding scheme, transform a dependent ByteString into portable Chars. If any byte sequences are meaningless or illegal, they are returned verbatim for error reporting; a Left should not be parsed further.

See decodeStep to decode only a minimal section, or decode' for simple error replacement. Call finalizeDecode on the returned DecoderState if no further bytes will be added to the document stream.

decode' :: DecoderState -> ByteString -> (Text, DecoderState) Source #

Encoding: decode

Given a character encoding scheme, transform a dependent ByteString into a portable Text. If any byte sequences are meaningless or illegal, they are replaced with the Unicode replacement character \xFFFD.

See decodeStep' to decode only a minimal section, or decode if the original data should be retained for custom error reporting. Call finalizeDecode' on the returned DecoderState if no further bytes will be added to the document stream.

byteOrderMark :: ByteString -> (Maybe Encoding, ByteString) Source #

Encoding: BOM sniff

Checks for a "byte-order mark" signature character in various encodings. If one is present, returns both the encoding found and the remainder of the stream; otherwise, returns Nothing and the input unchanged.
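
To make the check concrete, here is a minimal sketch of a BOM sniff over a plain list of bytes. It uses only base and is not the module's implementation: the `Bom` type and `sniffBom` are hypothetical stand-ins for Encoding and byteOrderMark, and the three signatures are the ones listed by the Encoding standard. Whether byteOrderMark checks exactly this set is an assumption.

```haskell
import Data.Word (Word8)

-- | Hypothetical stand-in for the relevant 'Encoding' constructors.
data Bom = BomUtf8 | BomUtf16be | BomUtf16le
    deriving (Eq, Show)

-- | Check the head of a byte list for one of the three signatures in the
-- Encoding standard's BOM sniff, stripping the signature when found.
sniffBom :: [Word8] -> (Maybe Bom, [Word8])
sniffBom (0xEF : 0xBB : 0xBF : rest) = (Just BomUtf8, rest)
sniffBom (0xFE : 0xFF : rest)        = (Just BomUtf16be, rest)
sniffBom (0xFF : 0xFE : rest)        = (Just BomUtf16le, rest)
sniffBom bytes                       = (Nothing, bytes)
```

Returning the remainder alongside the result lets a caller discard the mark and decode the rest with an encoding of its own choosing.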

finalizeDecode :: DecoderState -> [Either ShortByteString String] Source #

Explicitly indicate that the input stream will not contain any further bytes, and perform any finalization processing based on that.

See finalizeDecode' for simple error replacement.

finalizeDecode' :: DecoderState -> Text Source #

Explicitly indicate that the input stream will not contain any further bytes, and perform any finalization processing based on that.

See finalizeDecode if the original data should be retained for custom error reporting.
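
The decode-then-finalize pattern described above can be sketched as a small driver that threads the state through a list of chunks and flushes it at the end. The sketch is deliberately generic so that it depends only on the Prelude; `decodeChunks` is a hypothetical name, and the argument comments show which functions from this module have the matching shape.

```haskell
-- | Feed chunks through a stateful decoder in order, then flush whatever
-- the final state still holds.
decodeChunks
    :: Monoid out
    => (st -> chunk -> (out, st))  -- same shape as decode'
    -> (st -> out)                 -- same shape as finalizeDecode'
    -> st                          -- e.g. initialDecoderState enc
    -> [chunk]
    -> out
decodeChunks step flush = go
  where
    -- No chunks left: finalize using the last state.
    go st [] = flush st
    -- Decode one chunk and carry the updated state to the next.
    go st (c : cs) =
        let (out, st') = step st c
        in  out `mappend` go st' cs
```

Going by the signatures listed in this module, `decodeChunks decode' finalizeDecode' (initialDecoderState Utf8)` should typecheck as a function from a list of ByteString chunks to Text, provided Text has its usual Monoid instance.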

UTF-8

decodeUtf8 :: ByteString -> ([Either ShortByteString String], DecoderState) Source #

Read a binary stream of UTF-8 encoded text. If the stream begins with a UTF-8 byte-order mark, it's silently dropped (any other BOM is ignored but remains in the output). Fails (returning a Left) if the stream contains byte sequences which don't represent any character, or which encode a surrogate character.

See decodeUtf8' for simple error replacement, or decodeUtf8NoBom if the BOM should always be retained.

decodeUtf8NoBom :: ByteString -> ([Either ShortByteString String], DecoderState) Source #

Encoding: UTF-8 decode without BOM or fail

Read a binary stream of UTF-8 encoded text. If the stream begins with a byte-order mark, it is kept as the first character of the output. Fails (returning a Left) if the stream contains byte sequences which don't represent any character, or which encode a surrogate character.

See decodeUtf8NoBom' for simple error replacement, or decodeUtf8' if a redundant UTF-8 BOM should be dropped.

decodeUtf8' :: ByteString -> (Text, DecoderState) Source #

Encoding: UTF-8 decode

Read a binary stream of UTF-8 encoded text. If the stream begins with a UTF-8 byte-order mark, it's silently dropped (any other BOM is ignored but remains in the output). Any surrogate characters or invalid byte sequences are replaced with the Unicode replacement character \xFFFD.

See decodeUtf8 if the original data should be retained for custom error reporting, or decodeUtf8NoBom' if the BOM should always be retained.

decodeUtf8NoBom' :: ByteString -> (Text, DecoderState) Source #

Encoding: UTF-8 decode without BOM

Read a binary stream of UTF-8 encoded text. If the stream begins with a byte-order mark, it is kept as the first character of the output. Any surrogate characters or invalid byte sequences are replaced with the Unicode replacement character \xFFFD.

See decodeUtf8NoBom if the original data should be retained for custom error reporting, or decodeUtf8' if a redundant UTF-8 BOM should be dropped.

Encoding

encode :: EncoderState -> Text -> ([Either Char ShortByteString], EncoderState) Source #

Encoding: run an encoding's encoder with error mode fatal

Given a character encoding scheme, transform a portable Text into a sequence of bytes representing those characters. If the encoding scheme does not define a binary representation for a character in the input, the original Char is returned unchanged for custom error reporting.

See encodeStep to encode only a minimal section, or encode' for escaping with HTML-style character codes.

encode' :: EncoderState -> Text -> (ByteString, EncoderState) Source #

Encoding: encode

Given a character encoding scheme, transform a portable Text into a sequence of bytes representing those characters. If the encoding scheme does not define a binary representation for a character in the input, it is replaced with an HTML-style numeric escape (e.g. "&#8364;").

See encodeStep' to encode only a minimal section, or encode if the original data should be retained for custom error reporting.
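
For reference, the escape that the Encoding standard's "html" error mode substitutes for an unencodable character is a decimal numeric character reference. A minimal sketch using only base follows; `htmlEscape` is a hypothetical name and not part of this module.

```haskell
import Data.Char (ord)

-- | Hypothetical helper: build the decimal numeric character reference
-- for a character, in the form "&#" ++ digits ++ ";".
htmlEscape :: Char -> String
htmlEscape c = "&#" ++ show (ord c) ++ ";"
```

For example, `htmlEscape '\x20AC'` yields `"&#8364;"`.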

encodeUtf8 :: Text -> (ByteString, EncoderState) Source #

Encoding: UTF-8 encode

Transform a portable Text into a sequence of bytes according to the UTF-8 encoding scheme.

Continuations

decodeStep :: DecoderState -> ByteString -> (Maybe (Either ShortByteString String), DecoderState, ByteString) Source #

Encoding: run an encoding's decoder with error mode fatal

Read the smallest number of bytes from the head of the ByteString which would leave the decoder in a re-enterable state. If any byte sequences are meaningless or illegal, they are returned verbatim for error reporting; a Left should not be parsed further.

See decode to decode the entire string at once, or decodeStep' for simple error replacement.

encodeStep :: EncoderState -> Text -> Maybe (Either Char ShortByteString, EncoderState, Text) Source #

Encoding: run an encoding's encoder with error mode fatal

Read the smallest number of characters from the head of the Text which would leave the encoder in a re-enterable state. If the encoding scheme does not define a binary representation for a character in the input, the original Char is returned unchanged for custom error reporting.

See encode to encode the entire string at once, or encodeStep' for escaping with HTML-style character codes.

decodeStep' :: DecoderState -> ByteString -> (Maybe String, DecoderState, ByteString) Source #

Encoding: run an encoding's decoder with error mode replacement

Read the smallest number of bytes from the head of the ByteString which would leave the decoder in a re-enterable state. Any byte sequences which are meaningless or illegal are replaced with the Unicode replacement character \xFFFD.

See decode' to decode the entire string at once, or decodeStep if the original data should be retained for custom error reporting.

encodeStep' :: EncoderState -> Text -> Maybe (ShortByteString, EncoderState, Text) Source #

Encoding: run an encoding's encoder with error mode html

Read the smallest number of characters from the head of the Text which would leave the encoder in a re-enterable state. If the encoding scheme does not define a binary representation for a character in the input, it is replaced with an HTML-style numeric escape (e.g. "&#8364;").

See encode' to encode the entire string at once, or encodeStep if the original data should be retained for custom error reporting.
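
A step function shaped like encodeStep or encodeStep' can be driven to completion with an unfold. The following is a minimal sketch using only base; `runSteps` is a hypothetical name, and it assumes that Nothing signals that no further step can be taken (for example, because the input is exhausted), which the descriptions above do not state explicitly.

```haskell
import Data.List (unfoldr)

-- | Collect every output of a step function shaped like encodeStep,
-- stopping at the first Nothing.  The final state and any unread input
-- are discarded, so this only suits one-shot use.
runSteps :: (st -> inp -> Maybe (out, st, inp)) -> st -> inp -> [out]
runSteps step st0 inp0 = unfoldr next (st0, inp0)
  where
    next (st, inp) = case step st inp of
        Nothing               -> Nothing
        Just (out, st', inp') -> Just (out, (st', inp'))
```

Callers that need to resume later should keep the state and remaining input from the last successful step rather than use this helper.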

Internal

These types will almost certainly not be useful for anyone using the library, and are exported purely for internal usage. They can be safely ignored. Note, however, that they may be removed without warning.

data InnerDecoderState Source #

The union of all state variables tracked by the bytes-to-Char decoding algorithm of a single encoding scheme.

Instances

Eq InnerDecoderState Source #
Instance details Defined in Web.Willow.Common.Encoding Methods (==) :: InnerDecoderState -> InnerDecoderState -> Bool # (/=) :: InnerDecoderState -> InnerDecoderState -> Bool #
Read InnerDecoderState Source #
Instance details Defined in Web.Willow.Common.Encoding Methods readsPrec :: Int -> ReadS InnerDecoderState # readList :: ReadS [InnerDecoderState] # readPrec :: ReadPrec InnerDecoderState # readListPrec :: ReadPrec [InnerDecoderState] #
Show InnerDecoderState Source #
Instance details Defined in Web.Willow.Common.Encoding Methods showsPrec :: Int -> InnerDecoderState -> ShowS # show :: InnerDecoderState -> String # showList :: [InnerDecoderState] -> ShowS #

data InnerEncoderState Source #

The union of all state variables tracked by the Char-to-bytes encoding algorithm of a single encoding scheme.

Instances

Eq InnerEncoderState Source #
Instance details Defined in Web.Willow.Common.Encoding Methods (==) :: InnerEncoderState -> InnerEncoderState -> Bool # (/=) :: InnerEncoderState -> InnerEncoderState -> Bool #
Read InnerEncoderState Source #
Instance details Defined in Web.Willow.Common.Encoding Methods readsPrec :: Int -> ReadS InnerEncoderState # readList :: ReadS [InnerEncoderState] # readPrec :: ReadPrec InnerEncoderState # readListPrec :: ReadPrec [InnerEncoderState] #
Show InnerEncoderState Source #
Instance details Defined in Web.Willow.Common.Encoding Methods showsPrec :: Int -> InnerEncoderState -> ShowS # show :: InnerEncoderState -> String # showList :: [InnerEncoderState] -> ShowS #
