hxt-8.5.0: A collection of tools for processing XML with Haskell.

Portability: portable
Stability: experimental
Maintainer: Uwe Schmidt (uwe@fh-wedel.de)

Text.XML.HXT.DOM.Unicode

Description

Unicode and UTF-8 Conversion Functions

Unicode Type declarations

type Unicode = Char

Unicode is represented as the Char type. The precondition for this is that the compiler supports the Unicode character range (e.g. ghc, but not hugs).

type UString = [Unicode]

the type for Unicode strings

type UTF8Char = Char

UTF-8 characters are represented by the Char type

type UTF8String = String

UTF-8 strings are implemented as Haskell strings

type DecodingFct = String -> (UString, [String])

Decoding function that returns a pair: the decoded string and a list of decoding errors
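
The shape of such a result can be illustrated with a toy decoder. This is only a sketch: `asciiDecoder` and its error message format are hypothetical and not part of hxt, and the type synonyms are restated locally so the block stands alone.

```haskell
import Data.Char (ord)

type Unicode = Char
type UString = [Unicode]
type DecodingFct = String -> (UString, [String])

-- Hypothetical decoder: keeps 7-bit characters, drops everything else
-- and records one message per dropped character.
asciiDecoder :: DecodingFct
asciiDecoder = foldr step ([], [])
  where
    step c (res, errs)
      | ord c < 128 = (c : res, errs)
      | otherwise   = (res, ("illegal char code " ++ show (ord c)) : errs)
```

A caller would typically inspect the second component first and only use the decoded string when the error list is empty.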

type DecodingFctEmbedErrors = String -> UStringWithErrors

Decoding function where decoding errors are interleaved with decoded characters

XML char predicates

isXmlChar :: Unicode -> Bool

checking for valid XML characters
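
As a point of reference, the Char production of XML 1.0 can be written down directly as a predicate. The name `isXmlCharSketch` is hypothetical; whether hxt implements the check in exactly this form is not stated here.

```haskell
-- Sketch of the XML 1.0 Char production:
-- #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
isXmlCharSketch :: Char -> Bool
isXmlCharSketch c =
     c == '\x9' || c == '\xA' || c == '\xD'
  || (c >= '\x20'    && c <= '\xD7FF')
  || (c >= '\xE000'  && c <= '\xFFFD')
  || (c >= '\x10000' && c <= '\x10FFFF')
```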

isXmlLatin1Char :: Unicode -> Bool

test for a legal latin1 XML char

isXmlSpaceChar :: Unicode -> Bool

checking for XML space character: \n, \r, \t and " "

isXml11SpaceChar :: Unicode -> Bool

checking for XML 1.1 space character: in addition to the XML 1.0 space characters, 0x85 and 0x2028 are accepted

see also : isXmlSpaceChar
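
The two space predicates described above can be sketched as follows. The names `isSpace10` and `isSpace11` are hypothetical stand-ins, and the sketch only mirrors the character lists given in the descriptions.

```haskell
-- XML 1.0 white space: newline, carriage return, tab and blank
isSpace10 :: Char -> Bool
isSpace10 c = c `elem` "\n\r\t "

-- XML 1.1 variant: additionally accepts NEL (0x85) and LINE SEPARATOR (0x2028)
isSpace11 :: Char -> Bool
isSpace11 c = isSpace10 c || c == '\x85' || c == '\x2028'
```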

isXmlNameChar :: Unicode -> Bool

checking for XML name character

isXmlNameStartChar :: Unicode -> Bool

checking for XML name start character

see also : isXmlNameChar

isXmlNCNameChar :: Unicode -> Bool

checking for XML NCName character: no ":" allowed

see also : isXmlNameChar

isXmlNCNameStartChar :: Unicode -> Bool

checking for XML NCName start character: no ":" allowed

see also : isXmlNameChar, isXmlNCNameChar

isXmlPubidChar :: Unicode -> Bool

checking for XML public id character

isXmlLetter :: Unicode -> Bool

checking for XML letter

isXmlBaseChar :: Unicode -> Bool

checking for XML base character

isXmlIdeographicChar :: Unicode -> Bool

checking for XML ideographic character

isXmlCombiningChar :: Unicode -> Bool

checking for XML combining character

isXmlDigit :: Unicode -> Bool

checking for XML digit

isXmlExtender :: Unicode -> Bool

checking for XML extender

isXmlControlOrPermanentlyUndefined :: Unicode -> Bool

checking for XML control or permanently undefined characters, whose use is discouraged

see Errata to XML1.0 (http://www.w3.org/XML/xml-V10-2e-errata) No 46

Document authors are encouraged to avoid compatibility characters, as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]). The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:

UTF-8 and Unicode conversion functions

utf8ToUnicode :: DecodingFct

UTF-8 to Unicode conversion with deletion of leading byte order mark, as described in XML standard F.1
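
A minimal sketch of such a conversion is given below, assuming that every Char of the input stands for one byte (0..255). `decodeUtf8Sketch` and its error messages are hypothetical and are not hxt's implementation; in particular, the sketch does not reject overlong encodings or surrogate code points.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- Returns the decoded string and a list of error messages.
decodeUtf8Sketch :: String -> (String, [String])
decodeUtf8Sketch = go . dropBom
  where
    -- a leading byte order mark (EF BB BF) is removed
    dropBom ('\xEF' : '\xBB' : '\xBF' : rest) = rest
    dropBom s                                 = s

    isCont x = ord x .&. 0xC0 == 0x80

    emit ch rest = let (r, e) = go rest in (ch : r, e)

    go [] = ([], [])
    go (c : cs)
      | b < 0x80           = emit (chr b) cs
      | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)
      | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)
      | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)
      | otherwise          = bad
      where
        b   = ord c
        -- report the offending lead byte and resume with the next byte
        bad = let (r, e) = go cs in (r, ("illegal UTF-8 byte " ++ show b) : e)
        multi n hi
          | length conts == n && all isCont conts && v <= 0x10FFFF
                      = emit (chr v) rest
          | otherwise = bad
          where
            (conts, rest) = splitAt n cs
            -- shift in the low six bits of every continuation byte
            v = foldl (\a x -> (a `shiftL` 6) .|. (ord x .&. 0x3F)) hi conts
```

On malformed input this sketch drops the offending lead byte and continues decoding, which matches the "result plus error list" shape of DecodingFct but is only one possible recovery strategy.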

latin1ToUnicode :: String -> UString

code conversion from latin1 to Unicode

ucs2ToUnicode :: String -> UString

UCS-2 to Unicode conversion with byte order mark analysis

ucs2BigEndianToUnicode :: String -> UString

UCS-2 big endian to Unicode conversion

ucs2LittleEndianToUnicode :: String -> UString

UCS-2 little endian to Unicode conversion
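
The three UCS-2 conversions can be sketched as below, again assuming one byte per input Char. The names `ucs2BE`, `ucs2LE` and `ucs2Auto` are hypothetical, and the handling of a trailing odd byte and of input without a byte order mark are assumptions of this sketch, not documented hxt behaviour.

```haskell
import Data.Char (chr, ord)

-- Big endian: high byte first
ucs2BE :: String -> String
ucs2BE (hi : lo : rest) = chr (ord hi * 256 + ord lo) : ucs2BE rest
ucs2BE _                = []   -- a trailing odd byte is silently dropped here

-- Little endian: low byte first
ucs2LE :: String -> String
ucs2LE (lo : hi : rest) = chr (ord lo + ord hi * 256) : ucs2LE rest
ucs2LE _                = []

-- Byte order mark analysis: FE FF selects big endian, FF FE little endian.
-- Without a mark, big endian is assumed (an assumption of this sketch).
ucs2Auto :: String -> String
ucs2Auto ('\xFE' : '\xFF' : rest) = ucs2BE rest
ucs2Auto ('\xFF' : '\xFE' : rest) = ucs2LE rest
ucs2Auto s                        = ucs2BE s
```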

utf16beToUnicode :: String -> UString

UTF-16 big endian to Unicode conversion with removal of byte order mark

utf16leToUnicode :: String -> UString

UTF-16 little endian to Unicode conversion with removal of byte order mark

unicodeCharToUtf8 :: Unicode -> UTF8String

conversion from Unicode (Char) to a UTF-8 encoded string.

unicodeToUtf8 :: UString -> UTF8String

conversion from Unicode strings (UString) to UTF-8 encoded strings.
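
The encoding direction can be sketched with the standard UTF-8 bit layout. `encodeCharUtf8` and `encodeUtf8` are hypothetical names; each Char of the result holds one byte (0..255), in line with the UTF8String representation described above.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- Encode a single code point as 1 to 4 bytes.
encodeCharUtf8 :: Char -> String
encodeCharUtf8 c
  | i < 0x80    = [chr i]
  | i < 0x800   = map chr [0xC0 .|. (i `shiftR` 6), cont 0]
  | i < 0x10000 = map chr [0xE0 .|. (i `shiftR` 12), cont 6, cont 0]
  | otherwise   = map chr [0xF0 .|. (i `shiftR` 18), cont 12, cont 6, cont 0]
  where
    i      = ord c
    -- continuation byte: 10xxxxxx with six payload bits taken at offset s
    cont s = 0x80 .|. ((i `shiftR` s) .&. 0x3F)

-- Encode a whole string by concatenating the per-character encodings.
encodeUtf8 :: String -> String
encodeUtf8 = concatMap encodeCharUtf8
```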

unicodeToXmlEntity :: UString -> String

substitute all Unicode characters that are not legal 1-byte UTF-8 XML characters with a character reference.

This function can be used to translate all text nodes and attribute values into pure ascii.

see also : unicodeToLatin1
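
The substitution can be sketched as follows. `toAsciiWithCharRefs` is a hypothetical name, the use of decimal references is an assumption, and the sketch does not check whether the retained 7-bit characters are legal XML characters.

```haskell
import Data.Char (ord)

-- Keep 7-bit characters, replace everything else by a decimal character reference.
toAsciiWithCharRefs :: String -> String
toAsciiWithCharRefs = concatMap esc
  where
    esc c
      | ord c < 128 = [c]
      | otherwise   = "&#" ++ show (ord c) ++ ";"
```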

unicodeToLatin1 :: UString -> String

substitute all Unicode characters that are not legal latin1 XML characters with a character reference.

This function can be used to translate all text nodes and attribute values into ISO latin1.

see also : unicodeToXmlEntity

unicodeRemoveNoneAscii :: UString -> String

removes all non ascii chars; may be used to transform a document into a pure ascii representation by removing all non ascii chars from tag and attribute names

see also : unicodeRemoveNoneLatin1, unicodeToXmlEntity

unicodeRemoveNoneLatin1 :: UString -> String

removes all non latin1 chars; may be used to transform a document into a pure latin1 representation by removing all non latin1 chars from tag and attribute names

see also : unicodeRemoveNoneAscii, unicodeToLatin1
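
Both removal functions amount to a filter on the code point range. The sketch below uses hypothetical names and assumes the usual bounds of 0x7F for ascii and 0xFF for latin1.

```haskell
-- Drop every character outside the 7-bit range.
removeNonAscii :: String -> String
removeNonAscii = filter (< '\x80')

-- Drop every character outside the 8-bit latin1 range.
removeNonLatin1 :: String -> String
removeNonLatin1 = filter (< '\x100')
```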

intToCharRef :: Int -> String

convert a Unicode code point into an XML character reference.

see also : intToCharRefHex

intToCharRefHex :: Int -> String

convert a Unicode code point into an XML hexadecimal character reference.

see also : intToCharRef
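
The two reference formats differ only in the number base and the `x` prefix. The sketch below uses hypothetical names and assumes a non-negative argument; whether hxt emits upper or lower case hex digits is not stated here, and this sketch produces lower case because that is what `showHex` yields.

```haskell
import Numeric (showHex)

-- Decimal character reference, e.g. code point 228 as "&#228;"
charRefDec :: Int -> String
charRefDec i = "&#" ++ show i ++ ";"

-- Hexadecimal character reference, e.g. code point 228 as "&#xe4;"
charRefHex :: Int -> String
charRefHex i = "&#x" ++ showHex i ";"
```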

getDecodingFct :: String -> Maybe DecodingFct

the lookup function for selecting the decoding function

getDecodingFctEmbedErrors :: String -> Maybe DecodingFctEmbedErrors

the lookup function for selecting the decoding function that embeds decoding errors in the result

getOutputEncodingFct :: String -> Maybe (String -> UString)

the lookup function for selecting the encoding function
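
A lookup of this kind is commonly built on an association list keyed by encoding name. The sketch below is illustrative only: the table contents, the placeholder identity decoders, the case-insensitive matching and the names `decoderTable` and `lookupDecoder` are all assumptions, not a description of which encodings hxt supports.

```haskell
import Data.Char (toUpper)

type DecodingFct = String -> (String, [String])

-- Hypothetical table; the decoders are placeholders that pass input through.
decoderTable :: [(String, DecodingFct)]
decoderTable =
  [ ("US-ASCII",   \s -> (s, []))
  , ("ISO-8859-1", \s -> (s, []))
  ]

-- Returns Nothing for an unknown encoding name.
lookupDecoder :: String -> Maybe DecodingFct
lookupDecoder name = lookup (map toUpper name) decoderTable
```

Returning a Maybe lets the caller decide how to react to an unknown encoding name, for instance by reporting an error or by falling back to a default decoder.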

normalizeNL :: String -> String

White Space (XML Standard 2.3) and end of line handling (2.11)

#x0D and #x0D#x0A are mapped to #x0A
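
The mapping stated above can be sketched by pattern matching on the input. `normalizeNLSketch` is a hypothetical name; the order of the first two clauses matters, since the two-character sequence must be tried before the single carriage return.

```haskell
-- Map "\r\n" and a lone "\r" to "\n"; all other characters pass through.
normalizeNLSketch :: String -> String
normalizeNLSketch ('\r' : '\n' : rest) = '\n' : normalizeNLSketch rest
normalizeNLSketch ('\r' : rest)        = '\n' : normalizeNLSketch rest
normalizeNLSketch (c : rest)           = c    : normalizeNLSketch rest
normalizeNLSketch []                   = []
```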