Copyright	(c) 2006-2007 Duncan Coutts
License	BSD-style
Maintainer	duncan@haskell.org
Portability	portable (H98 + FFI)
Safe Haskell	None
Language	Haskell98

Codec.Text.IConv

Contents

Simple conversion API
Variant that is lax about conversion errors
Variants that are pedantic about conversion errors

Description

String encoding conversion

Documentation

This module provides pure functions for converting the string encoding of strings represented by lazy ByteStrings. This makes it easy to use either in memory or with disk or network IO.

For example, a simple Latin1 to UTF-8 conversion program is just:

import Codec.Text.IConv as IConv
import Data.ByteString.Lazy as ByteString

main = ByteString.interact (convert "LATIN1" "UTF-8")

Or you could lazily read in and convert a UTF-8 file to UTF-32 using:

content <- fmap (IConv.convert "UTF-8" "UTF-32") (ByteString.readFile file)

This module uses the POSIX iconv() library function. The primary advantages of iconv are that it is widely available, that most systems support a wide range of string encodings, and that the conversion speed is typically good. The iconv library is available on all Unix systems (since it is required by the POSIX.1 standard), and GNU libiconv is available as a standalone library for other systems, including Windows.

Simple conversion API

convert

Arguments

:: EncodingName	Name of input string encoding
-> EncodingName	Name of output string encoding
-> ByteString	Input text
-> ByteString	Output text

Convert text from one named string encoding to another.

The conversion is done lazily.
An exception is thrown if conversion between the two encodings is not supported.
An exception is thrown if there are any encoding conversion errors.
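
Because the conversion is lazy, an exception from convert only surfaces when the offending part of the output is demanded. The following is a minimal sketch, not part of the library, of one way to catch it: convertForced is a hypothetical helper that forces the whole result inside IO. It catches SomeException because the text above does not state the concrete exception type.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as L
import Control.Exception (SomeException, evaluate, try)

-- | Hypothetical helper (not part of the library): convert and force the
-- whole output inside IO, so that any exception raised by the lazy
-- conversion is caught here instead of escaping later.
convertForced
  :: IConv.EncodingName
  -> IConv.EncodingName
  -> L.ByteString
  -> IO (Either SomeException L.ByteString)
convertForced from to input = try $ do
  let output = IConv.convert from to input
  -- L.length walks every chunk, which forces the entire conversion.
  _ <- evaluate (L.length output)
  return output
```

Forcing the output gives up the streaming behaviour, so this approach suits inputs that fit comfortably in memory.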

type EncodingName = String

A string encoding name, e.g. "UTF-8" or "LATIN1".

The range of string encodings available is determined by the capabilities of the underlying iconv implementation.

When using the GNU C or libiconv libraries, the permitted values are listed by the iconv --list command, and all combinations of the listed values are supported.
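
Since the available encodings depend on the local iconv installation, it can be useful to probe for a conversion before relying on it. The sketch below uses a hypothetical helper, conversionSupported, built on convertStrictly (documented further down). It assumes that converting an empty input still opens the conversion, so that an unsupported pair is reported as UnsuportedConversion; any other outcome is treated as supported.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as L

-- | Hypothetical helper (not part of the library): does the local iconv
-- accept a conversion between these two encoding names?
-- Assumption: an empty input is enough to trigger the "unsupported" report.
conversionSupported :: IConv.EncodingName -> IConv.EncodingName -> Bool
conversionSupported from to =
  case IConv.convertStrictly from to L.empty of
    Right (IConv.UnsuportedConversion _ _) -> False
    _                                      -> True
```

For instance, conversionSupported "LATIN1" "UTF-8" would be expected to hold on systems where both names appear in the output of iconv --list.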

Variant that is lax about conversion errors

convertFuzzy

Arguments

:: Fuzzy	Whether to try and transliterate or discard characters with no direct conversion
-> EncodingName	Name of input string encoding
-> EncodingName	Name of output string encoding
-> ByteString	Input text
-> ByteString	Output text

Convert text ignoring encoding conversion problems.

If invalid byte sequences are found in the input they are ignored and conversion continues if possible. This is not always possible especially with stateful encodings. No placeholder character is inserted into the output so there will be no indication that invalid byte sequences were encountered.

If there are characters in the input that have no direct corresponding character in the output encoding then they are dealt with in one of two ways, depending on the Fuzzy argument. We can try to Transliterate them into the nearest corresponding character(s) or use a replacement character (typically '?' or the Unicode replacement character). Alternatively they can simply be Discarded.

In either case, no exceptions will occur. In the case of unrecoverable errors, the output will simply be truncated. This includes the case of unrecognised or unsupported encoding names; the output will be empty.

This function only works with the GNU iconv implementation, which provides this feature beyond what is required by the iconv specification.

data Fuzzy

Constructors

Transliterate
Discard
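
As a small illustration of the two Fuzzy modes, the sketch below defines a hypothetical helper, toAsciiLossy, that squeezes UTF-8 text into ASCII. What exactly replaces a non-ASCII character under Transliterate depends on the iconv implementation, so no particular output is promised here.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as L

-- | Hypothetical helper (not part of the library): lossy UTF-8 to ASCII.
-- With 'IConv.Transliterate', characters without an ASCII equivalent are
-- approximated or replaced; with 'IConv.Discard' they are dropped.
toAsciiLossy :: IConv.Fuzzy -> L.ByteString -> L.ByteString
toAsciiLossy mode = IConv.convertFuzzy mode "UTF-8" "ASCII"
```

Text that is already plain ASCII should pass through unchanged in either mode. Since no exception is raised and an unsupported conversion yields empty output, an unexpectedly empty result is worth checking for.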

Variants that are pedantic about conversion errors

convertStrictly

Arguments

:: EncodingName	Name of input string encoding
-> EncodingName	Name of output string encoding
-> ByteString	Input text
-> Either ByteString ConversionError	Output text (Left) or conversion error (Right)

This variant does the conversion all in one go, so it is able to report any conversion errors up front. It exposes all the possible error conditions and never throws exceptions.

The disadvantage is that no output can be produced before the whole input is consumed. This might be problematic for very large inputs.
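
Note that, going by the signature above, the converted text is the Left alternative and the error is the Right one, which is the reverse of the usual Either convention. A minimal sketch of a wrapper that restores the conventional orientation follows; convertEither is a hypothetical name, not a library function.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as L

-- | Hypothetical helper (not part of the library): same conversion as
-- 'IConv.convertStrictly', but with the error on the Left and the output
-- on the Right, as most Either-based code expects.
convertEither
  :: IConv.EncodingName
  -> IConv.EncodingName
  -> L.ByteString
  -> Either IConv.ConversionError L.ByteString
convertEither from to input =
  case IConv.convertStrictly from to input of
    Left output -> Right output  -- library: Left carries the output text
    Right err   -> Left err      -- library: Right carries the error
```

With this orientation the result composes with the Either monad, so several conversions can be chained and stop at the first error.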

convertLazily

Arguments

:: EncodingName	Name of input string encoding
-> EncodingName	Name of output string encoding
-> ByteString	Input text
-> [Span]	Output text spans

This version provides a more complete but less convenient conversion interface. It exposes all the possible error conditions and never throws exceptions.

The conversion is still lazy. It returns a list of spans, where a span may be an ordinary span of output text or a conversion error. This somewhat complex interface allows both for lazy conversion and for precise reporting of conversion problems. The other functions convert and convertStrictly are actually simple wrappers on this function.

data ConversionError

Constructors

UnsuportedConversion EncodingName EncodingName	The conversion from the input to output string encoding is not supported by the underlying iconv implementation. This is usually because a named encoding is not recognised or support for it was not enabled on this system. The POSIX standard does not guarantee that all possible combinations of recognised string encodings are supported; however, most common implementations do support all possible combinations.
InvalidChar Int	This covers two possible conversion errors: (1) there is a byte sequence in the input that is not valid in the input encoding; (2) there is a valid character in the input that has no corresponding character in the output encoding. Unfortunately iconv does not let us distinguish these two cases. In either case, the Int parameter gives the byte offset in the input of the unrecognised bytes or unconvertible character.
IncompleteChar Int	This error covers the case where the end of the input has trailing bytes that are the initial bytes of a valid character in the input encoding. In other words, it looks like the input ended in the middle of a multi-byte character. This would often be an indication that the input was somehow truncated. Again, the Int parameter is the byte offset in the input where the incomplete character starts.
UnexpectedError Errno	An unexpected iconv error. The iconv spec lists a number of possible expected errors but does not guarantee that there might not be other errors. This error can occur either immediately, which might indicate that the iconv installation is messed up somehow, or it could occur later which might indicate resource exhaustion or some other internal iconv error. Use `errnoToIOError` to get slightly more information on what the error could possibly be.
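
The four constructors can be handled with an ordinary case analysis. The sketch below is one possible way to turn them into messages; describeError and its wording are hypothetical and not part of the library.

```haskell
import qualified Codec.Text.IConv as IConv

-- | Hypothetical helper (not part of the library): a short human-readable
-- description of each kind of conversion error.
describeError :: IConv.ConversionError -> String
describeError err = case err of
  IConv.UnsuportedConversion from to ->
    "unsupported conversion: " ++ from ++ " -> " ++ to
  IConv.InvalidChar offset ->
    "invalid or unconvertible character at input byte " ++ show offset
  IConv.IncompleteChar offset ->
    "incomplete character at input byte " ++ show offset
  IConv.UnexpectedError _ ->
    "unexpected iconv error"
```

The Errno payload is ignored here for brevity; as noted above, errnoToIOError can recover slightly more detail from it.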

reportConversionError :: ConversionError -> IOError
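
Going by its type, reportConversionError makes it straightforward to surface a ConversionError as an ordinary IO failure. The following sketch assumes that use; convertIO is a hypothetical helper, not a library function.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as L

-- | Hypothetical helper (not part of the library): strict conversion in IO
-- that raises an IOError, built by 'IConv.reportConversionError', when the
-- conversion fails.
convertIO
  :: IConv.EncodingName
  -> IConv.EncodingName
  -> L.ByteString
  -> IO L.ByteString
convertIO from to input =
  case IConv.convertStrictly from to input of
    Left output -> return output
    Right err   -> ioError (IConv.reportConversionError err)
```

Callers can then handle the failure with the usual IO exception machinery, such as catch or try from Control.Exception.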

data Span

Output spans from encoding conversion. When nothing goes wrong we expect just a bunch of Spans. If there are conversion errors we get other span types.

Constructors

Span !ByteString	An ordinary output span in the target encoding
ConversionError !ConversionError	An error in the conversion process. If this occurs it will be the last span.
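
To illustrate how the span list from convertLazily might be consumed, the sketch below counts the ordinary output spans and picks out the terminating error, if there is one. summarise is a hypothetical helper; it relies only on the statement above that an error span, when present, is the last one.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as L

-- | Hypothetical helper (not part of the library): the number of ordinary
-- output spans, together with the final conversion error if one occurred.
summarise :: [IConv.Span] -> (Int, Maybe IConv.ConversionError)
summarise = foldr step (0, Nothing)
  where
    -- An ordinary span adds one to the count of whatever follows it.
    step (IConv.Span _) (n, err) = (n + 1, err)
    -- An error span ends the list, so anything after it is disregarded.
    step (IConv.ConversionError e) _ = (0, Just e)

-- | Example use: summarise a UTF-8 to Latin1 conversion.
summariseUtf8ToLatin1 :: L.ByteString -> (Int, Maybe IConv.ConversionError)
summariseUtf8ToLatin1 = summarise . IConv.convertLazily "UTF-8" "LATIN1"
```

A streaming consumer would instead process each Span as it arrives and stop at the first ConversionError, which keeps the conversion lazy.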