Copyright | (c) 2006-2007 Duncan Coutts |
---|---|
License | BSD-style |
Maintainer | duncan@haskell.org |
Portability | portable (H98 + FFI) |
Safe Haskell | None |
Language | Haskell98 |
String encoding conversion
- convert :: EncodingName -> EncodingName -> ByteString -> ByteString
- type EncodingName = String
- convertFuzzy :: Fuzzy -> EncodingName -> EncodingName -> ByteString -> ByteString
- data Fuzzy
- convertStrictly :: EncodingName -> EncodingName -> ByteString -> Either ByteString ConversionError
- convertLazily :: EncodingName -> EncodingName -> ByteString -> [Span]
- data ConversionError
- reportConversionError :: ConversionError -> IOError
- data Span
Documentation
This module provides pure functions for converting the string encoding
of strings represented by lazy ByteStrings. This makes it easy to use
either in memory or with disk or network IO.
For example, a simple Latin1 to UTF-8 conversion program is just:
    import Codec.Text.IConv as IConv
    import Data.ByteString.Lazy as ByteString

    main = ByteString.interact (convert "LATIN1" "UTF-8")
Or you could lazily read in and convert a UTF-8 file to UTF-32 using:
    content <- fmap (IConv.convert "UTF-8" "UTF-32") (ByteString.readFile file)
This module uses the POSIX iconv() library function. The primary
advantages of using iconv are that it is widely available, most systems
support a wide range of string encodings, and the conversion speed
is typically good. The iconv library is available on all Unix systems
(since it is required by the POSIX.1 standard) and GNU libiconv is
available as a standalone library for other systems, including Windows.
Simple conversion API
convert
:: EncodingName | Name of input string encoding |
-> EncodingName | Name of output string encoding |
-> ByteString | Input text |
-> ByteString | Output text |
Convert text from one named string encoding to another.
- The conversion is done lazily.
- An exception is thrown if conversion between the two encodings is not supported.
- An exception is thrown if there are any encoding conversion errors.
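As a rough illustration of the points above, the following sketch converts a Latin-1 file to UTF-8. The helper name `latin1ToUtf8` and the command-line handling are hypothetical and only serve the example. Because the conversion is lazy, an exception caused by a conversion error may only surface while the output is being written.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as BL
import System.Environment (getArgs)

-- Hypothetical helper: a thin wrapper around 'IConv.convert'.
latin1ToUtf8 :: BL.ByteString -> BL.ByteString
latin1ToUtf8 = IConv.convert "LATIN1" "UTF-8"

main :: IO ()
main = do
  args <- getArgs
  case args of
    [inPath, outPath] -> do
      -- Lazy read: the input is consumed as the output is demanded.
      input <- BL.readFile inPath
      BL.writeFile outPath (latin1ToUtf8 input)
    _ -> putStrLn "usage: latin1-to-utf8 INPUT OUTPUT"
```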
type EncodingName = String
A string encoding name, e.g. "UTF-8" or "LATIN1".
The range of string encodings available is determined by the capabilities of the underlying iconv implementation.
When using the GNU C or libiconv libraries, the permitted values are listed
by the iconv --list command, and all combinations of the listed values
are supported.
Variant that is lax about conversion errors
convertFuzzy
:: Fuzzy | Whether to try and transliterate or discard characters with no direct conversion |
-> EncodingName | Name of input string encoding |
-> EncodingName | Name of output string encoding |
-> ByteString | Input text |
-> ByteString | Output text |
Convert text ignoring encoding conversion problems.
If invalid byte sequences are found in the input they are ignored and conversion continues if possible. This is not always possible especially with stateful encodings. No placeholder character is inserted into the output so there will be no indication that invalid byte sequences were encountered.
If there are characters in the input that have no direct corresponding
character in the output encoding then they are dealt with in one of two ways,
depending on the Fuzzy argument. We can try to Transliterate them into
the nearest corresponding character(s) or use a replacement character
(typically '?' or the Unicode replacement character). Alternatively they
can simply be Discarded.
In either case, no exceptions will occur. In the case of unrecoverable errors, the output will simply be truncated. This includes the case of unrecognised or unsupported encoding names, where the output will be empty.
- This function only works with the GNU iconv implementation which provides this feature beyond what is required by the iconv specification.
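The two modes can be sketched as follows. This assumes a GNU iconv implementation and that the encoding name "ASCII" is available on the system. The helper names are hypothetical, and the exact transliterated output depends on the iconv implementation, so none is shown.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as BL

-- Hypothetical helper: best-effort conversion to ASCII, approximating
-- characters that have no direct ASCII equivalent.
toAsciiTranslit :: BL.ByteString -> BL.ByteString
toAsciiTranslit = IConv.convertFuzzy IConv.Transliterate "UTF-8" "ASCII"

-- Hypothetical helper: best-effort conversion to ASCII, silently dropping
-- characters that have no direct ASCII equivalent.
toAsciiDiscard :: BL.ByteString -> BL.ByteString
toAsciiDiscard = IConv.convertFuzzy IConv.Discard "UTF-8" "ASCII"
```

Neither helper throws, so a caller that needs to know whether anything was lost has to use one of the stricter variants instead.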
Variants that are pedantic about conversion errors
convertStrictly
:: EncodingName | Name of input string encoding |
-> EncodingName | Name of output string encoding |
-> ByteString | Input text |
-> Either ByteString ConversionError | Output text or conversion error |
This variant does the conversion all in one go, so it is able to report any conversion errors up front. It exposes all the possible error conditions and never throws exceptions.
The disadvantage is that no output can be produced before the whole input is consumed. This might be problematic for very large inputs.
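A minimal sketch of handling the result follows. Note that, per the signature above, the converted output is carried in `Left` and the error in `Right`. The helper name `utf8ToLatin1` is hypothetical, and turning the error into an `IOError` via `reportConversionError` is just one possible choice.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as BL

-- Hypothetical helper: convert the whole input, or fail in IO with a
-- descriptive IOError if any conversion problem is found.
utf8ToLatin1 :: BL.ByteString -> IO BL.ByteString
utf8ToLatin1 input =
  case IConv.convertStrictly "UTF-8" "LATIN1" input of
    Left output -> return output                              -- success
    Right err   -> ioError (IConv.reportConversionError err)  -- failure
```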
convertLazily
:: EncodingName | Name of input string encoding |
-> EncodingName | Name of output string encoding |
-> ByteString | Input text |
-> [Span] | Output text spans |
This version provides a more complete but less convenient conversion interface. It exposes all the possible error conditions and never throws exceptions.
The conversion is still lazy. It returns a list of spans, where a span may
be an ordinary span of output text or a conversion error. This somewhat
complex interface allows both for lazy conversion and for precise reporting
of conversion problems. The other functions convert and convertStrictly
are actually simple wrappers on this function.
data ConversionError
UnsuportedConversion EncodingName EncodingName | The conversion from the input to output string encoding is not supported by the underlying iconv implementation. This is usually because a named encoding is not recognised or support for it was not enabled on this system. The POSIX standard does not guarantee that all possible combinations of recognised string encoding are supported, however most common implementations do support all possible combinations. |
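InvalidChar Int | This covers two possible conversion errors: a byte sequence in the input that is not valid in the input encoding, or a valid character in the input that has no corresponding character in the output encoding. Unfortunately iconv does not let us distinguish these two cases. In either case, the Int parameter gives the byte offset in the input of the unrecognised bytes or unconvertible character. |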
IncompleteChar Int | This error covers the case where the end of the input has trailing bytes that are the initial bytes of a valid character in the input encoding. In other words, it looks like the input ended in the middle of a multi-byte character. This would often be an indication that the input was somehow truncated. Again, the Int parameter is the byte offset in the input where the incomplete character starts. |
UnexpectedError Errno | An unexpected iconv error. The iconv spec lists a number of possible expected errors but does not guarantee that there might not be other errors. This error can occur either immediately, which might indicate that the iconv installation is messed up somehow, or it could occur later which might indicate resource exhaustion or some other internal iconv error. Use |
data Span
Output spans from encoding conversion. When nothing goes wrong we
expect just a bunch of Spans. If there are conversion errors we get other
span types.
Span !ByteString | An ordinary output span in the target encoding |
ConversionError !ConversionError | An error in the conversion process. If this occurs it will be the last span. |
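One way to consume the span list is sketched below: it gathers the converted text up to the first error, if any, and returns that error alongside it. The function name `convertPrefix` is hypothetical, and the sketch assumes the payload of `Span` is a strict ByteString chunk, which the table above does not state explicitly.

```haskell
import qualified Codec.Text.IConv as IConv
import qualified Data.ByteString.Lazy as BL

-- Hypothetical helper: converted output up to the first error, plus the
-- error itself if one occurred.
convertPrefix
  :: IConv.EncodingName
  -> IConv.EncodingName
  -> BL.ByteString
  -> (BL.ByteString, Maybe IConv.ConversionError)
convertPrefix from to input = go (IConv.convertLazily from to input)
  where
    go [] = (BL.empty, Nothing)
    go (IConv.Span chunk : rest) =
      -- The let binding is lazy, so output can be consumed incrementally.
      -- Assumption: 'chunk' is a strict ByteString.
      let (out, err) = go rest
      in (BL.append (BL.fromChunks [chunk]) out, err)
    -- An error span is always the last span, so stop here.
    go (IConv.ConversionError err : _) = (BL.empty, Just err)
```

The first component can be streamed to a file or socket as it is produced, while the second component is only known once the whole input has been consumed.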