The Char type
The Haskell 98 Report (Characters and Strings) states that the type Char represents Unicode, which seems to be the canonical choice. In GHC and Hugs, the functions of the Char module work with Unicode, with one divergence from the Report:
- isAlpha selects Unicode alphabetic characters, not just the union of lower- and upper-case letters.
More sophisticated functions could be provided by additional libraries.
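The divergence can be observed with characters that are alphabetic but have no case. The following is a minimal sketch, assuming GHC's Data.Char; the helper isCaselessLetter and the sample list are hypothetical names introduced only for illustration.

```haskell
import Data.Char (isAlpha, isLower, isUpper)

-- | Hypothetical helper: True for characters that isAlpha accepts but that
-- are neither lower-case nor upper-case letters.
isCaselessLetter :: Char -> Bool
isCaselessLetter c = isAlpha c && not (isLower c) && not (isUpper c)

-- | A few sample characters: Latin 'a' and 'Z', Hebrew alef (U+05D0),
-- a CJK ideograph (U+4E2D), and the digit '1'.
samples :: [(Char, Bool)]
samples = [ (c, isCaselessLetter c) | c <- ['a', 'Z', '\x05D0', '\x4E2D', '1'] ]
```

Under GHC, only the Hebrew and CJK characters should be reported as caseless letters, which is exactly the set that a definition of isAlpha as "lower or upper" would miss.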
Input and Output
- All character-based I/O in Hugs and jhc-compiled programs uses the encoding of the current locale.
- Other implementations perform I/O on bytes treated as characters, i.e. belonging to the Latin-1 subset.
Assuming we retain Unicode as the representation of Char:
- Flexible handling of character encodings will be needed, but there is no existing implementation. Should we specify it or leave room for experimentation?
- BinaryIO is needed anyway, and would provide a base for these encodings.
- A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users. However it might not be in the Prelude if we shrink the Prelude.
Strings in System functions
Native system calls use varying representations of strings:
- Unix-like systems and many others use byte strings, which may use various encodings (or may not be character data at all).
- The NTFS file system (Windows) stores filenames in UTF-16, and the Win32 interface provides functions using UTF-16. Since Windows NT, the byte-level interface is a compatibility layer over UTF-16.
- Hugs exchanges byte-strings using a byte encoding of Unicode determined by the current locale.
- Other implementations treat the byte-strings exchanged with the operating system as characters, i.e. belonging to the Latin-1 subset.
- The ForeignFunctionInterface specifies CString functions that perform locale-based conversion, but these are not yet provided by the Haskell implementations.
A disadvantage of using encodings is that some byte-strings may not be legal encodings, e.g. using a program argument as a filename may fail. Converting to String and back may also lose distinctions for some encodings. On the other hand, byte-strings are inappropriate if the underlying system uses a form of Unicode (e.g. recent Windows, and possibly more systems in the future). One way out would be to provide an abstract type for strings in O/S form. Again, the old character interface would remain useful for many.
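One possible shape for such an abstract type is sketched below. Everything here is an assumption made for illustration: the name OsString, its octet representation (the Unix case), and the Latin-1 helper functions are hypothetical and not part of any proposal text.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Hypothetical abstract type for a string in the operating system's own
-- form. This sketch wraps octets; an implementation on a UTF-16 system
-- could wrap 16-bit code units instead while keeping the same interface.
newtype OsString = OsString [Word8]
  deriving (Eq, Show)

-- | View the octets as Latin-1 characters. This direction is total:
-- every octet maps to the character with the same code.
toLatin1 :: OsString -> String
toLatin1 (OsString bs) = map (chr . fromIntegral) bs

-- | Build an OsString from Latin-1 text. This direction is partial:
-- characters above U+00FF have no Latin-1 encoding, so the result is Maybe.
fromLatin1 :: String -> Maybe OsString
fromLatin1 s
  | all ((< 0x100) . ord) s = Just (OsString (map (fromIntegral . ord) s))
  | otherwise               = Nothing
```

With a type like this, a program argument could be handed back to the system as a filename without ever being decoded, so an illegal or ambiguous encoding would only matter when the program actually needs the text.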
A Straw-Man Proposal
- I/O. All raw I/O is in terms of octets, i.e. Word8.
Pure functions exist to convert octets to and from any particular encoding:
    stringDecode :: Encoding -> [Word8] -> [Char]
    stringEncode :: Encoding -> [Char] -> [Word8]
The codecs must operate on strings, not individual characters, because some encodings use variable-length sequences of octets.
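To make the two signatures concrete, here is a minimal sketch of such a codec for two encodings. The Encoding constructors Latin1 and UTF8, the '?' substitution for unencodable characters, and the use of U+FFFD for malformed input are all assumptions of this sketch; the proposal itself leaves Encoding abstract and does not say how errors are reported.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Hypothetical encoding tag; only two encodings are covered here.
data Encoding = Latin1 | UTF8
  deriving (Eq, Show)

-- | Encode characters as octets. For Latin-1, characters above U+00FF
-- become '?' (an arbitrary choice made for this sketch).
stringEncode :: Encoding -> [Char] -> [Word8]
stringEncode Latin1 = map toOctet
  where
    toOctet c
      | ord c < 0x100 = fromIntegral (ord c)
      | otherwise     = 0x3F
stringEncode UTF8 = concatMap (encodeUtf8 . ord)

-- | UTF-8 uses one to four octets per code point (surrogates are not
-- rejected here, which a stricter encoder would do).
encodeUtf8 :: Int -> [Word8]
encodeUtf8 n
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Decode octets to characters. Malformed UTF-8 yields U+FFFD.
stringDecode :: Encoding -> [Word8] -> [Char]
stringDecode Latin1 = map (chr . fromIntegral)
stringDecode UTF8   = decodeUtf8

decodeUtf8 :: [Word8] -> [Char]
decodeUtf8 [] = []
decodeUtf8 (b : bs)
  | b < 0x80              = chr (fromIntegral b) : decodeUtf8 bs
  | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise             = replacement : decodeUtf8 bs
  where
    replacement = '\xFFFD'
    isCont c = c .&. 0xC0 == 0x80
    -- Read k continuation octets; reject truncated, overlong,
    -- out-of-range, and surrogate values, then resume after the lead octet.
    multi k lead minVal =
      let (cs, rest) = splitAt k bs
          value = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) lead cs
          ok = length cs == k && all isCont cs
               && value >= minVal && value <= 0x10FFFF
               && not (value >= 0xD800 && value <= 0xDFFF)
      in if ok then chr value : decodeUtf8 rest
               else replacement : decodeUtf8 bs
```

The decoder shows why the codec has to see the whole octet stream: the number of octets consumed for one character depends on the lead octet, so a per-character interface of type Word8 -> Char could not express it.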
- Efficiency. Semantically, character-based I/O is a simple composition of the raw I/O primitives with an encoding conversion function. However, for efficiency, an implementation might choose to provide certain encoded I/O operations primitively. If such primitives are exposed to the user, they should have standard names so that other implementations can provide the same functionality in pure Haskell Prime.
It may be possible to retain the traditional I/O signatures for
hGetChar, hPutChar, readFile, writeFile, etc., but only by introducing
a stateful notion of current encoding associated with each
individual handle. The default encoding could be inherited from the
operating system environment, but it should also be possible to
change the encoding explicitly.
    getIOEncoding   :: Handle -> IO Encoding
    setIOEncoding   :: Encoding -> Handle -> IO ()
    resetIOEncoding :: Handle -> IO ()   -- go back to default
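For comparison, GHC's System.IO exposes a per-handle encoding through hSetEncoding and hGetEncoding, and the three proposed operations can be sketched on top of it. This is an illustration only, under two assumptions: that Encoding is identified with GHC's TextEncoding, and that "the default" means localeEncoding. Note that getIOEncoding here returns a Maybe, unlike the signature above, because GHC reports no encoding for handles in binary mode.

```haskell
import System.IO
  ( Handle, TextEncoding, hGetEncoding, hSetEncoding, localeEncoding )

-- | Assumption of this sketch: the proposal's Encoding is GHC's TextEncoding.
type Encoding = TextEncoding

-- | Current encoding of a handle; Nothing for a binary-mode handle.
getIOEncoding :: Handle -> IO (Maybe Encoding)
getIOEncoding = hGetEncoding

-- | Same as hSetEncoding, with the argument order used in the proposal.
setIOEncoding :: Encoding -> Handle -> IO ()
setIOEncoding enc h = hSetEncoding h enc

-- | Go back to the default, taken here to be the locale's encoding.
resetIOEncoding :: Handle -> IO ()
resetIOEncoding h = hSetEncoding h localeEncoding
```

A program could then call, for example, setIOEncoding utf8 stdout (with utf8 imported from System.IO) before writing, and resetIOEncoding stdout afterwards.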
- Filenames, program arguments, environment.
- Filenames are stored in Haskell as [Char], but the operating system should receive [Word8] for any I/O using filenames. Some encoding conversion is therefore required. Usually, this will be platform-dependent, and so the actual encoding may be hidden from the programmer as part of the default locale.
- Program arguments, and symbols from the environment, are supplied by the operating system to the Haskell program as [Word8]. The program is responsible for conversion to [Char]. Again, there may be a default encoding chosen based on the locale.