== A Straw-Man Proposal ==

 * '''Internal character representation.'''
   The Haskell type {{{Char}}} is UCS-4.
 * '''Haskell source encoding.'''
   * Introduce a pragma {{{{-# ENCODING e #-}}}} with a range of possible
     values for the encoding {{{e}}}. If the pragma is present, it must be
     at the beginning of the file. If it is absent, the file is
     encoded in Latin-1. Note that even when the pragma is present, some
     heuristic may be needed just to get as far as interpreting the
     encoding declaration, as in
     [http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-guessing XML].
     The fact that the first three characters must be {{{{-#}}} will be
     useful here.
   * A string literal may contain any character representable in the
     source encoding. In addition, escapes permit the specification of
     ''any'' Unicode character (whether or not it is otherwise
     representable in the source encoding).
   * An identifier may contain any Unicode alphanumeric or symbol
     characters from a defined range. Thus, a source text may not be
     representable in certain other encodings (notably ASCII).
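The heuristic for finding the pragma before the encoding is known could look like the following minimal sketch. It assumes only that the file begins with {{{{-#}}} and deliberately ignores byte-order marks; the {{{Family}}} type and {{{guessFamily}}} are hypothetical names, not part of the proposal.

```haskell
import Data.Word (Word8)

-- | Coarse encoding families that can be told apart from the first few
-- octets of a file starting with "{-#". Hypothetical, for illustration.
data Family
  = AsciiCompatible   -- ASCII, Latin-1, UTF-8 and similar
  | Utf16BE
  | Utf16LE
  | Utf32BE
  | Utf32LE
  deriving (Eq, Show)

-- | Guess the family from the leading octets.
-- '{' is 0x7B, '-' is 0x2D, '#' is 0x23.
guessFamily :: [Word8] -> Maybe Family
guessFamily octets = case octets of
  (0x00:0x00:0x00:0x7B:_) -> Just Utf32BE
  (0x7B:0x00:0x00:0x00:_) -> Just Utf32LE
  (0x00:0x7B:0x00:0x2D:_) -> Just Utf16BE
  (0x7B:0x00:0x2D:0x00:_) -> Just Utf16LE
  (0x7B:0x2D:0x23:_)      -> Just AsciiCompatible
  _                       -> Nothing   -- no pragma recognised
```

Once the family is known, the pragma itself can be read with any encoding of that family. A result of {{{Nothing}}} corresponds to the case where no pragma is present, for which the proposal specifies Latin-1.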
 * '''I/O.'''
   All raw I/O is in terms of octets, i.e. {{{Word8}}}.
 * '''Conversions.'''
   Pure functions exist to convert octets to and from any particular encoding:
   {{{
   stringDecode :: Encoding -> [Word8] -> [Char]
   stringEncode :: Encoding -> [Char] -> [Word8]
   }}}
   The codecs must operate on strings, not individual characters, because some
   encodings use variable-length sequences of octets.
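As an illustration of the two signatures, here is a minimal sketch covering only Latin-1 and UTF-8. The concrete {{{Encoding}}} constructors are an assumption (the proposal leaves the type abstract), as is the error policy: undecodable input becomes U+FFFD and characters that Latin-1 cannot represent become {{{?}}}. The decoder does not reject overlong forms or surrogates.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Hypothetical encoding type; the proposal does not fix its constructors.
data Encoding = Latin1 | UTF8
  deriving (Eq, Show)

stringEncode :: Encoding -> [Char] -> [Word8]
stringEncode Latin1 = map latin1
  where
    -- Assumption: characters outside Latin-1 are replaced by '?'.
    latin1 c | ord c < 0x100 = fromIntegral (ord c)
             | otherwise     = 0x3F
stringEncode UTF8 = concatMap (utf8 . ord)
  where
    utf8 n
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. hi 6, cont 0]
      | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        hi s   = fromIntegral (n `shiftR` s)
        cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

stringDecode :: Encoding -> [Word8] -> [Char]
stringDecode Latin1 = map (chr . fromIntegral)
stringDecode UTF8   = go
  where
    go [] = []
    go (b:bs)
      | b < 0x80  = chr (fromIntegral b) : go bs
      | b < 0xC0  = bad bs                      -- stray continuation octet
      | b < 0xE0  = multi 1 (b .&. 0x1F) bs
      | b < 0xF0  = multi 2 (b .&. 0x0F) bs
      | b < 0xF8  = multi 3 (b .&. 0x07) bs
      | otherwise = bad bs                      -- not a valid lead octet
    -- Consume n continuation octets after a lead octet.
    multi n lead bs =
      let (cs, rest) = splitAt n bs
          isCont c   = c .&. 0xC0 == 0x80
          step acc c = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)
          code       = foldl step (fromIntegral lead) cs :: Int
      in if length cs == n && all isCont cs && code <= 0x10FFFF
           then chr code : go rest
           else bad bs
    -- Assumption: malformed input yields U+FFFD and decoding resumes
    -- with the next octet.
    bad rest = '\xFFFD' : go rest
```

The UTF-8 case shows why the codecs work on whole strings: a single {{{Char}}} may expand to between one and four octets, so decoding cannot proceed one octet at a time without carrying state.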
 * '''Efficiency.'''
   Semantically, character-based I/O is a simple composition of the raw
   I/O primitives with an encoding conversion function. However, for
   efficiency, an implementation might choose to provide certain encoded
   I/O operations primitively. If such primitives are exposed to the
   user, they should have standard names so that other implementations can
   provide the same functionality in pure Haskell Prime.
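The composition described under Efficiency can be sketched with nothing but binary-mode handles from {{{System.IO}}}. All four function names below are hypothetical, and the codec is passed in as a plain function so the sketch does not depend on any particular {{{Encoding}}} type.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO

-- | Raw input: read a whole file as octets. On a binary-mode handle each
-- 'Char' returned by 'hGetContents' carries exactly one octet (0..255).
readOctets :: FilePath -> IO [Word8]
readOctets path = withBinaryFile path ReadMode $ \h -> do
  s <- hGetContents h
  let octets = map (fromIntegral . ord) s
  -- Force the lazy read to finish before the handle is closed.
  length octets `seq` return octets

-- | Raw output: write octets to a file unchanged.
writeOctets :: FilePath -> [Word8] -> IO ()
writeOctets path ws = withBinaryFile path WriteMode $ \h ->
  hPutStr h (map (chr . fromIntegral) ws)

-- | Character input is raw input followed by a decoder.
readFileWith :: ([Word8] -> [Char]) -> FilePath -> IO String
readFileWith dec path = fmap dec (readOctets path)

-- | Character output is an encoder followed by raw output.
writeFileWith :: ([Char] -> [Word8]) -> FilePath -> String -> IO ()
writeFileWith enc path = writeOctets path . enc
```

An implementation that offers a primitive encoded read would be expected to behave like {{{readFileWith}}} applied to the matching decoder, only faster.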
 * '''Locales.'''
   It may be possible to retain the traditional I/O signatures for
   {{{hGetChar}}}, {{{hPutChar}}}, {{{readFile}}}, {{{writeFile}}}, etc., but only by introducing
   a stateful notion of ''current encoding'' associated with each
   individual handle. The default encoding could be inherited from the
   operating system environment, but it should also be possible to
   change the encoding explicitly.
   {{{
   getIOEncoding   :: Handle -> IO Encoding
   setIOEncoding   :: Encoding -> Handle -> IO ()
   resetIOEncoding :: Handle -> IO ()    -- go back to default
   }}}
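For comparison, GHC's {{{System.IO}}} exposes a per-handle encoding through {{{hGetEncoding}}}, {{{hSetEncoding}}} and {{{localeEncoding}}}. The sketch below expresses the three proposed operations in terms of those functions. It is an analogy, not the proposal's implementation: {{{Encoding}}} is assumed to be GHC's {{{TextEncoding}}}, and the getter returns a {{{Maybe}}} because a binary-mode handle has no encoding.

```haskell
import System.IO

-- | Assumption: stand in for the proposal's 'Encoding' with GHC's type.
type Encoding = TextEncoding

-- | Current encoding of a handle; Nothing for binary-mode handles.
getIOEncoding :: Handle -> IO (Maybe Encoding)
getIOEncoding = hGetEncoding

-- | Change the encoding used for subsequent character I/O on the handle.
setIOEncoding :: Encoding -> Handle -> IO ()
setIOEncoding enc h = hSetEncoding h enc

-- | Go back to the default inherited from the OS environment.
resetIOEncoding :: Handle -> IO ()
resetIOEncoding h = hSetEncoding h localeEncoding
```

With these in place, {{{setIOEncoding utf8 stdout}}} followed by {{{putStrLn}}} writes UTF-8 regardless of the locale, while {{{hGetChar}}} and friends keep their traditional signatures.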
 * '''Filenames, program arguments, environment.'''
   * Filenames are stored in Haskell as {{{[Char]}}}, but the operating
     system should receive {{{[Word8]}}} for any I/O using filenames.
     Some encoding conversion is therefore required. Usually, this will
     be platform-dependent, and so the actual encoding may be hidden
     from the programmer as part of the default locale.
   * Program arguments and environment variables are supplied
     by the operating system to the Haskell program as {{{[Word8]}}}.
     The program is responsible for conversion to {{{[Char]}}}. Again,
     there may be a default encoding chosen based on the locale.
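The two directions of conversion at the operating-system boundary can be sketched as follows. The {{{Codec}}} record, {{{latin1Codec}}}, {{{decodeArgs}}} and {{{encodeFileName}}} are hypothetical names introduced for illustration; in practice the codec would be chosen from the default locale rather than hard-wired.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Both conversion directions for one encoding (hypothetical type).
data Codec = Codec
  { decode :: [Word8] -> [Char]
  , encode :: [Char] -> [Word8]
  }

-- | Latin-1 maps each octet to the code point with the same value.
latin1Codec :: Codec
latin1Codec = Codec
  { decode = map (chr . fromIntegral)
  , encode = map (fromIntegral . ord)  -- code points above 0xFF wrap modulo 256
  }

-- | Convert raw program arguments, as delivered by the OS, to strings.
decodeArgs :: Codec -> [[Word8]] -> [String]
decodeArgs codec = map (decode codec)

-- | Convert a Haskell file name to the octets handed to the OS.
encodeFileName :: Codec -> FilePath -> [Word8]
encodeFileName = encode
```

Under these assumptions, {{{decodeArgs latin1Codec}}} turns each argument into a {{{String}}}, and {{{encodeFileName latin1Codec}}} recovers the original octets for any name that stays within Latin-1.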