Changes between Version 4 and Version 5 of SourceEncodingDetection
Timestamp: 08/27/06 03:01:03 (7 years ago)
SourceEncodingDetection
--- v4
+++ v5
@@ -4,13 +4,17 @@
 == Brief Explanation ==
 
-Haskell source code uses the Unicode character set. However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside ASCII range non-portable.
+Haskell source code uses the Unicode character set. However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside the ASCII range non-portable (see UnicodeInHaskellSource).
 
-This proposal outlines a detection heuristic sthat categorizes the source code as under UTF-8, UTF-16 or UTF-32. A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.
+This proposal outlines a detection heuristic that categorizes the source code as under UTF-8, UTF-16 or UTF-32. A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.
 
 This proposal does not cover user-specified source encoding.
 
+== References ==
+
+* [http://www.unicode.org/faq/utf_bom.html Unicode UTF and Byte Order Mark FAQ]
+
 == Proposal ==
 
-This heuristic suses at most 4 bytes from the byte representation of Haskell source code.
+This heuristic uses at most 4 bytes from the byte representation of Haskell source code.
 
 {{{
@@ -53,12 +57,12 @@
 }}}
 
-The heuristic shas the following properties:
+The heuristic has the following properties:
 * Byte-order mark is optional on all three encodings.
 * If present, byte-order-marks are consumed before lexical analysis.
-* Source code known to begin with the NULL ch racter is disallowed.
+* Source code known to begin with the NULL character is disallowed.
 
 Furthermore, as long as the first logical characters in the program is
-under codepoint 0xFF (the "ASCII/Latin1" range), this heuristic scan always
-gracefully handle two common class of text editor flaws:
+under codepoint 0xFF (the "ASCII/Latin1" range), this heuristic can always
+gracefully handle two common classes of text editor flaws:
 * Emitting byte-order mark for UTF-8 text.
 * Omitting byte-order mark for UTF-16 or UTF-32 text.
@@ -71,2 +75,3 @@
 * Mandating a minimum support for UTF-8/UTF-16 places an implementation burden on compiler writers.
 * Existing code relying on a non-UTF8, locale-/implementation-specific encoding will need conversion.
+(At present, this is mostly Latin-1.)
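The diff elides the proposal's detection table (the lines between {{{ and }}}), so what follows is only a minimal sketch of how a heuristic with the listed properties could look. It is not the proposal's own table. It is written in Python for brevity. The function name detect_encoding, the returned labels, and the order in which patterns are tried are all assumptions made for illustration. The BOM-less branch relies on the stated precondition that the first character is non-NUL and lies in the ASCII/Latin-1 range.

```python
from typing import Tuple


def detect_encoding(head: bytes) -> Tuple[str, int]:
    """Guess the encoding from at most the first 4 bytes of a source file.

    Returns (encoding label, number of BOM bytes to skip before lexing).
    Hypothetical helper: the name, labels, and ordering are assumptions.
    """
    b = head[:4]

    # Byte-order marks. The 4-byte UTF-32 marks are tried first because
    # FF FE 00 00 also begins with the UTF-16LE mark FF FE; reading it as
    # UTF-16LE would mean the source starts with NUL, which is disallowed.
    boms = [
        (b"\x00\x00\xfe\xff", "UTF-32BE"),
        (b"\xff\xfe\x00\x00", "UTF-32LE"),
        (b"\xef\xbb\xbf", "UTF-8"),
        (b"\xfe\xff", "UTF-16BE"),
        (b"\xff\xfe", "UTF-16LE"),
    ]
    for mark, name in boms:
        if b.startswith(mark):
            return name, len(mark)

    # No BOM: assume the first character is non-NUL and below U+0100,
    # so the position of the zero bytes reveals the code unit width
    # and the byte order.
    if len(b) == 4 and b[:3] == b"\x00\x00\x00" and b[3] != 0:
        return "UTF-32BE", 0
    if len(b) == 4 and b[0] != 0 and b[1:] == b"\x00\x00\x00":
        return "UTF-32LE", 0
    if len(b) >= 2 and b[0] == 0 and b[1] != 0:
        return "UTF-16BE", 0
    if len(b) >= 2 and b[0] != 0 and b[1] == 0:
        return "UTF-16LE", 0

    # Anything else is treated as UTF-8 without a BOM.
    return "UTF-8", 0
```

A caller would read the first four bytes of the file, call detect_encoding on them, skip the reported number of BOM bytes, and decode the rest with the reported encoding. In this sketch, rejecting input whose first decoded character is NUL is left to the caller.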
