Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
Parser.Lathe.Encoding.UTF8
Description
Functions for parsing UTF-8.
Parsing of UTF-8 code points is broken down into two steps
to allow for full or partial validation of characters as opposed to full
Char
conversions.
The following is an example of parsing a UTF-8 code point into a character:
data Error = EoF | Malformed | Surrogate | … charUtf8 ::Parser
Error Char charUtf8 = do u <-unitUtf8
(\_ -> Malformed) EoF case u ofUTF8_1
u1 -> pure $!fromUtf8
u1UTF8_2
u2 ->contUtf8_2
Malformed EoF u2UTF8_3
u3 -> do p <-contUtf8_3
(\_ -> Malformed) EoF u3 ifisSurrogate
p then err Surrogate else pure $! fromUtf8 pUTF8_4
u4 ->contUtf8_4
(\_ -> Malformed) EoF u4
Synopsis
- utf8BOM :: e -> e -> Parser e ()
- newtype UTF8Unit (n :: Nat) = UTF8Unit Word8
- newtype UTF8Point (n :: Nat) = UTF8Point Word32
- fromUtf8 :: UTF8Point n -> Char
- isSurrogate :: UTF8Point 3 -> Bool
- data UTF8Branch
- data UTF8Error_1
- unitUtf8 :: (UTF8Error_1 -> e) -> e -> Parser e UTF8Branch
- contUtf8_2 :: e -> e -> UTF8Unit 2 -> Parser e Char
- skipUtf8_2 :: e -> e -> UTF8Unit 2 -> Parser e ()
- data UTF8Error_3
- contUtf8_3 :: (UTF8Error_3 -> e) -> e -> UTF8Unit 3 -> Parser e (UTF8Point 3)
- skipUtf8_3 :: (UTF8Error_3 -> e) -> e -> UTF8Unit 3 -> Parser e Bool
- data UTF8Error_4
- contUtf8_4 :: (UTF8Error_4 -> e) -> e -> UTF8Unit 4 -> Parser e Char
- skipUtf8_4 :: (UTF8Error_4 -> e) -> e -> UTF8Unit 4 -> Parser e ()
Byte-order mark
Arguments
:: e | Malformed. |
-> e | Reached end. |
-> Parser e () |
Consume 3 bytes that represent a UTF-8 byte-order mark.
UTF-8
newtype UTF8Unit (n :: Nat) Source #
First UTF-8 code unit.
n
represents the total number of code units in this code point.
isSurrogate :: UTF8Point 3 -> Bool Source #
Check whether a 3-unit code point lies in the surrogate range (U+D800
to U+DFFF
).
First code unit
data UTF8Branch Source #
UTF-8 branching based on the first code unit.
data UTF8Error_1 Source #
Errors that may be encountered when parsing the first UTF-8 code unit.
Constructors
UTF8Continuation | (byte 1 is |
UTF8Overlong_2 | (byte 1 is |
UTF8Overflow_1 | (byte 1 is |
UTF8Invalid | (byte 1 is |
Instances
Show UTF8Error_1 Source # | |
Defined in Parser.Lathe.Encoding.UTF8 Methods showsPrec :: Int -> UTF8Error_1 -> ShowS # show :: UTF8Error_1 -> String # showList :: [UTF8Error_1] -> ShowS # | |
Eq UTF8Error_1 Source # | |
Defined in Parser.Lathe.Encoding.UTF8 |
Arguments
:: (UTF8Error_1 -> e) | Malformed |
-> e | Reached end. |
-> Parser e UTF8Branch |
Consume 1 byte that represents the first code unit of a UTF-8 code point.
Second code unit
Arguments
:: e | (byte 2 is not |
-> e | Reached end. |
-> UTF8Unit 2 | |
-> Parser e Char |
Consume 1 byte that represents the second code unit of a 2-unit UTF-8 code point
and convert the two units into a Char
.
Arguments
:: e | (byte 2 is not |
-> e | Reached end. |
-> UTF8Unit 2 | |
-> Parser e () |
Consume 1 byte that represents the second code unit of a 2-unit UTF-8 code point.
Third code unit
data UTF8Error_3 Source #
Errors that may be encountered when parsing the third UTF-8 code unit.
Constructors
UTF8Incomplete_2_3 | (byte 2 is not |
UTF8Overlong_3 | (byte 1 is |
UTF8Incomplete_3_3 | (byte 3 is not |
Instances
Show UTF8Error_3 Source # | |
Defined in Parser.Lathe.Encoding.UTF8 Methods showsPrec :: Int -> UTF8Error_3 -> ShowS # show :: UTF8Error_3 -> String # showList :: [UTF8Error_3] -> ShowS # | |
Eq UTF8Error_3 Source # | |
Defined in Parser.Lathe.Encoding.UTF8 |
Arguments
:: (UTF8Error_3 -> e) | Malformed. |
-> e | Reached end. |
-> UTF8Unit 3 | |
-> Parser e (UTF8Point 3) |
Consume 2 bytes that represent the second and third code units of a 3-unit UTF-8 code point and convert the three units into a code point.
Arguments
:: (UTF8Error_3 -> e) | Malformed. |
-> e | Reached end. |
-> UTF8Unit 3 | |
-> Parser e Bool |
|
Consume 2 bytes that represent the second and third code units of a 3-unit UTF-8 code point.
Fourth code unit
data UTF8Error_4 Source #
Errors that may be encountered when parsing the fourth UTF-8 code unit.
Constructors
UTF8Incomplete_2_4 | (byte 2 is not |
UTF8Overlong_4 | (byte 1 is |
UTF8Overflow_2 | (byte 1 is |
UTF8Incomplete_3_4 | (byte 3 is not |
UTF8Incomplete_4_4 | (byte 4 is not |
Instances
Show UTF8Error_4 Source # | |
Defined in Parser.Lathe.Encoding.UTF8 Methods showsPrec :: Int -> UTF8Error_4 -> ShowS # show :: UTF8Error_4 -> String # showList :: [UTF8Error_4] -> ShowS # | |
Eq UTF8Error_4 Source # | |
Defined in Parser.Lathe.Encoding.UTF8 |
Arguments
:: (UTF8Error_4 -> e) | Malformed. |
-> e | Reached end. |
-> UTF8Unit 4 | |
-> Parser e Char |
Consume 3 bytes that represent the second to fourth code units of
a 4-unit UTF-8 code point and convert the four units into a Char
.
Arguments
:: (UTF8Error_4 -> e) | Malformed. |
-> e | Reached end. |
-> UTF8Unit 4 | |
-> Parser e () |
Consume 3 bytes that represent the second to fourth code units of a 4-unit UTF-8 code point.