-- |
-- Conversion combinator module structure is similar to the one found in the /text/ and /bytestring/ packages
-- and can be found nested under this module:
--
-- * "Data.TypedEncoding.Conv.Text"
-- * "Data.TypedEncoding.Conv.Text.Encoding"
-- * "Data.TypedEncoding.Conv.Text.Lazy"
-- * "Data.TypedEncoding.Conv.Text.Lazy.Encoding"
-- * "Data.TypedEncoding.Conv.ByteString.Char8"
-- * "Data.TypedEncoding.Conv.ByteString.Lazy.Char8"
--
-- The two goals of these conversions are:
--
-- * provide added type safety for string conversions
-- * provide a way to easily convert encoded data directly between /text/ and /bytestring/ types
--
--
-- == Type Safety
--
-- Haskell has 3 (5 counting lazy versions) popular string types; unfortunately, they are all quite different.
--
-- Consider these popular conversion functions:
--
-- @
-- import qualified Data.Text as T
-- import qualified Data.ByteString.Char8 as B8
-- import qualified Data.Text.Encoding as TE
--
-- T.pack        :: String -> T.Text
-- T.unpack      :: T.Text -> String
-- B8.pack       :: String -> B8.ByteString
-- B8.unpack     :: B8.ByteString -> String
-- TE.decodeUtf8 :: B8.ByteString -> T.Text
-- TE.encodeUtf8 :: T.Text -> B8.ByteString
-- @
--
-- They come in pairs but are not mutually inverse.
--
-- The result of going from one type to another depends on the path taken; neither of the following diagrams commutes:
--
-- @
-- String -> B8.pack -> ByteString
--   ^          ^           |
--   |          |      TE.decodeUtf8
--  id          |           |
--   |    TE.encodeUtf8     |
--   v          |           v
-- String -> T.pack  ->    Text
-- @
--
-- @
-- String <- B8.unpack <- ByteString
--   ^          ^            |
--   |          |       TE.decodeUtf8
--  id          |            |
--   |    TE.encodeUtf8      |
--   v          |            v
-- String <- T.unpack  <-   Text
-- @
--
-- All of this can lead to bugs that are hard to find and hard to troubleshoot.
--
-- /typed-encoding/ provides more precise types so that all of this goes away.
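--
-- To see concretely why these diagrams fail to commute: @B8.pack@ truncates each 'Char' to its
-- low 8 bits (documented /bytestring/ behavior), so any character above @\'\255\'@ is silently mangled.
-- A small GHCi sketch (the example character is arbitrary):
--
-- @
-- ghci> T.unpack (T.pack "ą") == "ą"
-- True
-- ghci> T.unpack (TE.decodeUtf8 (B8.pack "ą")) == "ą"
-- False
-- @
--
-- The route through @T.pack@ preserves the payload, while the route through @B8.pack@ does not.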
--
-- Here are the type signatures, simplified to a single encoding annotation:
--
-- @
-- import qualified Data.TypedEncoding.Conv.Text as ET
-- import qualified Data.TypedEncoding.Conv.ByteString.Char8 as EB8
-- import qualified Data.TypedEncoding.Conv.Text.Encoding as ETE
--
-- ET.pack        :: (Superset "r-UNICODE.D76" r) => Enc '[r] c String -> Enc '[r] c T.Text
-- ET.unpack      :: (Superset "r-UNICODE.D76" r) => Enc '[r] c T.Text -> Enc '[r] c String
-- EB8.pack       :: (Superset "r-CHAR8" r) => Enc '[r] c String -> Enc '[r] c B8.ByteString
-- EB8.unpack     :: (Superset "r-CHAR8" r) => Enc '[r] c B8.ByteString -> Enc '[r] c String
-- ETE.decodeUtf8 :: (Superset "r-UTF8" r) => Enc '[r] c B8.ByteString -> Enc '[r] c T.Text
-- ETE.encodeUtf8 :: (Superset "r-UTF8" r) => Enc '[r] c T.Text -> Enc '[r] c B8.ByteString
-- @
--
-- @"r-UNICODE.D76"@ and @"r-UTF8"@ are considered redundant for @T.Text@ and can be added or dropped as needed.
--
-- (This library currently assumes that @"r-UTF8"@ includes the UNICODE.D76 restriction. This works well with
-- assumptions made by 'T.Text'.)
--
-- Corresponding pairs are mutually inverse; this should be clear since the types are restricted to what @T.Text@
-- can store or to how @B8.ByteString@ treats 'Char' values.
--
-- Now consider any of the above diagrams; for instance, compare
--
-- @
-- ETE.encodeUtf8 . ET.pack :: (Superset "r-UNICODE.D76" r, Superset "r-UTF8" r) => Enc '[r] c String -> Enc '[r] c B8.ByteString
--
-- -- and
-- EB8.pack :: (Superset "r-CHAR8" r) => Enc '[r] c String -> Enc '[r] c B8.ByteString
-- @
--
-- What is the set of common values that allows us to use either of these two options?
--
-- @"r-UNICODE.D76"@ is not important here (it removes a range of Unicode values way above @\'\255\'@); what is the intersection of the /UTF8/ and /CHAR8/ code point spaces?
--
-- There are many character set encodings that use one byte (/CHAR8/), and /UTF8/ is different from all of them
-- but backward compatible within the /ASCII/ range of chars @< 128@.
-- So the intersection should be /ASCII/; let us check that:
--
-- @
-- ghci> :t ETE.encodeUtf8 . ET.pack \@'["r-ASCII"]
-- ETE.encodeUtf8 . ET.pack \@'["r-ASCII"]
--   :: Enc [Symbol] ((':) Symbol "r-ASCII" ('[] Symbol)) c String
--      -> Enc [Symbol] ((':) Symbol "r-ASCII" ('[] Symbol)) c B8.ByteString
--
-- ghci> :t EB8.pack \@'["r-ASCII"]
-- EB8.pack \@'["r-ASCII"]
--   :: Enc [Symbol] ((':) Symbol "r-ASCII" ('[] Symbol)) c String
--      -> Enc [Symbol] ((':) Symbol "r-ASCII" ('[] Symbol)) c B8.ByteString
-- @
--
-- They both accept that common denominator.
-- Now we could run a property test, but it is clear that by design these will match!
--
-- Note that there is no @Superset "r-UNICODE.D76" "r-CHAR8"@ mapping; @"r-CHAR8"@ supersets any
-- 8-bit encoding like the /ISO\/IEC 8859/ family of encodings. This is by design, even if structurally such a
-- definition would make sense.
--
-- This choice effectively prevents anything classified under @"r-CHAR8"@
-- from ending up as a visible encoding annotation in @Text@ (which would make little sense, as @Text@ is /UTF/ encoded).
-- This is just one example of the added type-level safety that /typed-encoding/ provides.
--
-- Currently, @"r-CHAR8"@ is intended as an upper bound on @"r-"@ encodings only. There is no way
-- to encode to it using the provided encoding mechanisms (except for /unsafe/ options).
-- Effectively, the types
--
-- @
-- Enc "r-CHAR8" c str
-- @
--
-- can be viewed as uninhabited.
--
-- However, @Char@ is often used instead of @Word8@
-- for low-level @ByteString@ programming. This is supported with the @"r-ByteRep"@ annotation.
--
-- @
-- Enc "r-ByteRep" c str
-- @
--
-- This one can be used as @Superset "r-CHAR8" "r-ByteRep"@!
-- That allows the @EB8@ conversions to work on such data.
-- However, there is no @Superset "r-UNICODE.D76" "r-ByteRep"@, so these cannot be converted to @Text@, which is
-- exactly what is intended.
--
--
-- == @Enc@ conversions
--
-- Consider defining a conversion function @:: Enc xs c str1 -> f (Enc xs c str2)@.
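--
-- For a hypothetical illustration of this shape, with @f ~ Either UnicodeException@
-- (the exception type from /text/) such a function could look like
--
-- @
-- byteString2Text :: Enc xs c B8.ByteString -> Either UnicodeException (Enc xs c T.Text)
-- @
--
-- where the name @byteString2Text@ is made up here for illustration and is not part of the library.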
--
-- One challenge is: how do we know that @xs@ is a valid encoding stack for @str2@ as well?
-- Should we constrain that?
--
-- This is made even more difficult because this library (necessarily) uses orphan instances.
--
-- The other challenge is how to ensure that, if the destination is partially or fully decoded,
-- it will decode without errors and the decoding will be meaningful.
--
-- The current version does not impose any instance constraint about the existence of the stack for @str2@.
-- It is possible not to have one; in that case the @decodeAll@ combinators will not be available.
--
-- This is still useful, as the payload can be safely extracted, e.g. to save to a database or to do other things with it.
--
-- Future versions of /typed-encoding/ may provide ways to ensure the validity of the encoding stack for @str2@.
module Data.TypedEncoding.Conv where