Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
Synopsis
- type CodePoint = Char
- type CodeUnit = Word8
- newtype CodeUnitIndex = CodeUnitIndex {
- codeUnitIndex :: Int
- data Text = Text !Array !Int !Int
- fromByteList :: [Word8] -> Text
- isCaseInvariant :: Text -> Bool
- lengthUtf8 :: Text -> CodeUnitIndex
- lowerCodePoint :: Char -> Char
- unlowerCodePoint :: Char -> [Char]
- lowerUtf8 :: Text -> Text
- toLowerAscii :: Char -> Char
- unicode2utf8 :: (Ord a, Num a, Bits a) => a -> [a]
- unpackUtf8 :: Text -> [CodeUnit]
- decode2 :: CodeUnit -> CodeUnit -> CodePoint
- decode3 :: CodeUnit -> CodeUnit -> CodeUnit -> CodePoint
- decode4 :: CodeUnit -> CodeUnit -> CodeUnit -> CodeUnit -> CodePoint
- decodeUtf8 :: [CodeUnit] -> [CodePoint]
- indexCodeUnit :: Text -> CodeUnitIndex -> CodeUnit
- unsafeIndexCodePoint :: Text -> CodeUnitIndex -> (CodeUnitIndex, CodePoint)
- unsafeIndexCodeUnit :: Text -> CodeUnitIndex -> CodeUnit
- skipCodePointsBackwards :: Text -> CodeUnitIndex -> Int -> CodeUnitIndex
- unsafeCutUtf8 :: CodeUnitIndex -> CodeUnitIndex -> Text -> (Text, Text)
- unsafeSliceUtf8 :: CodeUnitIndex -> CodeUnitIndex -> Text -> Text
- arrayContents :: Array -> Ptr Word8
- isArrayPinned :: Array -> Bool
- unsafeIndexCodePoint' :: Array -> CodeUnitIndex -> (CodeUnitIndex, CodePoint)
- unsafeIndexCodeUnit' :: Array -> CodeUnitIndex -> CodeUnit
- data BackwardsIter = BackwardsIter {}
- unsafeIndexEndOfCodePoint' :: Array -> CodeUnitIndex -> BackwardsIter
- unsafeIndexAnywhereInCodePoint' :: Array -> CodeUnitIndex -> BackwardsIter
- concat :: [Text] -> Text
- dropWhile :: (Char -> Bool) -> Text -> Text
- isInfixOf :: Text -> Text -> Bool
- null :: Text -> Bool
- pack :: String -> Text
- replicate :: Int -> Text -> Text
- unpack :: Text -> String
- indices :: Text -> Text -> [Int]
Documentation
type CodeUnit = Word8 Source #
A UTF-8 code unit is a byte. A Unicode code point can be encoded as up to four code units.
newtype CodeUnitIndex Source #
An index into the raw UTF-8 data of a Text
. This is not the code point
index as conventionally accepted by Text
, so we wrap it to avoid confusing
the two. Incorrect index manipulation can lead to surrogate pairs being
sliced, so manipulate indices with care. This type is also used for lengths.
Instances
A space efficient, packed, unboxed Unicode text type.
Instances
FromJSON Text | |
Defined in Data.Aeson.Types.FromJSON | |
FromJSONKey Text | |
Defined in Data.Aeson.Types.FromJSON | |
ToJSON Text | |
ToJSONKey Text | |
Defined in Data.Aeson.Types.ToJSON | |
Hashable Text | |
Defined in Data.Hashable.Class | |
type Item Text | |
fromByteList :: [Word8] -> Text Source #
isCaseInvariant :: Text -> Bool Source #
Return whether text has exactly one case variation, such that this function will not return true when Aho–Corasick would differentiate when doing case-insensitive matching.
lengthUtf8 :: Text -> CodeUnitIndex Source #
The return value of this function is not really an index. However the signature is supposed to make it clear that the length is returned in terms of code units, not code points.
lowerCodePoint :: Char -> Char Source #
Lower-Case a UTF-8 codepoint.
Uses toLowerAscii
for ASCII and toLower
otherwise.
unlowerCodePoint :: Char -> [Char] Source #
toLowerAscii :: Char -> Char Source #
Lower-case the ASCII code points A-Z and leave the rest of ASCII intact.
unicode2utf8 :: (Ord a, Num a, Bits a) => a -> [a] Source #
Convert a Unicode Code Point c
into a list of UTF-8 code units (bytes).
unpackUtf8 :: Text -> [CodeUnit] Source #
Decoding
Functions that turns code unit sequences into code point sequences.
decode2 :: CodeUnit -> CodeUnit -> CodePoint Source #
Decode 2 UTF-8 code units into their code point. The given code units should have the following format:
┌───────────────┬───────────────┐ │1 1 0 x x x x x│1 0 x x x x x x│ └───────────────┴───────────────┘
decode3 :: CodeUnit -> CodeUnit -> CodeUnit -> CodePoint Source #
Decode 3 UTF-8 code units into their code point. The given code units should have the following format:
┌───────────────┬───────────────┬───────────────┐ │1 1 1 0 x x x x│1 0 x x x x x x│1 0 x x x x x x│ └───────────────┴───────────────┴───────────────┘
decode4 :: CodeUnit -> CodeUnit -> CodeUnit -> CodeUnit -> CodePoint Source #
Decode 4 UTF-8 code units into their code point. The given code units should have the following format:
┌───────────────┬───────────────┬───────────────┬───────────────┐ │1 1 1 1 0 x x x│1 0 x x x x x x│1 0 x x x x x x│1 0 x x x x x x│ └───────────────┴───────────────┴───────────────┴───────────────┘
decodeUtf8 :: [CodeUnit] -> [CodePoint] Source #
Decode a list of UTF-8 code units into a list of code points.
Indexing
Text
can be indexed by code units or code points.
A CodePoint
is a 21-bit Unicode code point and can consist of up to four code units.
A CodeUnit
is a single byte.
indexCodeUnit :: Text -> CodeUnitIndex -> CodeUnit Source #
Get the code unit at the given CodeUnitIndex
.
Performs bounds checking.
unsafeIndexCodePoint :: Text -> CodeUnitIndex -> (CodeUnitIndex, CodePoint) Source #
Does exactly the same thing as unsafeIndexCodePoint'
, but on Text
values.
unsafeIndexCodeUnit :: Text -> CodeUnitIndex -> CodeUnit Source #
skipCodePointsBackwards :: Text -> CodeUnitIndex -> Int -> CodeUnitIndex Source #
Scan backwards through the text until we've seen the specified number of codepoints. Assumes that the initial CodeUnitIndex is within a codepoint.
Slicing Functions
unsafeCutUtf8
and unsafeSliceUtf8
are used to retrieve slices of Text
values.
unsafeSliceUtf8 begin length
returns a substring of length length
starting at begin
.
unsafeSliceUtf8 begin length
returns a tuple of the "surrounding" substrings.
They satisfy the following property:
let (prefix, suffix) = unsafeCutUtf8 begin length t in concat [prefix, unsafeSliceUtf8 begin length t, suffix] == t
The following diagram visualizes the relevant offsets for begin = CodeUnitIndex 2
, length = CodeUnitIndex 6
and t = "BCDEFGHIJKL"
.
off off+len │ │ ▼ ▼ ──┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬── A│B│C│D│E│F│G│H│I│J│K│L│M│N ──┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴── ▲ ▲ │ │ off+begin off+begin+length unsafeSliceUtf8 begin length t == "DEFGHI" unsafeCutUtf8 begin length t == ("BC", "JKL")
The shown array is open at each end because in general, t
may be a slice as well.
WARNING: As their name implies, these functions are not (necessarily) bounds-checked. Use at your own risk.
:: CodeUnitIndex | Starting position of substring. |
-> CodeUnitIndex | Length of substring. |
-> Text | Initial string. |
-> (Text, Text) |
unsafeSliceUtf8 :: CodeUnitIndex -> CodeUnitIndex -> Text -> Text Source #
Functions on Arrays
Functions for working with Array
values.
arrayContents :: Array -> Ptr Word8 Source #
See byteArrayContents
.
isArrayPinned :: Array -> Bool Source #
See isByteArrayPinned
.
unsafeIndexCodePoint' :: Array -> CodeUnitIndex -> (CodeUnitIndex, CodePoint) Source #
Decode a code point at the given CodeUnitIndex
.
Returns garbage if there is no valid code point at that position.
Does not perform bounds checking.
See decode2
, decode3
and decode4
for the expected format of multi-byte code points.
unsafeIndexCodeUnit' :: Array -> CodeUnitIndex -> CodeUnit Source #
data BackwardsIter Source #
Intermediate state when you're iterating backwards through a Utf8 text.
BackwardsIter | |
|
unsafeIndexEndOfCodePoint' :: Array -> CodeUnitIndex -> BackwardsIter Source #
Similar to unsafeIndexCodePoint', but assumes that the given index is the end of a utf8 codepoint. It returns the decoded code point and the index _before_ the code point. The resulting index could be passed directly to unsafeIndexEndOfCodePoint' again to decode the _previous_ code point.
General Functions
Re-exported from Text
.