Z-Data-0.6.1.0: Array, vector and text

Z.Data.Text.Base

Description

A Text wrap a Bytes which will be interpreted using UTF-8 encoding. User should always use validate to construt a Text (instead of using construtor directly or coercing), otherwise illegal UTF-8 encoded codepoints will cause undefined behaviours.

Synopsis

# Text type

newtype Text Source #

Text represented as UTF-8 encoded Bytes

Constructors

 Text FieldsgetUTF8Bytes :: BytesExtract UTF-8 encoded Bytes from Text

#### Instances

Instances details
 Source # Instance detailsDefined in Z.Data.Text.Base Associated Typestype Item Text # MethodsfromList :: [Item Text] -> Text #fromListN :: Int -> [Item Text] -> Text #toList :: Text -> [Item Text] # Source # Instance detailsDefined in Z.Data.Text.Base Methods(==) :: Text -> Text -> Bool #(/=) :: Text -> Text -> Bool # Source # Instance detailsDefined in Z.Data.Text.Base Methodscompare :: Text -> Text -> Ordering #(<) :: Text -> Text -> Bool #(<=) :: Text -> Text -> Bool #(>) :: Text -> Text -> Bool #(>=) :: Text -> Text -> Bool #max :: Text -> Text -> Text #min :: Text -> Text -> Text # Source # Accepted syntax and escaping rules are same with String, which is different from Show instance. Instance detailsDefined in Z.Data.Text.Base Methods Source # The escaping rules is different from String 's Show instance, see "Z.Data.Text.Builder.escapeTextJSON" Instance detailsDefined in Z.Data.Text.Base MethodsshowsPrec :: Int -> Text -> ShowS #show :: Text -> String #showList :: [Text] -> ShowS # Source # Instance detailsDefined in Z.Data.Text.Base Methods Source # Instance detailsDefined in Z.Data.Text.Base Methods(<>) :: Text -> Text -> Text #stimes :: Integral b => b -> Text -> Text # Source # Instance detailsDefined in Z.Data.Text.Base Methodsmappend :: Text -> Text -> Text #mconcat :: [Text] -> Text # Source # Instance detailsDefined in Z.Data.Text.Base Methodsshrink :: Text -> [Text] # Source # Instance detailsDefined in Z.Data.Text.Base Methodscoarbitrary :: Text -> Gen b -> Gen b # Source # case fold with default locale. Instance detailsDefined in Z.Data.Text.Base MethodsfoldCaseList :: [Text] -> [Text] Source # Instance detailsDefined in Z.Data.Text.Base Methodsrnf :: Text -> () # Source # Instance detailsDefined in Z.Data.Text.Base MethodshashWithSalt :: Int -> Text -> Int #hash :: Text -> Int # Source # The escaping rules is same with Show instance: we reuse JSON escaping rules here, so it will be faster. Instance detailsDefined in Z.Data.Text.Print Methods Source # Instance detailsDefined in Z.Data.JSON.Base Methods JSON a => JSON (Map Text a) Source # Instance detailsDefined in Z.Data.JSON.Base MethodsencodeJSON :: Map Text a -> Builder () Source # JSON a => JSON (HashMap Text a) Source # default instance prefer later key Instance detailsDefined in Z.Data.JSON.Base Methods JSON a => JSON (FlatMap Text a) Source # default instance prefer later key Instance detailsDefined in Z.Data.JSON.Base Methods type Item Text Source # Instance detailsDefined in Z.Data.Text.Base type Item Text = Char

# Building text

O(n) Validate a sequence of bytes is UTF-8 encoded.

Throw InvalidUTF8Exception in case of invalid codepoint.

O(n) Validate a sequence of bytes is all ascii char byte(<128).

Throw InvalidASCIIException in case of invalid byte, It's not always faster than validate, use it only if you want to validate ASCII char sequences.

O(n) Validate a sequence of bytes is UTF-8 encoded.

Return Nothing in case of invalid codepoint.

O(n) Validate a sequence of bytes is all ascii char byte(<128).

Return Nothing in case of invalid byte.

O(n) Get the nth codepoint from Text, throw IndexOutOfTextRange when out of bound.

O(n) Get the nth codepoint from Text.

O(n) Find the nth codepoint's byte index (pointing to the nth char's begining byte).

The index is only meaningful to the whole byte slice, if there's less than n codepoints, the index will point to next byte after the end.

O(n) Get the nth codepoint from Text counting from the end, throw IndexOutOfVectorRange n callStack when out of bound.

O(n) Get the nth codepoint from Text counting from the end.

O(n) Find the nth codepoint's byte index from the end (pointing to the previous char's ending byte).

The index is only meaningful to the whole byte slice, if there's less than n codepoints, the index will point to previous byte before the start.

# Basic creating

O(1). Empty text.

O(1). Single char text.

O(n). Copy a text from slice.

O(n) replicate char n time.

cycleN :: Int -> Text -> Text Source #

O(n*m) cycleN a text n times.

# Conversion between list

O(n) Convert a string into a text

Alias for packN defaultInitSize, will be rewritten to a memcpy if possible.

O(n) Convert a list into a text with an approximate size(in bytes, not codepoints).

If the encoded bytes length is larger than the size given, we simply double the buffer size and continue building.

This function is a good consumer in the sense of build/foldr fusion.

O(n) Alias for packRN defaultInitSize.

O(n) packN in reverse order.

This function is a good consumer in the sense of build/foldr fusion.

O(n) Convert text to a char list.

Unpacking is done lazily. i.e. we will retain reference to the array until all element are consumed.

This function is a good producer in the sense of build/foldr fusion.

O(n) Convert text to a list in reverse order.

This function is a good producer in the sense of build/foldr fusion.

# Conversion between codepoint vector

O(n) convert from a char vector.

O(n) convert to a char vector.

# Basic interface

O(1) Test whether a text is empty.

O(n) The char length of a text.

O(m+n)

There's no need to guard empty vector because we guard them for you, so appending empty text are no-ops.

map' :: (Char -> Char) -> Text -> Text Source #

O(n) map f t is the Text obtained by applying f to each char of t. Performs replacement on invalid scalar values.

imap' :: (Int -> Char -> Char) -> Text -> Text Source #

Strict mapping with index.

foldl' :: (b -> Char -> b) -> b -> Text -> b Source #

Strict left to right fold.

ifoldl' :: (b -> Int -> Char -> b) -> b -> Text -> b Source #

Strict left to right fold with index.

foldr' :: (Char -> b -> b) -> b -> Text -> b Source #

Strict right to left fold

ifoldr' :: (Int -> Char -> b -> b) -> b -> Text -> b Source #

Strict right to left fold with index

NOTE: the index is counting from 0, not backwards

concat :: [Text] -> Text Source #

O(n) Concatenate a list of text.

Note: concat have to force the entire list to filter out empty text and calculate the length for allocation.

concatMap :: (Char -> Text) -> Text -> Text Source #

Map a function over a text and concatenate the results

## Special folds

count :: Char -> Text -> Int Source #

O(n) count returns count of an element from a text.

all :: (Char -> Bool) -> Text -> Bool Source #

O(n) Applied to a predicate and text, all determines if all chars of the text satisfy the predicate.

any :: (Char -> Bool) -> Text -> Bool Source #

O(n) Applied to a predicate and a text, any determines if any chars of the text satisfy the predicate.

## Text display width

Get the display width of a piece of text.

You shouldn't pass texts with control characters(<0x20, \DEL), which are counted with -1 width.

>>> displayWidth "你好世界！"
>>> 10
>>> displayWidth "hello world!"
>>> 12


## normalization

Constructors

 NormalizedYes NormalizedMaybe NormalizedNo

#### Instances

Instances details
 Source # Instance detailsDefined in Z.Data.Text.UTF8Rewind Methods Source # Instance detailsDefined in Z.Data.Text.UTF8Rewind Methods Source # Instance detailsDefined in Z.Data.Text.UTF8Rewind Methods Source # Instance detailsDefined in Z.Data.Text.UTF8Rewind Associated Typestype Rep NormalizationResult :: Type -> Type # Methods Source # Instance detailsDefined in Z.Data.Text.UTF8Rewind type Rep NormalizationResult = D1 ('MetaData "NormalizationResult" "Z.Data.Text.UTF8Rewind" "Z-Data-0.6.1.0-JeXatJjxwtL6zC0T0lLei4" 'False) (C1 ('MetaCons "NormalizedYes" 'PrefixI 'False) (U1 :: Type -> Type) :+: (C1 ('MetaCons "NormalizedMaybe" 'PrefixI 'False) (U1 :: Type -> Type) :+: C1 ('MetaCons "NormalizedNo" 'PrefixI 'False) (U1 :: Type -> Type)))

These are the Unicode Normalization Forms:

Form                         | Description
---------------------------- | ---------------------------------------------
Normalization Form D (NFD)   | Canonical decomposition
Normalization Form C (NFC)   | Canonical decomposition, followed by canonical composition
Normalization Form KD (NFKD) | Compatibility decomposition
Normalization Form KC (NFKC) | Compatibility decomposition, followed by canonical composition


Constructors

 NFC NFKC NFD NFKD

#### Instances

Instances details
 Source # Instance detailsDefined in Z.Data.Text.UTF8Rewind Methods Source # Instance detailsDefined in Z.Data.Text.UTF8Rewind Methods Source # Instance detailsDefined in Z.Data.Text.UTF8Rewind MethodsshowList :: [NormalizeMode] -> ShowS # Source # Instance detailsDefined in Z.Data.Text.UTF8Rewind Associated Typestype Rep NormalizeMode :: Type -> Type # Methods type Rep NormalizeMode Source # Instance detailsDefined in Z.Data.Text.UTF8Rewind type Rep NormalizeMode = D1 ('MetaData "NormalizeMode" "Z.Data.Text.UTF8Rewind" "Z-Data-0.6.1.0-JeXatJjxwtL6zC0T0lLei4" 'False) ((C1 ('MetaCons "NFC" 'PrefixI 'False) (U1 :: Type -> Type) :+: C1 ('MetaCons "NFKC" 'PrefixI 'False) (U1 :: Type -> Type)) :+: (C1 ('MetaCons "NFD" 'PrefixI 'False) (U1 :: Type -> Type) :+: C1 ('MetaCons "NFKD" 'PrefixI 'False) (U1 :: Type -> Type)))

Check if a string is stable in the NFC (Normalization Form C).

Check if a string is stable in the specified Unicode Normalization Form.

This function can be used as a preprocessing step, before attempting to normalize a string. Normalization is a very expensive process, it is often cheaper to first determine if the string is unstable in the requested normalization form.

The result of the check will be YES if the string is stable and MAYBE or NO if it is unstable. If the result is MAYBE, the string does not necessarily have to be normalized.

Normalize a string to NFC (Normalization Form C).

Normalize a string to the specified Unicode Normalization Form.

The Unicode standard defines two standards for equivalence between characters: canonical and compatibility equivalence. Canonically equivalent characters and sequence represent the same abstract character and must be rendered with the same appearance and behavior. Compatibility equivalent characters have a weaker equivalence and may be rendered differently.

Unicode Normalization Forms are formally defined standards that can be used to test whether any two strings of characters are equivalent to each other. This equivalence may be canonical or compatibility.

The algorithm puts all combining marks into a specified order and uses the rules for decomposition and composition to transform the string into one of four Unicode Normalization Forms. A binary comparison can then be used to determine equivalence.

## Case conversion

Get environment locale

Remove case distinction from UTF-8 encoded text with default locale.

Remove case distinction from UTF-8 encoded text.

Case folding is the process of eliminating differences between code points concerning case mapping. It is most commonly used for comparing strings in a case-insensitive manner. Conversion is fully compliant with the Unicode 7.0 standard.

Although similar to lowercasing text, there are significant differences. For one, case folding does _not_ take locale into account when converting. In some cases, case folding can be up to 20% faster than lowercasing the same text, but the result cannot be treated as correct lowercased text.

Only two locale-specific exception are made when case folding text. In Turkish, U+0049 LATIN CAPITAL LETTER I maps to U+0131 LATIN SMALL LETTER DOTLESS I and U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE maps to U+0069 LATIN SMALL LETTER I.

Although most code points can be case folded without changing length, there are notable exceptions. For example, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) maps to "U+0069 U+0307" (LATIN SMALL LETTER I and COMBINING DOT ABOVE) when converted to lowercase.

Only a handful of scripts make a distinction between upper- and lowercase. In addition to modern scripts, such as Latin, Greek, Armenian and Cyrillic, a few historic or archaic scripts have case. The vast majority of scripts do not have case distinctions.

Convert UTF-8 encoded text to lowercase with default locale.

Convert UTF-8 encoded text to lowercase.

This function allows conversion of UTF-8 encoded strings to lowercase without first changing the encoding to UTF-32. Conversion is fully compliant with the Unicode 7.0 standard.

Although most code points can be converted to lowercase with changing length, there are notable exceptions. For example, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) maps to "U+0069 U+0307" (LATIN SMALL LETTER I and COMBINING DOT ABOVE) when converted to lowercase.

Only a handful of scripts make a distinction between upper- and lowercase. In addition to modern scripts, such as Latin, Greek, Armenian and Cyrillic, a few historic or archaic scripts have case. The vast majority of scripts do not have case distinctions.

Case mapping is not reversible. That is, toUpper(toLower(x)) != toLower(toUpper(x)).

Certain code points (or combinations of code points) apply rules based on the locale. For more information about these exceptional code points, please refer to the Unicode standard: ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt

Convert UTF-8 encoded text to uppercase with default locale.

Convert UTF-8 encoded text to uppercase.

Conversion is fully compliant with the Unicode 7.0 standard.

Although most code points can be converted without changing length, there are notable exceptions. For example, U+00DF (LATIN SMALL LETTER SHARP S) maps to "U+0053 U+0053" (LATIN CAPITAL LETTER S and LATIN CAPITAL LETTER S) when converted to uppercase.

Only a handful of scripts make a distinction between upper and lowercase. In addition to modern scripts, such as Latin, Greek, Armenian and Cyrillic, a few historic or archaic scripts have case. The vast majority of scripts do not have case distinctions.

Case mapping is not reversible. That is, toUpper(toLower(x)) != toLower(toUpper(x)).

Certain code points (or combinations of code points) apply rules based on the locale. For more information about these exceptional code points, please refer to the Unicode standard: ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt

Convert UTF-8 encoded text to titlecase with default locale.

Convert UTF-8 encoded text to titlecase.

This function allows conversion of UTF-8 encoded strings to titlecase. Conversion is fully compliant with the Unicode 7.0 standard.

Titlecase requires a bit more explanation than uppercase and lowercase, because it is not a common text transformation. Titlecase uses uppercase for the first letter of each word and lowercase for the rest. Words are defined as "collections of code points with general category Lu, Ll, Lt, Lm or Lo according to the Unicode database".

Effectively, any type of punctuation can break up a word, even if this is not grammatically valid. This happens because the titlecasing algorithm does not and cannot take grammar rules into account.

Text                                 | Titlecase
-------------------------------------|-------------------------------------
The running man                      | The Running Man
NATO Alliance                        | Nato Alliance
You're amazing at building libraries | You'Re Amazing At Building Libraries


Although most code points can be converted to titlecase without changing length, there are notable exceptions. For example, U+00DF (LATIN SMALL LETTER SHARP S) maps to "U+0053 U+0073" (LATIN CAPITAL LETTER S and LATIN SMALL LETTER S) when converted to titlecase.

Certain code points (or combinations of code points) apply rules based on the locale. For more information about these exceptional code points, please refer to the Unicode standard: ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt

## Unicode category

Check if the input string conforms to the category specified by the flags.

This function can be used to check if the code points in a string are part of a category. Valid flags are members of the "list of categories". The category for a code point is defined as part of the entry in UnicodeData.txt, the data file for the Unicode code point database.

By default, the function will treat grapheme clusters as a single code point. This means that the following string:

Code point | Canonical combining class | General category      | Name
---------- | ------------------------- | --------------------- | ----------------------
U+0045     | 0                         | Lu (Uppercase letter) | LATIN CAPITAL LETTER E
U+0300     | 230                       | Mn (Non-spacing mark) | COMBINING GRAVE ACCENT


Will match with CategoryLetterUppercase in its entirety, because the COMBINING GRAVE ACCENT is treated as part of the grapheme cluster. This is useful when e.g. creating a text parser, because you do not have to normalize the text first.

If this is undesired behavior, specify the CategoryIgnoreGraphemeCluster flag.

In order to maintain backwards compatibility with POSIX functions like isdigit and isspace, compatibility flags have been provided. Note, however, that the result is only guaranteed to be correct for code points in the Basic Latin range, between U+0000 and 0+007F. Combining a compatibility flag with a regular category flag will result in undefined behavior.

spanCategory :: Category -> Text -> (Text, Text) Source #

Try to match as many code points with the matching category flags as possible and return the prefix and suffix.

# Constants

## Locale

type Locale = CSize Source #

Locale for case mapping.

## Category

type Category = CSize Source #

Unicode categories.

See isCategory, you can combine categories with bitwise or.

# Misc

Constructors

 InvalidUTF8Exception CallStack InvalidASCIIException CallStack IndexOutOfTextRange Int CallStack first payload is invalid char index EmptyText CallStack

#### Instances

Instances details
 Source # Instance detailsDefined in Z.Data.Text.Base MethodsshowList :: [TextException] -> ShowS # Source # Instance detailsDefined in Z.Data.Text.Base Methods