text-icu-0.6.3.2: Bindings to the ICU library

Portability: GHC
Stability: experimental
Maintainer: bos@serpentine.com

Data.Text.ICU.Char

Description

Access to the Unicode Character Database, implemented as bindings to the International Components for Unicode (ICU) libraries.

Unicode assigns every code point (not just assigned characters) values for many properties. Most are simple boolean flags or constants from a small enumerated list; a few have more complex types.

For more information see "About the Unicode Character Database" http://www.unicode.org/ucd/ and the ICU User Guide chapter on Properties http://icu-project.org/userguide/properties.html.

Working with character properties

The property function provides the main view onto the Unicode Character Database. Because Unicode character properties have a variety of types, the property function is polymorphic. The type of its first argument dictates the type of its result, by use of the Property typeclass.

For instance, property Alphabetic returns a Bool, while property NFCQuickCheck returns a Maybe Bool.
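
As a minimal sketch of this polymorphism, the same property function can be specialised to two different result types. The helper names isAlphabetic and nfcQuickCheck are hypothetical, introduced only for illustration.

```haskell
import Data.Text.ICU.Char

-- Hypothetical helper: the Alphabetic identifier fixes the result to Bool.
isAlphabetic :: Char -> Bool
isAlphabetic = property Alphabetic

-- Hypothetical helper: the NFCQuickCheck identifier fixes the result
-- to Maybe Bool.
nfcQuickCheck :: Char -> Maybe Bool
nfcQuickCheck = property NFCQuickCheck

main :: IO ()
main = do
  print (isAlphabetic 'a')   -- True
  print (isAlphabetic '1')   -- False
  print (nfcQuickCheck 'a')
```

The type signatures are optional here; they are written out only to show which result type each identifier selects.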

Property identifier types

data Block_

Constructors

Block 

data Bool_

Constructors

Alphabetic 
ASCIIHexDigit

0-9, A-F, a-f

BidiControl

Format controls which have specific functions in the Bidi Algorithm.

BidiMirrored

Characters that may change display in RTL text.

Dash

Variations of dashes.

DefaultIgnorable

Ignorable in most processing.

Deprecated

The usage of deprecated characters is strongly discouraged.

Diacritic

Characters that linguistically modify the meaning of another character to which they apply.

Extender

Extend the value or shape of a preceding alphabetic character, e.g. length and iteration marks.

FullCompositionExclusion 
GraphemeBase

For programmatic determination of grapheme cluster boundaries.

GraphemeExtend

For programmatic determination of grapheme cluster boundaries.

GraphemeLink

For programmatic determination of grapheme cluster boundaries.

HexDigit

Characters commonly used for hexadecimal numbers.

Hyphen

Dashes used to mark connections between pieces of words, plus the Katakana middle dot.

IDContinue

Characters that can continue an identifier.

IDStart

Characters that can start an identifier.

Ideographic

CJKV ideographs.

IDSBinaryOperator

For programmatic determination of Ideographic Description Sequences.

IDSTrinaryOperator 
JoinControl

Format controls for cursive joining and ligation.

LogicalOrderException

Characters that do not use logical order and require special handling in most processing.

Lowercase 
Math 
NonCharacter

Code points that are explicitly defined as illegal for the encoding of characters.

QuotationMark 
Radical

For programmatic determination of Ideographic Description Sequences.

SoftDotted

Characters with a soft dot, like i or j. An accent placed on these characters causes the dot to disappear.

TerminalPunctuation

Punctuation characters that generally mark the end of textual units.

UnifiedIdeograph

For programmatic determination of Ideographic Description Sequences.

Uppercase 
WhiteSpace 
XidContinue

IDContinue modified to allow closure under normalization forms NFKC and NFKD.

XidStart

IDStart modified to allow closure under normalization forms NFKC and NFKD.

CaseSensitive

Characters that are either the source or the target of a case mapping. Not the same as the general category Cased_Letter.

STerm

Sentence Terminal. Used in UAX #29: Text Boundaries http://www.unicode.org/reports/tr29/.

VariationSelector

Indicates all characters that qualify as Variation Selectors. For details on the behavior of these characters, see http://unicode.org/Public/UNIDATA/StandardizedVariants.html and section 15.6, Variation Selectors, of the Unicode Standard.

NFDInert

ICU-specific property for characters that are inert under NFD, i.e. they do not interact with adjacent characters. Used for example in normalizing transforms in incremental mode to find the boundary of safely normalizable text despite possible text additions.

NFKDInert

ICU-specific property for characters that are inert under NFKD, i.e. they do not interact with adjacent characters.

NFCInert

ICU-specific property for characters that are inert under NFC, i.e. they do not interact with adjacent characters.

NFKCInert

ICU-specific property for characters that are inert under NFKC, i.e. they do not interact with adjacent characters.

SegmentStarter

ICU-specific property for characters that are starters in terms of Unicode normalization and combining character sequences.

PatternSyntax

See UAX #31 Identifier and Pattern Syntax http://www.unicode.org/reports/tr31/.

PatternWhiteSpace

See UAX #31 Identifier and Pattern Syntax http://www.unicode.org/reports/tr31/.

POSIXAlNum

Alphanumeric character class.

POSIXBlank

Blank character class.

POSIXGraph

Graph character class.

POSIXPrint

Printable character class.

POSIXXDigit

Hex digit character class.
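
The Bool_ constructors above can be combined through property to build simple character tests. The following is a rough sketch of an identifier check based on XidStart and XidContinue; the function isIdentifier is hypothetical, and real languages usually layer their own rules on top of these two properties.

```haskell
import Data.Text.ICU.Char

-- Hypothetical helper: a non-empty string whose first character has
-- XidStart and whose remaining characters all have XidContinue.
isIdentifier :: String -> Bool
isIdentifier []       = False
isIdentifier (c : cs) = property XidStart c && all (property XidContinue) cs

main :: IO ()
main = mapM_ (print . isIdentifier) ["x_1", "2fast", "_x"]
-- prints True, False, False
```

The underscore has XidContinue but not XidStart, which is why "_x" is rejected by this sketch.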

Combining class

Normalization checking

Text boundaries

Property value types

data BlockCode

Descriptions of Unicode blocks.

Constructors

NoBlock 
BasicLatin 
Latin1Supplement 
LatinExtendedA 
LatinExtendedB 
IPAExtensions 
SpacingModifierLetters 
CombiningDiacriticalMarks 
GreekAndCoptic 
Cyrillic 
Armenian 
Hebrew 
Arabic 
Syriac 
Thaana 
Devanagari 
Bengali 
Gurmukhi 
Gujarati 
Oriya 
Tamil 
Telugu 
Kannada 
Malayalam 
Sinhala 
Thai 
Lao 
Tibetan 
Myanmar 
Georgian 
HangulJamo 
Ethiopic 
Cherokee 
UnifiedCanadianAboriginalSyllabics 
Ogham 
Runic 
Khmer 
Mongolian 
LatinExtendedAdditional 
GreekExtended 
GeneralPunctuation 
SuperscriptsAndSubscripts 
CurrencySymbols 
CombiningDiacriticalMarksForSymbols 
LetterlikeSymbols 
NumberForms 
Arrows 
MathematicalOperators 
MiscellaneousTechnical 
ControlPictures 
OpticalCharacterRecognition 
EnclosedAlphanumerics 
BoxDrawing 
BlockElements 
GeometricShapes 
MiscellaneousSymbols 
Dingbats 
BraillePatterns 
CJKRadicalsSupplement 
KangxiRadicals 
IdeographicDescriptionCharacters 
CJKSymbolsAndPunctuation 
Hiragana 
Katakana 
Bopomofo 
HangulCompatibilityJamo 
Kanbun 
BopomofoExtended 
EnclosedCJKLettersAndMonths 
CJKCompatibility 
CJKUnifiedIdeographsExtensionA 
CJKUnifiedIdeographs 
YiSyllables 
YiRadicals 
HangulSyllables 
HighSurrogates 
HighPrivateUseSurrogates 
LowSurrogates 
PrivateUseArea 
CJKCompatibilityIdeographs 
AlphabeticPresentationForms 
ArabicPresentationFormsA 
CombiningHalfMarks 
CJKCompatibilityForms 
SmallFormVariants 
ArabicPresentationFormsB 
Specials 
HalfwidthAndFullwidthForms 
OldItalic 
Gothic 
Deseret 
ByzantineMusicalSymbols 
MusicalSymbols 
MathematicalAlphanumericSymbols 
CJKUnifiedIdeographsExtensionB 
CJKCompatibilityIdeographsSupplement 
Tags 
CyrillicSupplement 
Tagalog 
Hanunoo 
Buhid 
Tagbanwa 
MiscellaneousMathematicalSymbolsA 
SupplementalArrowsA 
SupplementalArrowsB 
MiscellaneousMathematicalSymbolsB 
SupplementalMathematicalOperators 
KatakanaPhoneticExtensions 
VariationSelectors 
SupplementaryPrivateUseAreaA 
SupplementaryPrivateUseAreaB 
Limbu 
TaiLe 
KhmerSymbols 
PhoneticExtensions 
MiscellaneousSymbolsAndArrows 
YijingHexagramSymbols 
LinearBSyllabary 
LinearBIdeograms 
AegeanNumbers 
Ugaritic 
Shavian 
Osmanya 
CypriotSyllabary 
TaiXuanJingSymbols 
VariationSelectorsSupplement 
AncientGreekMusicalNotation 
AncientGreekNumbers 
ArabicSupplement 
Buginese 
CJKStrokes 
CombiningDiacriticalMarksSupplement 
Coptic 
EthiopicExtended 
EthiopicSupplement 
GeorgianSupplement 
Glagolitic 
Kharoshthi 
ModifierToneLetters 
NewTaiLue 
OldPersian 
PhoneticExtensionsSupplement 
SupplementalPunctuation 
SylotiNagri 
Tifinagh 
VerticalForms 
N'Ko 
Balinese 
LatinExtendedC 
LatinExtendedD 
PhagsPa 
Phoenician 
Cuneiform 
CuneiformNumbersAndPunctuation 
CountingRodNumerals 
Sundanese 
Lepcha 
OlChiki 
CyrillicExtendedA 
Vai 
CyrillicExtendedB 
Saurashtra 
KayahLi 
Rejang 
Cham 
AncientSymbols 
PhaistosDisc 
Lycian 
Carian 
Lydian 
MahjongTiles 
DominoTiles 

Text boundaries

Functions

blockCode :: Char -> BlockCode

Return the Unicode allocation block that contains the given character.
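
A short usage sketch follows. It assumes that BlockCode has a Show instance, which this page does not state.

```haskell
import Data.Text.ICU.Char (blockCode)

main :: IO ()
main = do
  -- Assumes a Show instance for BlockCode.
  print (blockCode 'a')        -- BasicLatin
  print (blockCode '\x3b1')    -- GreekAndCoptic (U+03B1, Greek small alpha)
  print (blockCode '\x4e2d')   -- CJKUnifiedIdeographs (U+4E2D)
```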

charFullName :: Char -> String

Return the full name of a Unicode character.

Unlike charName, this function gives every Unicode code point a unique name: code points without a standard name receive an extended name. An extended name consists of a lowercase label followed by an uppercase hexadecimal number, enclosed in angle brackets.

charName :: Char -> String

Return the name of a Unicode character.

The names of all unassigned characters are empty.

The name contains only invariant characters like A-Z, 0-9, space, and '-'.
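
To compare the two naming functions, the sketch below prints both names for a named character and for a code point that is unassigned at the time of writing (U+0378). The helper describe is hypothetical.

```haskell
import Data.Text.ICU.Char (charFullName, charName)

-- Hypothetical helper: show the standard and the extended name side by side.
describe :: Char -> String
describe c =
  show c ++ ": name=" ++ show (charName c)
         ++ ", full=" ++ show (charFullName c)

main :: IO ()
main = do
  putStrLn (charName 'A')        -- LATIN CAPITAL LETTER A
  putStrLn (describe 'A')
  -- For an unassigned code point, charName is expected to be empty,
  -- while charFullName still yields a bracketed extended name.
  putStrLn (describe '\x378')
```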

charFromFullName :: String -> Maybe Char

Find a Unicode character by its full or extended name, and return its code point value.

The name is matched exactly and completely.

A Unicode 1.0 name is matched only if it differs from the modern name.

Unlike charFromName, this function also accepts extended names, so every Unicode code point can be looked up by a unique name. An extended name consists of a lowercase label followed by an uppercase hexadecimal number, enclosed in angle brackets.

charFromName :: String -> Maybe Char

Find a Unicode character by its full name, and return its code point value.

The name is matched exactly and completely.

A Unicode 1.0 name is matched only if it differs from the modern name. Unicode names are all uppercase.
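
The lookup functions are the inverse of the naming functions. A minimal sketch of a round trip, with a hypothetical helper roundTrips:

```haskell
import Data.Text.ICU.Char (charFromFullName, charFromName, charFullName, charName)

-- Hypothetical helper: does looking up a character's name give it back?
roundTrips :: Char -> Bool
roundTrips c = charFromName (charName c) == Just c

main :: IO ()
main = do
  print (charFromName "LATIN SMALL LETTER A")    -- Just 'a'
  print (charFromName "NO SUCH CHARACTER NAME")  -- Nothing
  print (roundTrips 'a')                         -- True
  -- The same round trip through the extended-name functions.
  print (charFromFullName (charFullName 'a'))
```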

direction :: Char -> Direction

Return the bidirectional category value for the code point, which is used in the Unicode bidirectional algorithm (UAX #9 http://www.unicode.org/reports/tr9/).
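
As a hedged sketch, direction can be used to spot strongly right-to-left characters. The Direction constructors are not listed on this page; the names RightToLeft and RightToLeftArabic, and the Eq instance used by elem, are assumptions about the library. The helper isRtl is hypothetical.

```haskell
import Data.Text.ICU.Char (Direction(..), direction)

-- Hypothetical helper. Assumes Direction has constructors RightToLeft and
-- RightToLeftArabic (bidi classes R and AL) and an Eq instance.
isRtl :: Char -> Bool
isRtl c = direction c `elem` [RightToLeft, RightToLeftArabic]

main :: IO ()
main = do
  print (isRtl 'a')        -- False (Latin letter, class L)
  print (isRtl '\x5d0')    -- True  (Hebrew alef, class R)
  print (isRtl '\x627')    -- True  (Arabic alef, class AL)
```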

property :: Property p v => p -> Char -> v

isoComment :: Char -> String

Return the ISO 10646 comment for a character.

If a character does not have an associated comment, the empty string is returned.

The ISO 10646 comment is an informative field in the Unicode Character Database (UnicodeData.txt field 11) and is from the ISO 10646 names list.

isMirrored :: Char -> Bool

Determine whether the code point has the BidiMirrored property. This property is set for characters that are commonly used in Right-To-Left contexts and need to be displayed with a mirrored glyph.
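
For example, a small sketch that picks out the mirrored characters of a string. The helper mirroredChars is hypothetical.

```haskell
import Data.Text.ICU.Char (isMirrored)

-- Hypothetical helper: keep only characters with the BidiMirrored property.
mirroredChars :: String -> String
mirroredChars = filter isMirrored

main :: IO ()
main = putStrLn (mirroredChars "f(x) < [y]")   -- ()<[]
```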

Conversion to numbers

digitToInt :: Char -> Maybe Int

Return the decimal digit value of a decimal digit character. Such characters have the general category Nd (decimal digit numbers) and a NumericType of NTDecimal.

No digit values are returned for any Han characters, because Han number characters are often used with a special Chinese-style number format (with characters for powers of 10 in between) instead of in decimal-positional notation. Unicode 4 explicitly assigns Han number characters a NumericType of NTNumeric instead of NTDecimal.
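
Because digitToInt covers decimal digits of every script, it can serve as the basis of a script-independent number parser. The following is only a sketch; parseDecimal is a hypothetical helper, not part of the library.

```haskell
import Data.Text.ICU.Char (digitToInt)

-- Hypothetical helper: parse a non-empty run of decimal digits (from any
-- script) as a positional base-10 number. Fails on any non-digit.
parseDecimal :: String -> Maybe Integer
parseDecimal [] = Nothing
parseDecimal cs = fmap (foldl step 0) (mapM digitToInt cs)
  where
    step acc d = acc * 10 + toInteger d

main :: IO ()
main = do
  print (parseDecimal "2024")          -- Just 2024
  print (parseDecimal "\x664\x662")    -- Just 42 (Arabic-Indic digits 4, 2)
  print (parseDecimal "12a")           -- Nothing
```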

numericValue :: Char -> Maybe Double

Return the numeric value for a Unicode code point as defined in the Unicode Character Database.

A Double return type is necessary because some numeric values are fractions, negative, or too large to fit in a fixed-width integral type.
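
A brief usage sketch, contrasting a decimal digit, a fraction, and a character with no numeric value:

```haskell
import Data.Text.ICU.Char (numericValue)

main :: IO ()
main = do
  print (numericValue '7')       -- Just 7.0
  print (numericValue '\xbd')    -- Just 0.5 (U+00BD, vulgar fraction one half)
  print (numericValue 'x')       -- Nothing
  -- A Han number character (U+4E09): digitToInt gives Nothing for it,
  -- but numericValue is expected to report its value.
  print (numericValue '\x4e09')
```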