| Portability | GHC | 
|---|---|
| Stability | experimental | 
| Maintainer | bos@serpentine.com | 
| Safe Haskell | Safe-Inferred | 
Data.Text.ICU.Char
Description
Access to the Unicode Character Database, implemented as bindings to the International Components for Unicode (ICU) libraries.
Unicode assigns values for many properties to every code point, not just to assigned characters. Most properties are simple boolean flags or constants from a small enumerated list; a few have values of more complex types.
For more information see "About the Unicode Character Database" http://www.unicode.org/ucd/ and the ICU User Guide chapter on Properties http://icu-project.org/userguide/properties.html.
- class Property p v | p -> v
- data BidiClass_ = BidiClass
- data Block_ = Block
- data Bool_ = Alphabetic
- | ASCIIHexDigit
- | BidiControl
- | BidiMirrored
- | Dash
- | DefaultIgnorable
- | Deprecated
- | Diacritic
- | Extender
- | FullCompositionExclusion
- | GraphemeBase
- | GraphemeExtend
- | GraphemeLink
- | HexDigit
- | Hyphen
- | IDContinue
- | IDStart
- | Ideographic
- | IDSBinaryOperator
- | IDSTrinaryOperator
- | JoinControl
- | LogicalOrderException
- | Lowercase
- | Math
- | NonCharacter
- | QuotationMark
- | Radical
- | SoftDotted
- | TerminalPunctuation
- | UnifiedIdeograph
- | Uppercase
- | WhiteSpace
- | XidContinue
- | XidStart
- | CaseSensitive
- | STerm
- | VariationSelector
- | NFDInert
- | NFKDInert
- | NFCInert
- | NFKCInert
- | SegmentStarter
- | PatternSyntax
- | PatternWhiteSpace
- | POSIXAlNum
- | POSIXBlank
- | POSIXGraph
- | POSIXPrint
- | POSIXXDigit
 
- data Decomposition_ = Decomposition
- data EastAsianWidth_ = EastAsianWidth
- data GeneralCategory_ = GeneralCategory
- data HangulSyllableType_ = HangulSyllableType
- data JoiningGroup_ = JoiningGroup
- data JoiningType_ = JoiningType
- data NumericType_ = NumericType
- data CanonicalCombiningClass_ = CanonicalCombiningClass
- data LeadCanonicalCombiningClass_ = LeadCanonicalCombiningClass
- data TrailingCanonicalCombiningClass_ = TrailingCanonicalCombiningClass
- data NFCQuickCheck_ = NFCQuickCheck
- data NFDQuickCheck_ = NFDQuickCheck
- data NFKCQuickCheck_ = NFKCQuickCheck
- data NFKDQuickCheck_ = NFKDQuickCheck
- data GraphemeClusterBreak_ = GraphemeClusterBreak
- data LineBreak_ = LineBreak
- data SentenceBreak_ = SentenceBreak
- data WordBreak_ = WordBreak
- data BlockCode = NoBlock
- | BasicLatin
- | Latin1Supplement
- | LatinExtendedA
- | LatinExtendedB
- | IPAExtensions
- | SpacingModifierLetters
- | CombiningDiacriticalMarks
- | GreekAndCoptic
- | Cyrillic
- | Armenian
- | Hebrew
- | Arabic
- | Syriac
- | Thaana
- | Devanagari
- | Bengali
- | Gurmukhi
- | Gujarati
- | Oriya
- | Tamil
- | Telugu
- | Kannada
- | Malayalam
- | Sinhala
- | Thai
- | Lao
- | Tibetan
- | Myanmar
- | Georgian
- | HangulJamo
- | Ethiopic
- | Cherokee
- | UnifiedCanadianAboriginalSyllabics
- | Ogham
- | Runic
- | Khmer
- | Mongolian
- | LatinExtendedAdditional
- | GreekExtended
- | GeneralPunctuation
- | SuperscriptsAndSubscripts
- | CurrencySymbols
- | CombiningDiacriticalMarksForSymbols
- | LetterlikeSymbols
- | NumberForms
- | Arrows
- | MathematicalOperators
- | MiscellaneousTechnical
- | ControlPictures
- | OpticalCharacterRecognition
- | EnclosedAlphanumerics
- | BoxDrawing
- | BlockElements
- | GeometricShapes
- | MiscellaneousSymbols
- | Dingbats
- | BraillePatterns
- | CJKRadicalsSupplement
- | KangxiRadicals
- | IdeographicDescriptionCharacters
- | CJKSymbolsAndPunctuation
- | Hiragana
- | Katakana
- | Bopomofo
- | HangulCompatibilityJamo
- | Kanbun
- | BopomofoExtended
- | EnclosedCJKLettersAndMonths
- | CJKCompatibility
- | CJKUnifiedIdeographsExtensionA
- | CJKUnifiedIdeographs
- | YiSyllables
- | YiRadicals
- | HangulSyllables
- | HighSurrogates
- | HighPrivateUseSurrogates
- | LowSurrogates
- | PrivateUseArea
- | CJKCompatibilityIdeographs
- | AlphabeticPresentationForms
- | ArabicPresentationFormsA
- | CombiningHalfMarks
- | CJKCompatibilityForms
- | SmallFormVariants
- | ArabicPresentationFormsB
- | Specials
- | HalfwidthAndFullwidthForms
- | OldItalic
- | Gothic
- | Deseret
- | ByzantineMusicalSymbols
- | MusicalSymbols
- | MathematicalAlphanumericSymbols
- | CJKUnifiedIdeographsExtensionB
- | CJKCompatibilityIdeographsSupplement
- | Tags
- | CyrillicSupplement
- | Tagalog
- | Hanunoo
- | Buhid
- | Tagbanwa
- | MiscellaneousMathematicalSymbolsA
- | SupplementalArrowsA
- | SupplementalArrowsB
- | MiscellaneousMathematicalSymbolsB
- | SupplementalMathematicalOperators
- | KatakanaPhoneticExtensions
- | VariationSelectors
- | SupplementaryPrivateUseAreaA
- | SupplementaryPrivateUseAreaB
- | Limbu
- | TaiLe
- | KhmerSymbols
- | PhoneticExtensions
- | MiscellaneousSymbolsAndArrows
- | YijingHexagramSymbols
- | LinearBSyllabary
- | LinearBIdeograms
- | AegeanNumbers
- | Ugaritic
- | Shavian
- | Osmanya
- | CypriotSyllabary
- | TaiXuanJingSymbols
- | VariationSelectorsSupplement
- | AncientGreekMusicalNotation
- | AncientGreekNumbers
- | ArabicSupplement
- | Buginese
- | CJKStrokes
- | CombiningDiacriticalMarksSupplement
- | Coptic
- | EthiopicExtended
- | EthiopicSupplement
- | GeorgianSupplement
- | Glagolitic
- | Kharoshthi
- | ModifierToneLetters
- | NewTaiLue
- | OldPersian
- | PhoneticExtensionsSupplement
- | SupplementalPunctuation
- | SylotiNagri
- | Tifinagh
- | VerticalForms
- | N'Ko
- | Balinese
- | LatinExtendedC
- | LatinExtendedD
- | PhagsPa
- | Phoenician
- | Cuneiform
- | CuneiformNumbersAndPunctuation
- | CountingRodNumerals
- | Sundanese
- | Lepcha
- | OlChiki
- | CyrillicExtendedA
- | Vai
- | CyrillicExtendedB
- | Saurashtra
- | KayahLi
- | Rejang
- | Cham
- | AncientSymbols
- | PhaistosDisc
- | Lycian
- | Carian
- | Lydian
- | MahjongTiles
- | DominoTiles
 
- data Direction = LeftToRight
- | RightToLeft
- | EuropeanNumber
- | EuropeanNumberSeparator
- | EuropeanNumberTerminator
- | ArabicNumber
- | CommonNumberSeparator
- | BlockSeparator
- | SegmentSeparator
- | WhiteSpaceNeutral
- | OtherNeutral
- | LeftToRightEmbedding
- | LeftToRightOverride
- | RightToLeftArabic
- | RightToLeftEmbedding
- | RightToLeftOverride
- | PopDirectionalFormat
- | DirNonSpacingMark
- | BoundaryNeutral
 
- data Decomposition
- data EastAsianWidth
- data GeneralCategory = GeneralOtherType
- | UppercaseLetter
- | LowercaseLetter
- | TitlecaseLetter
- | ModifierLetter
- | OtherLetter
- | NonSpacingMark
- | EnclosingMark
- | CombiningSpacingMark
- | DecimalDigitNumber
- | LetterNumber
- | OtherNumber
- | SpaceSeparator
- | LineSeparator
- | ParagraphSeparator
- | ControlChar
- | FormatChar
- | PrivateUseChar
- | Surrogate
- | DashPunctuation
- | StartPunctuation
- | EndPunctuation
- | ConnectorPunctuation
- | OtherPunctuation
- | MathSymbol
- | CurrencySymbol
- | ModifierSymbol
- | OtherSymbol
- | InitialPunctuation
- | FinalPunctuation
 
- data HangulSyllableType = LeadingJamo
- | VowelJamo
- | TrailingJamo
- | LVSyllable
- | LVTSyllable
 
- data JoiningGroup = Ain
- | Alaph
- | Alef
- | Beh
- | Beth
- | Dal
- | DalathRish
- | E
- | Feh
- | FinalSemkath
- | Gaf
- | Gamal
- | Hah
- | HamzaOnHehGoal
- | He
- | Heh
- | HehGoal
- | Heth
- | Kaf
- | Kaph
- | KnottedHeh
- | Lam
- | Lamadh
- | Meem
- | Mim
- | Noon
- | Nun
- | Pe
- | Qaf
- | Qaph
- | Reh
- | ReversedPe
- | Sad
- | Sadhe
- | Seen
- | Semkath
- | Shin
- | SwashKaf
- | SyriacWaw
- | Tah
- | Taw
- | TehMarbuta
- | Teth
- | Waw
- | Yeh
- | YehBarree
- | YehWithTail
- | Yudh
- | YudhHe
- | Zain
- | Fe
- | Khaph
- | Zhain
- | BurushaskiYehBarree
 
- data JoiningType = JoinCausing
- | DualJoining
- | LeftJoining
- | RightJoining
- | Transparent
 
- data NumericType
- data GraphemeClusterBreak
- data LineBreak = Ambiguous
- | LBAlphabetic
- | BreakBoth
- | BreakAfter
- | BreakBefore
- | MandatoryBreak
- | ContingentBreak
- | ClosePunctuation
- | CombiningMark
- | CarriageReturn
- | Exclamation
- | Glue
- | LBHyphen
- | LBIdeographic
- | Inseparable
- | InfixNumeric
- | LineFeed
- | Nonstarter
- | Numeric
- | OpenPunctuation
- | PostfixNumeric
- | PrefixNumeric
- | Quotation
- | ComplexContext
- | LBSurrogate
- | Space
- | BreakSymbols
- | Zwspace
- | NextLine
- | WordJoiner
- | H2
- | H3
- | JL
- | JT
- | JV
 
- data SentenceBreak
- data WordBreak = WBALetter
- | WBFormat
- | WBKatakana
- | WBMidLetter
- | WBMidNum
- | WBNumeric
- | WBExtendNumLet
- | WBCR
- | WBExtend
- | WBLF
- | WBMidNumLet
- | WBNewline
 
- blockCode :: Char -> BlockCode
- charFullName :: Char -> String
- charName :: Char -> String
- charFromFullName :: String -> Maybe Char
- charFromName :: String -> Maybe Char
- combiningClass :: Char -> Int
- direction :: Char -> Direction
- property :: Property p v => p -> Char -> v
- isoComment :: Char -> String
- isMirrored :: Char -> Bool
- mirror :: Char -> Char
- digitToInt :: Char -> Maybe Int
- numericValue :: Char -> Maybe Double
Working with character properties
The property function provides the main view onto the Unicode Character
 Database.  Because Unicode character properties have a variety of types,
 the property function is polymorphic.  The type of its first argument
 dictates the type of its result, by use of the Property typeclass.
For instance, property Alphabetic :: Char -> Bool, while property
 NFCQuickCheck :: Char -> Maybe Bool.
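As a minimal sketch of this polymorphism, the two helpers below fix the property identifier and let the Property class determine the result type. The helper names isAlphabetic and nfcQuickCheck are illustrative and not part of this module.

```haskell
import Data.Text.ICU.Char (Bool_(..), NFCQuickCheck_(..), property)

-- Alphabetic is a Bool_ identifier, so the result type is Bool.
isAlphabetic :: Char -> Bool
isAlphabetic = property Alphabetic

-- NFCQuickCheck has its own identifier type; its result is Maybe Bool.
nfcQuickCheck :: Char -> Maybe Bool
nfcQuickCheck = property NFCQuickCheck

main :: IO ()
main = do
  print (map isAlphabetic "a1 ")
  print (nfcQuickCheck 'a')
```

Because of the functional dependency p -> v, the type signatures on the helpers are optional; they are written out here only to make the selected result types visible.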
class Property p v | p -> v
Property identifier types
data Bool_
Constructors
| Alphabetic | |
| ASCIIHexDigit | 0-9, A-F, a-f | 
| BidiControl | Format controls which have specific functions in the Bidi Algorithm. | 
| BidiMirrored | Characters that may change display in RTL text. | 
| Dash | Variations of dashes. | 
| DefaultIgnorable | Ignorable in most processing. | 
| Deprecated | The usage of deprecated characters is strongly discouraged. | 
| Diacritic | Characters that linguistically modify the meaning of another character to which they apply. | 
| Extender | Extend the value or shape of a preceding alphabetic character, e.g. length and iteration marks. | 
| FullCompositionExclusion | |
| GraphemeBase | For programmatic determination of grapheme cluster boundaries. | 
| GraphemeExtend | For programmatic determination of grapheme cluster boundaries. | 
| GraphemeLink | For programmatic determination of grapheme cluster boundaries. | 
| HexDigit | Characters commonly used for hexadecimal numbers. | 
| Hyphen | Dashes used to mark connections between pieces of words, plus the Katakana middle dot. | 
| IDContinue | Characters that can continue an identifier. | 
| IDStart | Characters that can start an identifier. | 
| Ideographic | CJKV ideographs. | 
| IDSBinaryOperator | For programmatic determination of Ideographic Description Sequences. | 
| IDSTrinaryOperator | |
| JoinControl | Format controls for cursive joining and ligation. | 
| LogicalOrderException | Characters that do not use logical order and require special handling in most processing. | 
| Lowercase | |
| Math | |
| NonCharacter | Code points that are explicitly defined as illegal for the encoding of characters. | 
| QuotationMark | |
| Radical | For programmatic determination of Ideographic Description Sequences. | 
| SoftDotted | Characters with a soft dot, like i or j. An accent placed on these characters causes the dot to disappear. | 
| TerminalPunctuation | Punctuation characters that generally mark the end of textual units. | 
| UnifiedIdeograph | For programmatic determination of Ideographic Description Sequences. | 
| Uppercase | |
| WhiteSpace | |
| XidContinue | |
| XidStart | |
| CaseSensitive | Either the source of a case mapping or in the target of a case mapping. Not the same as the general category  | 
| STerm | Sentence Terminal. Used in UAX #29: Text Boundaries http://www.unicode.org/reports/tr29/. | 
| VariationSelector | Indicates all those characters that qualify as Variation Selectors. For details on the behavior of these characters, see http://unicode.org/Public/UNIDATA/StandardizedVariants.html and 15.6 Variation Selectors. | 
| NFDInert | ICU-specific property for characters that are inert under NFD, i.e. they do not interact with adjacent characters. Used for example in normalizing transforms in incremental mode to find the boundary of safely normalizable text despite possible text additions. | 
| NFKDInert | ICU-specific property for characters that are inert under NFKD, i.e. they do not interact with adjacent characters. | 
| NFCInert | ICU-specific property for characters that are inert under NFC, i.e. they do not interact with adjacent characters. | 
| NFKCInert | ICU-specific property for characters that are inert under NFKC, i.e. they do not interact with adjacent characters. | 
| SegmentStarter | ICU-specific property for characters that are starters in terms of Unicode normalization and combining character sequences. | 
| PatternSyntax | See UAX #31 Identifier and Pattern Syntax http://www.unicode.org/reports/tr31/. | 
| PatternWhiteSpace | See UAX #31 Identifier and Pattern Syntax http://www.unicode.org/reports/tr31/. | 
| POSIXAlNum | Alphanumeric character class. | 
| POSIXBlank | Blank character class. | 
| POSIXGraph | Graph character class. | 
| POSIXPrint | Printable character class. | 
| POSIXXDigit | Hex digit character class. | 
data Decomposition_
Constructors
| Decomposition | 
data EastAsianWidth_
Constructors
| EastAsianWidth | 
data GeneralCategory_
Constructors
| GeneralCategory | 
data HangulSyllableType_
Constructors
| HangulSyllableType | 
data JoiningGroup_
Constructors
| JoiningGroup | 
Combining class
data CanonicalCombiningClass_
Constructors
| CanonicalCombiningClass | 
data LeadCanonicalCombiningClass_
Constructors
| LeadCanonicalCombiningClass | 
data TrailingCanonicalCombiningClass_
Constructors
| TrailingCanonicalCombiningClass | 
Normalization checking
data NFKCQuickCheck_
Constructors
| NFKCQuickCheck | 
data NFKDQuickCheck_
Constructors
| NFKDQuickCheck | 
Text boundaries
data GraphemeClusterBreak_
Constructors
| GraphemeClusterBreak | 
data SentenceBreak_
Constructors
| SentenceBreak | 
Property value types
data BlockCode
Descriptions of Unicode blocks.
Constructors
data Direction
The language directional property of a character set.
Constructors
data Decomposition
data EastAsianWidth
data GeneralCategory
Constructors
data HangulSyllableType
Constructors
| LeadingJamo | |
| VowelJamo | |
| TrailingJamo | |
| LVSyllable | |
| LVTSyllable | 
data JoiningGroup
Constructors
data JoiningType
Constructors
| JoinCausing | |
| DualJoining | |
| LeftJoining | |
| RightJoining | |
| Transparent | 
data NumericType
Text boundaries
Constructors
data SentenceBreak
Functions
blockCode :: Char -> BlockCode
Return the Unicode allocation block that contains the given character.
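A small sketch of how blockCode might be used to classify characters by block. The helper isCyrillicBlock is hypothetical, and the example assumes that BlockCode has Eq and Show instances.

```haskell
import Data.Text.ICU.Char (BlockCode(..), blockCode)

-- Hypothetical helper: does the character live in the Cyrillic block?
-- Assumes an Eq instance for BlockCode.
isCyrillicBlock :: Char -> Bool
isCyrillicBlock c = blockCode c == Cyrillic

main :: IO ()
main = do
  print (blockCode 'a')              -- assumes a Show instance for BlockCode
  print (isCyrillicBlock '\x0416')   -- U+0416 CYRILLIC CAPITAL LETTER ZHE
  print (isCyrillicBlock 'a')
```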
charFullName :: Char -> String
Return the full name of a Unicode character.
Compared to charName, this function gives each Unicode code point
 a unique extended name. Extended names are lowercase followed by an
 uppercase hexadecimal number, within angle brackets.
charName :: Char -> String
Return the name of a Unicode character.
The names of all unassigned characters are empty.
The name contains only invariant characters like A-Z, 0-9, space, and '-'.
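The two name functions can be combined, for example to always have something printable. The sketch below uses a hypothetical helper, displayName, that falls back to the extended name whenever charName yields the empty string.

```haskell
import Data.Text.ICU.Char (charFullName, charName)

-- Hypothetical helper: prefer the regular name, and fall back to the
-- extended name for code points where charName is empty.
displayName :: Char -> String
displayName c
  | null name = charFullName c
  | otherwise = name
  where
    name = charName c

main :: IO ()
main = mapM_ (putStrLn . displayName) "a\t"
```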
charFromFullName :: String -> Maybe Char
Find a Unicode character by its full or extended name, and return its code point value.
The name is matched exactly and completely.
A Unicode 1.0 name is matched only if it differs from the modern name.
Compared to charFromName, this function gives each Unicode code
 point a unique extended name. Extended names are lowercase followed
 by an uppercase hexadecimal number, within angle brackets.
charFromName :: String -> Maybe Char
Find a Unicode character by its full name, and return its code point value.
The name is matched exactly and completely.
A Unicode 1.0 name is matched only if it differs from the modern name. Unicode names are all uppercase.
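As an illustration, a name lookup can be paired with charName to check that a character's name maps back to the same character. The helper roundTrips is hypothetical and not part of this module.

```haskell
import Data.Text.ICU.Char (charFromName, charName)

-- Hypothetical helper: check that a character's name maps back to it.
-- Characters whose charName is empty cannot round-trip this way.
roundTrips :: Char -> Bool
roundTrips c = charFromName (charName c) == Just c

main :: IO ()
main = do
  print (charFromName "LATIN SMALL LETTER A")
  print (roundTrips 'a')
```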
combiningClass :: Char -> Int
direction :: Char -> Direction
Return the bidirectional category value for the code point, which is used in the Unicode bidirectional algorithm (UAX #9 http://www.unicode.org/reports/tr9/).
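A short sketch of pattern matching on the result of direction. The helper isStrongRTL is hypothetical; it treats RightToLeft and RightToLeftArabic as the strong right-to-left categories.

```haskell
import Data.Text.ICU.Char (Direction(..), direction)

-- Hypothetical helper: True for the strong right-to-left categories.
isStrongRTL :: Char -> Bool
isStrongRTL c = case direction c of
  RightToLeft       -> True
  RightToLeftArabic -> True
  _                 -> False

main :: IO ()
main = do
  print (isStrongRTL 'a')        -- Latin letter
  print (isStrongRTL '\x05D0')   -- U+05D0 HEBREW LETTER ALEF
  print (isStrongRTL '\x0627')   -- U+0627 ARABIC LETTER ALEF
```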
isoComment :: Char -> String
Return the ISO 10646 comment for a character.
If a character does not have an associated comment, the empty string is returned.
The ISO 10646 comment is an informative field in the Unicode
 Character Database (UnicodeData.txt field 11) and is from the ISO
 10646 names list.
isMirrored :: Char -> Bool
Determine whether the code point has the BidiMirrored property.  This
 property is set for characters that are commonly used in Right-To-Left
 contexts and need to be displayed with a mirrored glyph.
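The sketch below combines isMirrored with mirror from this module. It assumes that mirror returns the character carrying the mirror-image glyph, as the underlying ICU function u_charMirror does; the helper mirrorAll is hypothetical.

```haskell
import Data.Text.ICU.Char (isMirrored, mirror)

-- Hypothetical helper: swap each mirrored character for its counterpart,
-- leaving all other characters unchanged.
mirrorAll :: String -> String
mirrorAll = map (\c -> if isMirrored c then mirror c else c)

main :: IO ()
main = putStrLn (mirrorAll "(a[b]<c>)")
```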
Conversion to numbers
digitToInt :: Char -> Maybe Int
Return the decimal digit value of a decimal digit character.
 Such characters have the general category Nd (decimal digit
 numbers) and a NumericType of NTDecimal.
No digit values are returned for any Han characters, because Han
 number characters are often used with a special Chinese-style
 number format (with characters for powers of 10 in between) instead
 of in decimal-positional notation.  Unicode 4 explicitly assigns
 Han number characters a NumericType of NTNumeric instead of
 NTDecimal.
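For example, digitToInt can be folded over a string to read a decimal number written in any script. This is a minimal sketch; the helper readDecimal is hypothetical and does no overflow checking.

```haskell
import Data.Text.ICU.Char (digitToInt)

-- Hypothetical helper: read a string of decimal digits from any script,
-- yielding Nothing if any character is not a decimal digit.
-- An empty string yields Just 0.
readDecimal :: String -> Maybe Int
readDecimal = foldl step (Just 0)
  where
    step acc c = do
      n <- acc
      d <- digitToInt c
      pure (n * 10 + d)

main :: IO ()
main = do
  print (readDecimal "2024")
  print (readDecimal "\x0663\x0664")  -- ARABIC-INDIC DIGITS THREE and FOUR
  print (readDecimal "\x4E09")        -- Han character for three
```

The Han case is expected to fail, since no digit values are returned for Han number characters.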