Copyright | (c) 2020 Composewell Technologies and Contributors |
---|---|
License | Apache-2.0 |
Maintainer | streamly@composewell.com |
Stability | experimental |
Safe Haskell | Safe-Inferred |
Language | Haskell2010 |
General character property related functions.
Synopsis
- data GeneralCategory
- = UppercaseLetter
- | LowercaseLetter
- | TitlecaseLetter
- | ModifierLetter
- | OtherLetter
- | NonSpacingMark
- | SpacingCombiningMark
- | EnclosingMark
- | DecimalNumber
- | LetterNumber
- | OtherNumber
- | ConnectorPunctuation
- | DashPunctuation
- | OpenPunctuation
- | ClosePunctuation
- | InitialQuote
- | FinalQuote
- | OtherPunctuation
- | MathSymbol
- | CurrencySymbol
- | ModifierSymbol
- | OtherSymbol
- | Space
- | LineSeparator
- | ParagraphSeparator
- | Control
- | Format
- | Surrogate
- | PrivateUse
- | NotAssigned
- generalCategoryAbbr :: GeneralCategory -> String
- generalCategory :: Char -> GeneralCategory
- isAlphabetic :: Char -> Bool
- isAlphaNum :: Char -> Bool
- isControl :: Char -> Bool
- isMark :: Char -> Bool
- isPrint :: Char -> Bool
- isPunctuation :: Char -> Bool
- isSeparator :: Char -> Bool
- isSymbol :: Char -> Bool
- isWhiteSpace :: Char -> Bool
- isLetter :: Char -> Bool
- isSpace :: Char -> Bool
- isAscii :: Char -> Bool
- isLatin1 :: Char -> Bool
- isAsciiUpper :: Char -> Bool
- isAsciiLower :: Char -> Bool
- isJamo :: Char -> Bool
- jamoNCount :: Int
- jamoLFirst :: Int
- jamoLCount :: Int
- jamoLIndex :: Char -> Maybe Int
- jamoLLast :: Int
- jamoVFirst :: Int
- jamoVCount :: Int
- jamoVIndex :: Char -> Maybe Int
- jamoVLast :: Int
- jamoTFirst :: Int
- jamoTCount :: Int
- jamoTIndex :: Char -> Maybe Int
- jamoTLast :: Int
- hangulFirst :: Int
- hangulLast :: Int
- isHangul :: Char -> Bool
- isHangulLV :: Char -> Bool
Unicode general categories
data GeneralCategory Source #
Unicode General Categories.
These classes are defined in the [Unicode Character Database](http:/www.unicode.orgreportstr44tr44-14.html#GC_Values_Table), part of the Unicode standard
Note: the classes must be in the same order they are listed in the Unicode Standard,
because some functions (e.g. generalCategory
) rely on the Enum
instance.
Since: 0.3.0
UppercaseLetter |
|
LowercaseLetter |
|
TitlecaseLetter |
|
ModifierLetter |
|
OtherLetter |
|
NonSpacingMark |
|
SpacingCombiningMark |
|
EnclosingMark |
|
DecimalNumber |
|
LetterNumber |
|
OtherNumber |
|
ConnectorPunctuation |
|
DashPunctuation |
|
OpenPunctuation |
|
ClosePunctuation |
|
InitialQuote |
|
FinalQuote |
|
OtherPunctuation |
|
MathSymbol |
|
CurrencySymbol |
|
ModifierSymbol |
|
OtherSymbol |
|
Space |
|
LineSeparator |
|
ParagraphSeparator |
|
Control |
|
Format |
|
Surrogate |
|
PrivateUse |
|
NotAssigned |
|
Instances
generalCategoryAbbr :: GeneralCategory -> String Source #
Abbreviation of GeneralCategory
used in the Unicode standard.
Since: 0.3.0
generalCategory :: Char -> GeneralCategory Source #
The Unicode general category of the character.
This property is defined in the column 2 of the UnicodeData
table.
This relies on the Enum
instance of GeneralCategory
, which must remain in the
same order as the categories are presented in the Unicode standard.
show (generalCategory c) == show (Data.Char.generalCategory c)
Since: 0.3.0
Character classification
isAlphabetic :: Char -> Bool Source #
Returns True
for alphabetic Unicode characters (lower-case, upper-case
and title-case letters, plus letters of caseless scripts and modifiers
letters).
Note: this function is not equivalent to
isAlpha
/ isLetter
:
isAlpha
matches the following general categories:UppercaseLetter
(Lu
)LowercaseLetter
(Ll
)TitlecaseLetter
(Lt
)ModifierLetter
(Lm
)OtherLetter
(Lo
)
whereas
isAlphabetic
matches:Uppercase
propertyLowercase
propertyTitlecaseLetter
(Lt
)ModifierLetter
(Lm
)OtherLetter
(Lo
)LetterNumber
(Nl
)Other_Alphabetic
property
Since: 0.3.0
isAlphaNum :: Char -> Bool Source #
Selects alphabetic or numeric Unicode characters.
This function returns True
if its argument has one of the
following GeneralCategory
s, or False
otherwise:
UppercaseLetter
LowercaseLetter
TitlecaseLetter
ModifierLetter
OtherLetter
DecimalNumber
LetterNumber
OtherNumber
isAlphaNum c == Data.Char.isAlphaNum c
Since: 0.3.0
isControl :: Char -> Bool Source #
Selects control characters, which are the non-printing characters of the Latin-1 subset of Unicode.
This function returns True
if its argument has the GeneralCategory
Control
.
isControl c == Data.Char.isControl c
Since: 0.3.0
isMark :: Char -> Bool Source #
Selects Unicode mark characters, for example accents and the like, which combine with preceding characters.
This function returns True
if its argument has one of the
following GeneralCategory
s, or False
otherwise:
isMark c == Data.Char.isMark c
Since: 0.3.0
isPrint :: Char -> Bool Source #
Selects printable Unicode characters (letters, numbers, marks, punctuation, symbols and spaces).
This function returns False
if its argument has one of the
following GeneralCategory
s, or True
otherwise:
isPrint c == Data.Char.isPrint c
Since: 0.3.0
isPunctuation :: Char -> Bool Source #
Selects Unicode punctuation characters, including various kinds of connectors, brackets and quotes.
This function returns True
if its argument has one of the
following GeneralCategory
s, or False
otherwise:
ConnectorPunctuation
DashPunctuation
OpenPunctuation
ClosePunctuation
InitialQuote
FinalQuote
OtherPunctuation
isPunctuation c == Data.Char.isPunctuation c
Since: 0.3.0
isSeparator :: Char -> Bool Source #
Selects Unicode space and separator characters.
This function returns True
if its argument has one of the
following GeneralCategory
s, or False
otherwise:
isSeparator c == Data.Char.isSeparator c
Since: 0.3.0
isSymbol :: Char -> Bool Source #
Selects Unicode symbol characters, including mathematical and currency symbols.
This function returns True
if its argument has one of the
following GeneralCategory
s, or False
otherwise:
* MathSymbol
* CurrencySymbol
* ModifierSymbol
* OtherSymbol
isSymbol c == Data.Char.isSymbol c
Since: 0.3.0
isWhiteSpace :: Char -> Bool Source #
Returns True
for any whitespace characters, and the control
characters \t
, \n
, \r
, \f
, \v
.
See: Unicode White_Space
.
Note: isWhiteSpace
is not equivalent to isSpace
.
isWhiteSpace
selects the same characters from isSpace
plus the following:
U+0085
NEXT LINE (NEL)U+2028
LINE SEPARATORU+2029
PARAGRAPH SEPARATOR
Since: 0.3.0
isLetter :: Char -> Bool Source #
Deprecated: Use isAlphabetic instead. Note that the behavior of this function does not match base:Data.Char.isLetter. See Unicode.Char.General.Compat for behavior compatible with base:Data.Char.
Returns True
for alphabetic Unicode characters (lower-case, upper-case
and title-case letters, plus letters of caseless scripts and modifiers
letters).
Since: 0.1.0
isSpace :: Char -> Bool Source #
Deprecated: Use isWhiteSpace instead. Note that the behavior of this function does not match base:Data.Char.isSpace. See Unicode.Char.General.Compat for behavior compatible with base:Data.Char.
Returns True
for any whitespace characters, and the control
characters \t
, \n
, \r
, \f
, \v
.
Since: 0.1.0
Re-export
Selects the first 128 characters of the Unicode character set, corresponding to the ASCII character set.
Selects the first 256 characters of the Unicode character set, corresponding to the ISO 8859-1 (Latin-1) character set.
isAsciiUpper :: Char -> Bool #
isAsciiLower :: Char -> Bool #
Korean Hangul Characters
The Hangul script used in the Korean writing system consists of individual consonant and vowel letters (jamo) that are visually combined into square display cells to form entire syllable blocks. Hangul syllables may be encoded directly as precomposed combinations of individual jamo or as decomposed sequences of conjoining jamo. Modern Hangul syllable blocks can be expressed with either two or three jamo, either in the form consonant + vowel or in the form consonant + vowel + consonant. The leading consonant is represented as L, the vowel as V and the trailing consonant as T.
The Unicode Standard contains both a large set of precomposed modern Hangul syllables and a set of conjoining Hangul jamo, which can be used to encode archaic Korean syllable blocks as well as modern Korean syllable blocks.
Hangul characters can be composed or decomposed algorithmically instead of via mappings. These APIs are used mainly for Unicode normalization of Hangul text.
Please refer to the following resources for more information:
- The
Hangul
section of theEast Asia
chapter of the Unicode Standard - Conformance chapter of the Unicode Standard
- Unicode® Standard Annex #15 - Unicode Normalization Forms
- UCD file
HangulSyllableType.txt
- https://en.wikipedia.org/wiki/Hangul_Jamo_(Unicode_block)
- https://en.wikipedia.org/wiki/List_of_Hangul_jamo
Conjoining Jamo
Jamo L, V and T letters.
isJamo :: Char -> Bool Source #
Determine whether a character is a jamo L, V or T character.
Since: 0.1.0
jamoNCount :: Int Source #
Total count of all jamo characters.
jamoNCount = jamoVCount * jamoTCount
Since: 0.1.0
Jamo Leading (L)
jamoLFirst :: Int Source #
First leading consonant jamo.
Since: 0.1.0
jamoLCount :: Int Source #
Total count of leading consonant jamo.
Since: 0.3.0
jamoLIndex :: Char -> Maybe Int Source #
Given a Unicode character, if it is a leading jamo, return its index in
the list of leading jamo consonants, otherwise return Nothing
.
Since: 0.1.0
Jamo Vowel (V)
jamoVFirst :: Int Source #
First vowel jamo.
Since: 0.1.0
jamoVCount :: Int Source #
Total count of vowel jamo.
Since: 0.1.0
jamoVIndex :: Char -> Maybe Int Source #
Given a Unicode character, if it is a vowel jamo, return its index in the
list of vowel jamo, otherwise return Nothing
.
Since: 0.1.0
Jamo Trailing (T)
jamoTFirst :: Int Source #
The first trailing consonant jamo.
Note that jamoTFirst
does not represent a valid T, it represents a missing
T i.e. LV without a T. See comments under jamoTIndex
.
Since: 0.1.0
jamoTCount :: Int Source #
Total count of trailing consonant jamo.
Since: 0.1.0
jamoTIndex :: Char -> Maybe Int Source #
Given a Unicode character, if it is a trailing jamo consonant, return its
index in the list of trailing jamo consonants, otherwise return Nothing
.
Note that index 0 is not a valid index for a trailing consonant. Index 0 corresponds to an LV syllable, without a T. See "Hangul Syllable Decomposition" in the Conformance chapter of the Unicode standard for more details.
Since: 0.1.0
Hangul Syllables
Precomposed Hangul syllables.
hangulFirst :: Int Source #
Codepoint of the first pre-composed Hangul character.
Since: 0.1.0
hangulLast :: Int Source #
Codepoint of the last Hangul character.
Since: 0.1.0
isHangul :: Char -> Bool Source #
Determine if the given character is a precomposed Hangul syllable.
Since: 0.1.0
isHangulLV :: Char -> Bool Source #
Determine if the given character is a Hangul LV syllable.
Note: this function requires a precomposed Hangul syllable but does not
check it. Use isHangul
to check the input character before passing it to
isHangulLV
.
Since: 0.1.0