text-icu-0.8.0.2: Bindings to the ICU library
Copyright      (c) 2015 Ben Hamilton
License        BSD-style
Maintainer     bgertzfield@gmail.com
Stability      experimental
Portability    GHC
Safe Haskell   Safe-Inferred
Language       Haskell98

Data.Text.ICU.Spoof

Description

String spoofing (confusability) checks for Unicode, implemented as bindings to the International Components for Unicode (ICU) uspoof library.

See UTR #36 and UTS #39 for detailed information about the underlying algorithms and databases used by this module.

Unicode spoof checking API

The spoofCheck, areConfusable, and getSkeleton functions analyze Unicode text for visually confusable (or "spoof") characters.

For example, Latin, Cyrillic, and Greek all contain unique Unicode values which appear nearly identical on-screen:

     A    0041    LATIN CAPITAL LETTER A
     Α    0391    GREEK CAPITAL LETTER ALPHA
     А    0410    CYRILLIC CAPITAL LETTER A
     Ꭺ    13AA    CHEROKEE LETTER GO
     ᴀ    1D00    LATIN LETTER SMALL CAPITAL A
     ᗅ    15C5    CANADIAN SYLLABICS CARRIER GHO
     Ａ    FF21    FULLWIDTH LATIN CAPITAL LETTER A
     𐊠    102A0   CARIAN LETTER A
     𝐀    1D400   MATHEMATICAL BOLD CAPITAL A

and so on. To check a string for visually confusable characters:

  1. open an MSpoof
  2. optionally configure it with setChecks, setRestrictionLevel, and/or setAllowedLocales, then
  3. spoofCheck a single string, use areConfusable to check if two strings could be confused for each other, or use getSkeleton to precompute a "skeleton" string (similar to a hash code) which can be cached and re-used to quickly check (using Unicode string comparison) if two strings are confusable.
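The steps above can be sketched as follows. This is a minimal illustration rather than a recommended design: the helper name checkPair is hypothetical, and it assumes that CheckOK from areConfusable means the pair was not flagged.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Text.ICU.Spoof
  ( SpoofCheckResult (..), areConfusable, getSkeleton, open, spoofCheck )

-- | Hypothetical helper: run all three kinds of query on one checker.
-- Returns (first string passed spoofCheck, pair flagged as confusable,
-- skeletons are equal).
checkPair :: Text -> Text -> IO (Bool, Bool, Bool)
checkPair a b = do
  spoof  <- open                          -- step 1: open an MSpoof
  -- step 2 (optional configuration) is skipped: defaults are used
  single <- spoofCheck spoof a            -- step 3: check one string
  pair   <- areConfusable spoof a b       -- step 3: compare two strings
  skelA  <- getSkeleton spoof Nothing a   -- step 3: precompute skeletons
  skelB  <- getSkeleton spoof Nothing b
  pure (isOK single, not (isOK pair), skelA == skelB)
  where
    isOK CheckOK = True
    isOK _       = False
```

A caller might run checkPair on a new identifier and an existing one, then treat a flagged pair or equal skeletons as a reason to reject the new identifier.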

By default, these methods will use ICU's bundled copy of confusables.txt and confusablesWholeScript.txt, which could be out of date. To provide your own confusables databases, use openFromSource. (To avoid repeatedly parsing these databases, you can then serialize your configured MSpoof and later openFromSerialized to load the pre-parsed databases.)

data MSpoof

Configurable spoof checker wrapping an opaque handle and optionally wrapping a previously serialized instance.

data OpenFromSourceParseError

Exception thrown when openFromSource fails to parse one of the input files.

Constructors

OpenFromSourceParseError 

Fields

  • errFile :: OpenFromSourceParseErrorFile

    The file which could not be parsed.

  • parseError :: ParseError

    Parse error encountered opening a spoof checker from source.

data SpoofCheck

Constructors

SingleScriptConfusable

Makes areConfusable report if both identifiers are from the same script and are visually confusable. Does not affect spoofCheck.

MixedScriptConfusable

Makes areConfusable report if both identifiers are visually confusable and at least one identifier contains characters from more than one script.

Makes spoofCheck report if the identifier contains multiple scripts, and is confusable with some other identifier in a single script.

WholeScriptConfusable

Makes areConfusable report if each identifier is of a different single script, and the identifiers are visually confusable.

AnyCase

By default, spoof checks assume the strings have been processed through toCaseFold and only check lower-case identifiers. If this is set, spoof checks will check both upper and lower case identifiers.

RestrictionLevel

Checks that identifiers are no looser than the specified level passed to setRestrictionLevel.

Invisible

Checks the identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark.

CharLimit

Checks whether the identifier contains only characters from a specified set (for example, via setAllowedLocales).

MixedNumbers

Checks that the identifier contains numbers from only a single script.

AllChecks

Enables all checks.

AuxInfo

Enables returning a RestrictionLevel in the SpoofCheckResult.

data SpoofCheckResult

Constructors

CheckOK

The string passed all configured spoof checks.

CheckFailed [SpoofCheck]

The string failed one or more spoof checks.

CheckFailedWithRestrictionLevel

The string failed one or more spoof checks, and failed to pass the configured restriction level.

Fields

data RestrictionLevel

Constructors

ASCII

Checks that the string contains only Unicode values in the ASCII range, U+0000 to U+007F inclusive.

SingleScriptRestrictive

Checks that the string contains only characters from a single script.

HighlyRestrictive

Checks that the string contains only characters from a single script, or from the combinations (Latin + Han + Hiragana + Katakana), (Latin + Han + Bopomofo), or (Latin + Han + Hangul).

ModeratelyRestrictive

Checks that the string contains only characters from the combinations (Latin + Cyrillic + Greek + Cherokee), (Latin + Han + Hiragana + Katakana), (Latin + Han + Bopomofo), or (Latin + Han + Hangul).

MinimallyRestrictive

Allows arbitrary mixtures of scripts.

Unrestrictive

Allows any valid identifiers, including characters outside of the Identifier Profile.

data SkeletonTypeOverride

Constructors

SkeletonSingleScript

By default, getSkeleton builds skeletons which catch visually confusable characters across multiple scripts. Pass this flag to override that behavior and build skeletons which catch visually confusable characters within a single script.

SkeletonAnyCase

By default, getSkeleton assumes the input string has already been passed through toCaseFold and is lower-case. Pass this flag to override that behavior and allow upper and lower-case strings.

Functions

open :: IO MSpoof

Open a spoof checker for checking Unicode strings for lookalike security issues with default options (all SpoofChecks except CharLimit).

openFromSerialized :: ByteString -> IO MSpoof

Open a spoof checker previously serialized to bytes using serialize. The returned MSpoof retains a reference to the ForeignPtr inside the ByteString, so ensure its contents do not change for the lifetime of the returned value.

openFromSource :: (ByteString, ByteString) -> IO MSpoof

Open a spoof checker with custom rules given the UTF-8 encoded contents of the confusables.txt and confusablesWholeScript.txt files as described in Unicode UTS #39.
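As a rough sketch, custom databases could be loaded from disk and a parse failure turned into a message. The helper name openCustom and both file paths are hypothetical. The code also assumes OpenFromSourceParseError has an Exception instance (as its description suggests), which is what allows try and show to be used on it.

```haskell
import Control.Exception (try)
import qualified Data.ByteString as B
import Data.Text.ICU.Spoof (MSpoof, OpenFromSourceParseError, openFromSource)

-- | Hypothetical helper: read both database files and open a checker.
-- A parse failure is returned as Left with the rendered exception;
-- any other exception (for example a missing file) still propagates.
openCustom :: FilePath -> FilePath -> IO (Either String MSpoof)
openCustom confusablesPath wholeScriptPath = do
  confusables <- B.readFile confusablesPath
  wholeScript <- B.readFile wholeScriptPath
  result <- try (openFromSource (confusables, wholeScript))
  pure $ case result of
    Left err    -> Left (show (err :: OpenFromSourceParseError))
    Right spoof -> Right spoof
```

The errFile and parseError fields documented above can be inspected instead of calling show when the caller needs to know which file failed.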

getSkeleton :: MSpoof -> Maybe SkeletonTypeOverride -> Text -> IO Text

Generates re-usable "skeleton" strings which can be used (via Unicode equality) to check if an identifier is confusable with some large set of existing identifiers.

If you cache the returned strings in storage, you must invalidate your cache any time the underlying confusables database changes (i.e., on ICU upgrade).

By default, assumes all input strings have been passed through toCaseFold and are lower-case. To change this, pass SkeletonAnyCase.

By default, builds skeletons which catch visually confusable characters across multiple scripts. Pass SkeletonSingleScript to override that behavior and build skeletons which catch visually confusable characters within a single script.
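One way to use skeletons is as keys in an index of already-registered identifiers, sketched below. The names SkeletonIndex, register, and rejected are hypothetical. Case folding here uses toCaseFold from Data.Text as a stand-in, which is an assumption of this sketch and not something the module prescribes.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (foldM)
import qualified Data.Map.Strict as Map
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.ICU.Spoof (MSpoof, getSkeleton, open)

-- | Hypothetical index: skeleton -> first name registered with it.
type SkeletonIndex = Map.Map Text Text

-- | Left holds the existing name that shares the skeleton;
-- Right holds the index extended with the new name.
register :: MSpoof -> SkeletonIndex -> Text -> IO (Either Text SkeletonIndex)
register spoof index name = do
  -- Fold case first, since the default skeleton assumes folded input.
  skeleton <- getSkeleton spoof Nothing (T.toCaseFold name)
  pure $ case Map.lookup skeleton index of
    Just existing -> Left existing
    Nothing       -> Right (Map.insert skeleton name index)

-- | Register names in order and return those rejected as confusable.
rejected :: [Text] -> IO [Text]
rejected names = do
  spoof <- open
  let step (index, bad) name = do
        result <- register spoof index name
        pure $ case result of
          Left _       -> (index, name : bad)
          Right index' -> (index', bad)
  (_, bad) <- foldM step (Map.empty, []) names
  pure (reverse bad)
```

If such an index is persisted, it has to be rebuilt whenever the confusables database changes, as noted above.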

getChecks :: MSpoof -> IO [SpoofCheck]

Get the checks performed by a spoof checker.

setChecks :: MSpoof -> [SpoofCheck] -> IO ()

Configure the checks performed by a spoof checker.
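For illustration, a checker can be narrowed to a couple of checks and the configuration read back. The helper names openNarrow, narrowChecks, and isInvisible are hypothetical, and the choice of Invisible and MixedNumbers is only an example.

```haskell
import Data.Text.ICU.Spoof
  ( MSpoof, SpoofCheck (..), getChecks, open, setChecks )

-- | Hypothetical helper: a checker that only looks for invisible
-- characters and digits drawn from more than one script.
openNarrow :: IO MSpoof
openNarrow = do
  spoof <- open
  setChecks spoof [Invisible, MixedNumbers]
  pure spoof

-- | Read back the checks configured by openNarrow.
narrowChecks :: IO [SpoofCheck]
narrowChecks = openNarrow >>= getChecks

-- | True for the Invisible check; pattern matching avoids relying on
-- any class instances of SpoofCheck.
isInvisible :: SpoofCheck -> Bool
isInvisible Invisible = True
isInvisible _         = False
```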

getRestrictionLevel :: MSpoof -> IO (Maybe RestrictionLevel)

Get the restriction level of a spoof checker.

setRestrictionLevel :: MSpoof -> RestrictionLevel -> IO ()

Configure the restriction level of a spoof checker.
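A small sketch of setting and reading back a restriction level follows. The helper names openWithLevel and hasLevel are hypothetical. It relies on the default checks from open, which according to the description of open include the RestrictionLevel check.

```haskell
import Data.Maybe (isJust)
import Data.Text.ICU.Spoof
  ( MSpoof, RestrictionLevel (..), getRestrictionLevel, open
  , setRestrictionLevel )

-- | Hypothetical helper: open a checker and apply the given level.
openWithLevel :: RestrictionLevel -> IO MSpoof
openWithLevel level = do
  spoof <- open
  setRestrictionLevel spoof level
  pure spoof

-- | True when the checker reports some restriction level.
hasLevel :: MSpoof -> IO Bool
hasLevel spoof = isJust <$> getRestrictionLevel spoof
```

For example, openWithLevel SingleScriptRestrictive would configure the strictest level short of ASCII.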

getAllowedLocales :: MSpoof -> IO [String]

Get the list of locale names allowed to be used with a spoof checker. (We don't use LocaleName since the root and default locales have no meaning here.)

setAllowedLocales :: MSpoof -> [String] -> IO ()

Set the list of locale names allowed to be used with a spoof checker. (We don't use LocaleName since the root and default locales have no meaning here.)
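The locale functions can be exercised as in the sketch below. The helper name openForLocales is hypothetical, and the locale names passed in are whatever the caller chooses ("en" and "ru" in the usage note are only examples). The list read back is whatever ICU reports, which may not match the input exactly.

```haskell
import Data.Text.ICU.Spoof
  ( MSpoof, getAllowedLocales, open, setAllowedLocales )

-- | Hypothetical helper: open a checker restricted to the given
-- locales and return it with the locale list the checker reports.
openForLocales :: [String] -> IO (MSpoof, [String])
openForLocales locales = do
  spoof <- open
  setAllowedLocales spoof locales
  effective <- getAllowedLocales spoof
  pure (spoof, effective)
```

A call such as openForLocales ["en", "ru"] would limit identifiers to characters used by those locales when the CharLimit check is active. In the underlying ICU C API, setting a non-empty locale list is documented to enable that check; whether a later setChecks call keeps it enabled depends on the list passed to setChecks.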

areConfusable :: MSpoof -> Text -> Text -> IO SpoofCheckResult

Check if two strings could be confused with each other.
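A minimal sketch of a pairwise check is shown below. The names confusablePair and lookalike are hypothetical, and the code assumes that any result other than CheckOK means the pair was flagged.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Text.ICU.Spoof (SpoofCheckResult (..), areConfusable, open)

-- | "paypal" with U+0430 CYRILLIC SMALL LETTER A (decimal 1072)
-- in place of the first Latin "a".
lookalike :: Text
lookalike = "p\1072ypal"

-- | Hypothetical helper: True when a default-configured checker
-- flags the two strings as confusable.
confusablePair :: Text -> Text -> IO Bool
confusablePair a b = do
  spoof  <- open
  result <- areConfusable spoof a b
  pure $ case result of
    CheckOK -> False
    _       -> True
```

Calling confusablePair "paypal" lookalike is the kind of comparison this function is meant for. The exact result depends on the confusables data shipped with the installed ICU.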

spoofCheck :: MSpoof -> Text -> IO SpoofCheckResult

Checks if a string could be confused with any other.
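The three SpoofCheckResult constructors can be handled by pattern matching, as in this sketch. The names describe and checkName are hypothetical. The empty record pattern is used for CheckFailedWithRestrictionLevel so that the code does not depend on that constructor's fields.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Text.ICU.Spoof (SpoofCheckResult (..), open, spoofCheck)

-- | Hypothetical helper: render a result as a short message.
describe :: SpoofCheckResult -> String
describe CheckOK = "passed all configured checks"
describe (CheckFailed checks) =
  "failed " ++ show (length checks) ++ " check(s)"
describe CheckFailedWithRestrictionLevel {} =
  "failed checks and did not meet the restriction level"

-- | Hypothetical helper: check one identifier with default settings.
checkName :: Text -> IO String
checkName name = do
  spoof <- open
  describe <$> spoofCheck spoof name
```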

serialize :: MSpoof -> IO ByteString

Serializes the rules in this spoof checker to a byte array, suitable for re-use by openFromSerialized.

Includes only the data provided to openFromSource. Does not include any other state or configuration, such as the checks, restriction level, or allowed locales.
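A sketch of persisting the parsed rules and loading them again follows. The helper names saveRules, loadRules, and reopenDefault are hypothetical, as is any file path passed to them.

```haskell
import qualified Data.ByteString as B
import Data.Text.ICU.Spoof (MSpoof, open, openFromSerialized, serialize)

-- | Hypothetical helper: write the serialized rules to a file.
saveRules :: FilePath -> MSpoof -> IO ()
saveRules path spoof = serialize spoof >>= B.writeFile path

-- | Hypothetical helper: open a checker from a previously saved file.
loadRules :: FilePath -> IO MSpoof
loadRules path = B.readFile path >>= openFromSerialized

-- | Round trip in memory: serialize a default checker and reopen it.
reopenDefault :: IO MSpoof
reopenDefault = open >>= serialize >>= openFromSerialized
```

Because configuration is not part of the serialized data, a checker returned by loadRules needs setChecks, setRestrictionLevel, and setAllowedLocales applied again if non-default settings are wanted.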