unicode-transforms: Unicode normalization

[ bsd3, data, library, text, unicode ]

Fast Unicode 8.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).

Versions 0.2.0, 0.2.1, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5
Change log Changelog.md
Dependencies base (>=4.7 && <5), bitarray (>=0.0.1 && <0.1), bytestring (>=0.9 && <0.11), text (>=1.1.1 && <1.3)
License BSD-3-Clause
Copyright 2016 Harendra Kumar, 2014–2015 Antonio Nikishaev
Author Harendra Kumar
Maintainer harendra.kumar@gmail.com
Revised Revision 1 made by harendra at Thu Nov 10 01:30:47 UTC 2016
Category Data, Text, Unicode
Home page http://github.com/harendra-kumar/unicode-transforms
Bug tracker https://github.com/harendra-kumar/unicode-transforms/issues
Source repo head: git clone https://github.com/harendra-kumar/unicode-transforms
Uploaded by harendra at Sun Oct 23 18:20:34 UTC 2016
Distributions Arch:0.3.5, Debian:0.3.4, LTSHaskell:0.3.5, NixOS:0.3.5, Stackage:0.3.5, openSUSE:0.3.5
Downloads 7128 total (440 in the last 30 days)
Docs uploaded by user
Build status unknown [no reports yet]

Flags

  • Developer build
  • Use text-icu for benchmark and test comparisons
  • Use llvm backend (faster) for compilation

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.


Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.


Readme for unicode-transforms-0.2.0


Unicode Transforms


Fast Unicode 8.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).

What is normalization?

Unicode characters with adornments (e.g. Á) can be represented in two different forms: as a single composed character (U+00C1 = Á) or as a sequence of decomposed characters (U+0041 (A) followed by U+0301 (combining acute accent) = Á). The two forms are encoded as different byte sequences, but to a human reader they look exactly the same.
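
The difference between the two forms can be seen without any normalization library. The following is a minimal sketch using only base; the names composed, decomposed and codePoint are illustrative and not part of this package.

```haskell
import Data.Char (ord, toUpper)
import Numeric (showHex)

-- The two representations of Á described above.
composed, decomposed :: String
composed   = "\193"     -- U+00C1
decomposed = "\65\769"  -- U+0041 followed by U+0301

-- Illustrative helper: render a character as a U+XXXX code point label.
codePoint :: Char -> String
codePoint c = "U+" ++ replicate (4 - length hex) '0' ++ hex
  where
    hex = map toUpper (showHex (ord c) "")

main :: IO ()
main = do
  -- A plain comparison sees two different strings.
  print (composed == decomposed)                 -- False
  putStrLn (unwords (map codePoint composed))    -- U+00C1
  putStrLn (unwords (map codePoint decomposed))  -- U+0041 U+0301
```

Both strings render as the same glyph, yet they differ in length and content, which is why a direct comparison fails.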

A plain byte comparison may report that two strings are different even though they are equivalent. To compare strings for equivalence, we first need to convert both into the same normalized form using the Unicode Character Database. For example:

>> :set -XOverloadedStrings
>> import Data.Text.Normalize
>> normalize NFC "\193" == normalize NFC "\65\769"
True
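
The same comparison can be wrapped in a small standalone program. This is a sketch that assumes the unicode-transforms and text packages are available; canonicallyEqual is a hypothetical helper written for illustration, not a function exported by this package.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Normalize

-- Hypothetical helper (not part of unicode-transforms): two texts are
-- treated as equal if their NFC normalizations are identical.
canonicallyEqual :: Text -> Text -> Bool
canonicallyEqual a b = normalize NFC a == normalize NFC b

main :: IO ()
main = do
  let composed   = "\193"    :: Text  -- U+00C1
      decomposed = "\65\769" :: Text  -- U+0041 U+0301
  print (composed == decomposed)               -- False: different code points
  print (canonicallyEqual composed decomposed) -- True: same NFC form
  print (T.length (normalize NFD composed))    -- 2: decomposed into A + accent
```

Normalizing both sides to the same form is what matters here; NFD would serve equally well for the comparison as long as it is applied to both inputs.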


Please use https://github.com/harendra-kumar/unicode-transforms to raise issues or send pull requests.