unicode-collation: Haskell implementation of the Unicode Collation Algorithm

This library provides a pure Haskell implementation of the Unicode Collation Algorithm described at http://www.unicode.org/reports/tr10/. It is not as fully-featured or as performant as text-icu, but it avoids a dependency on a large C library. Locale-specific tailorings are also provided.

Modules

Text
- Text.Collate
  - Text.Collate.Lang
  - Text.Collate.Normalize

Flags

Automatic Flags

Name	Description	Default
doctests	Run doctests as part of test suite. Use with: `--write-ghc-environment-files=always`.	Disabled
executable	Build the unicode-collate executable.	Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Downloads

unicode-collation-0.1.3.6.tar.gz (Cabal source package)
Package description (revised from the package)

Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.

Versions	0.1, 0.1.1, 0.1.2, 0.1.3, 0.1.3.1, 0.1.3.2, 0.1.3.3, 0.1.3.4, 0.1.3.5, 0.1.3.6
Change log	CHANGELOG.md
Dependencies	base (>=4.11 && <4.22), binary, bytestring, containers, parsec, template-haskell, text (>=1.2 && <2.2), th-lift-instances, unicode-collation
Tested with	ghc ==8.4.4, ghc ==8.6.5, ghc ==8.8.3, ghc ==8.10.7, ghc ==9.0.1, ghc ==9.2.2, ghc ==9.4.2, ghc ==9.6.3, ghc ==9.8.1
License	BSD-2-Clause
Copyright	2021 John MacFarlane
Author	John MacFarlane
Maintainer	John MacFarlane <jgm@berkeley.edu>
Uploaded	by JohnMacFarlane at 2023-12-20T19:13:59Z
Revised	Revision 2 made by JohnMacFarlane at 2025-01-23T06:07:35Z
Category	Text
Home page	https://github.com/jgm/unicode-collation
Bug tracker	https://github.com/jgm/unicode-collation/issues
Source repo	head: git clone https://github.com/jgm/unicode-collation.git
Distributions	Arch:0.1.3.6, Fedora:0.1.3.6, LTSHaskell:0.1.3.6, NixOS:0.1.3.6, Stackage:0.1.3.6, openSUSE:0.1.3.6
Reverse Dependencies	4 direct, 175 indirect
Executables	unicode-collate
Downloads	24322 total (50 in the last 30 days)
Status	Docs available. Last success reported on 2024-01-06

Readme for unicode-collation-0.1.3.6

unicode-collation

Haskell implementation of the Unicode Collation Algorithm.

Motivation

Previously there was no way to do correct Unicode collation (sorting) in Haskell without depending on the C library ICU and the barely maintained Haskell wrapper text-icu. This library offers a pure Haskell solution.

Conformance

The library passes all UCA conformance tests.

Localized collations have not been tested as extensively.

Performance

As might be expected, this library is slower than text-icu, which wraps a heavily optimized C library. How much slower depends quite a bit on the input.

On a sample of ten thousand random Unicode strings, this library is slower than text-icu by a factor of about 3:

  sort a list of 10000 random Texts (en):
    5.9 ms ± 487 μs,  22 MB allocated, 899 KB copied
  sort same list with text-icu (en):
    2.1 ms ±  87 μs, 7.1 MB allocated, 148 KB copied

Performance is worse on a sample drawn from a smaller character set including predominantly composed accented letters, which must be decomposed as part of the algorithm:

  sort a list of 10000 Texts (composed latin) (en):
     12 ms ± 1.1 ms,  34 MB allocated, 910 KB copied
  sort same list with text-icu (en):
    2.3 ms ±  56 μs, 7.0 MB allocated, 146 KB copied

Much of the impact here comes from normalization (decomposition). If we use a pre-normalized sample and disable normalization in the collator, it's much faster:

  sort same list but pre-normalized (en-u-kk-false):
    5.4 ms ± 168 μs,  19 MB allocated, 909 KB copied
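
A minimal sketch of turning normalization off for input that is already normalized. It assumes the `collate`, `collatorFor`, and `setNormalization` functions exported by Text.Collate and the `IsString` instance for `Lang` (hence `OverloadedStrings`); check the Haddock documentation of your installed version before relying on these signatures. The names `enNoNormalize` and `sortPreNormalized` are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.List (sortBy)
import Data.Text (Text)
import Text.Collate (Collator, collate, collatorFor, setNormalization)

-- English collator with the normalization step switched off.
-- Only appropriate when every input is already normalized
-- (plain ASCII, or text that was decomposed to NFD beforehand).
enNoNormalize :: Collator
enNoNormalize = setNormalization False (collatorFor "en")

-- Sort texts that the caller guarantees are pre-normalized.
sortPreNormalized :: [Text] -> [Text]
sortPreNormalized = sortBy (collate enNoNormalize)
```

If unnormalized text is passed to such a collator, the resulting order may be wrong, so this is a trade of safety for speed.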

On plain ASCII, the slowdown is again about a factor of 3:

  sort a list of 10000 ASCII Texts (en):
    4.6 ms ± 405 μs,  17 MB allocated, 880 KB copied
  sort same list with text-icu (en):
    1.6 ms ± 114 μs, 6.2 MB allocated, 130 KB copied

Note that this library does incremental normalization, so when strings can mostly be distinguished on the basis of the first two characters, as in the first sample, the impact is much less. On the other hand, sorting is much slower on a sample of texts which differ only after the first 32 characters:

  sort a list of 10000 random Texts that agree in first 32 chars:
    116 ms ± 8.6 ms, 430 MB allocated, 710 KB copied
  sort same list with text-icu (en):
    3.2 ms ± 251 μs, 8.8 MB allocated, 222 KB copied
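
The effect of incremental processing can be illustrated with a small sketch that uses only base. This is not the library's code: `expensiveStep` is a hypothetical stand-in for normalization and collation-element lookup, and the point is only that a lazily produced key is consumed no further than the first position where two inputs differ.

```haskell
import Data.Char (ord, toLower)

-- Hypothetical stand-in for an expensive per-character step
-- (here it merely case-folds and takes the code point).
expensiveStep :: Char -> Int
expensiveStep = ord . toLower

-- The key is a lazy list: each element is computed only on demand.
lazyKey :: String -> [Int]
lazyKey = map expensiveStep

-- List comparison stops at the first differing element,
-- so the remainder of both inputs is never processed.
compareIncremental :: String -> String -> Ordering
compareIncremental a b = compare (lazyKey a) (lazyKey b)
```

With inputs that share a long common prefix, every element of that prefix has to be computed for each comparison, which is consistent with the slowdown shown above.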

However, in the special case where the texts are identical, the algorithm can be short-circuited entirely and sorting is very fast:

  sort a list of 10000 identical Texts (en):
    877 μs ±  54 μs, 462 KB allocated, 9.7 KB copied
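
One way such a short-circuit can work, sketched with base only. This illustrates the idea rather than the library's actual implementation; `withEqualityShortcut` and its `expensive` parameter are hypothetical names.

```haskell
-- Wrap a comparison so that identical inputs are detected up front.
-- `expensive` stands in for a full collation comparison. The wrapper
-- preserves its results as long as `expensive` returns EQ for identical
-- inputs, which any sensible comparison does.
withEqualityShortcut :: Eq a => (a -> a -> Ordering) -> a -> a -> Ordering
withEqualityShortcut expensive x y
  | x == y    = EQ              -- cheap equality test, no collation work
  | otherwise = expensive x y   -- fall back to the full comparison
```

For example, `sortBy (withEqualityShortcut cmp)` never calls `cmp` on a list whose elements are all identical.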

Localized collations

The following localized collations are available. For languages not listed here, the root collation is used.

af
ar
as
az
be
bn
ca
cs
cu
cy
da
de-AT-u-co-phonebk
de-u-co-phonebk
dsb
ee
eo
es
es-u-co-trad
et
fa
fi
fi-u-co-phonebk
fil
fo
fr-CA
gu
ha
haw
he
hi
hr
hu
hy
ig
is
ja
kk
kl
kn
ko
kok
lkt
ln
lt
lv
mk
ml
mr
mt
nb
nn
nso
om
or
pa
pl
ro
sa
se
si
si-u-co-dict
sk
sl
sq
sr
sv
sv-u-co-reformed
ta
te
th
tn
to
tr
ug-Cyrl
uk
ur
vi
vo
wae
wo
yo
zh
zh-u-co-big5han
zh-u-co-gb2312
zh-u-co-pinyin
zh-u-co-stroke
zh-u-co-zhuyin
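
As a minimal sketch of how a tailoring changes the result, the following compares the root collation with the Swedish (sv) tailoring from the list above. It assumes the `collate`, `rootCollator`, and `collatorFor` functions exported by Text.Collate and the `IsString` instance for `Lang` (hence `OverloadedStrings`); check the Haddock documentation of your installed version before relying on these signatures. The word list and the names `sortTexts`, `svCollator`, and `demo` are invented for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.List (sortBy)
import Data.Text (Text)
import qualified Data.Text.IO as T
import Text.Collate (Collator, collate, collatorFor, rootCollator)

-- Hypothetical sample data, not taken from the package.
sampleWords :: [Text]
sampleWords = ["zebra", "öl", "apple", "ost"]

-- Sort a list of texts with the given collator.
sortTexts :: Collator -> [Text] -> [Text]
sortTexts c = sortBy (collate c)

-- Collator for the Swedish tailoring.
svCollator :: Collator
svCollator = collatorFor "sv"

-- Print the sample once per collator so the orders can be compared.
demo :: IO ()
demo = do
  putStrLn "root:"
  mapM_ T.putStrLn (sortTexts rootCollator sampleWords)
  putStrLn "sv:"
  mapM_ T.putStrLn (sortTexts svCollator sampleWords)
```

In the root collation ö is ordered with o, so "öl" comes before "zebra"; under the Swedish tailoring ö is a separate letter that sorts after z.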

Collation reordering (e.g. [reorder Latn Kana Hani]) is not supported.

Data files

Version 13.0.0 of the Unicode data is used: http://www.unicode.org/Public/UCA/13.0.0/

Locale-specific tailorings are derived from the Perl module Unicode::Collate: https://cpan.metacpan.org/authors/id/S/SA/SADAHIRO/Unicode-Collate-1.29.tar.gz

Executable

The package includes an executable component, unicode-collate, which may be used for testing and for collating in scripts. To build it, enable the executable flag. For usage instructions, run unicode-collate --help.

References

Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML): http://www.unicode.org/reports/tr35/
Unicode Technical Standard #10: Unicode Collation Algorithm: https://www.unicode.org/reports/tr10
Unicode Standard Annex #15: Unicode Normalization Forms: https://unicode.org/reports/tr15/