text-metrics: Calculate various string metrics efficiently

[ algorithms, bsd3, library, text ] [ Propose Tags ]

Calculate various string metrics efficiently.

[Skip to Readme]

Modules

[Index]

Data
- Text
  - Data.Text.Metrics

Flags

Manual Flags

Name	Description	Default
dev	Turn on development settings.	Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag. More info

Downloads

text-metrics-0.2.0.tar.gz [browse] (Cabal source package)
Package description (revised from the package)

Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.

Maintainer's Corner

Package maintainers

mrkkrp

For package maintainers and hackage trustees

edit package information

Candidates

No Candidates

Versions [RSS]	0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.3.2
Change log	CHANGELOG.md
Dependencies	base (>=4.7 && <5.0), nats (>=1 && <2), text (>=0.2 && <1.3) [details]
License	BSD-3-Clause
Author	Mark Karpov <markkarpov@openmailbox.org>
Maintainer	Mark Karpov <markkarpov@openmailbox.org>
Revised	Revision 1 made by mrkkrp at 2017-05-26T15:33:36Z
Category	Text, Algorithms
Home page	https://github.com/mrkkrp/text-metrics
Bug tracker	https://github.com/mrkkrp/text-metrics/issues
Source repo	head: git clone https://github.com/mrkkrp/text-metrics.git
Uploaded	by mrkkrp at 2016-10-09T15:02:46Z
Distributions	Arch:0.3.2, Debian:0.3.0, Fedora:0.3.2, LTSHaskell:0.3.2, NixOS:0.3.2, Stackage:0.3.2
Reverse Dependencies	12 direct, 16 indirect [details]
Downloads	17073 total (119 in the last 30 days)
Rating	2.0 (votes: 1) [estimated by Bayesian average]
Your Rating	λ λ λ
Status	Docs available [build log] Last success reported on 2016-10-09 [all 1 reports]

Readme for text-metrics-0.2.0

[back to package description]

Text Metrics

The library provides efficient implementations of various strings metrics. It works with strict Text values and returns either Natural numbers (because the metrics cannot be negative), or Ratio Natural values because returned values are rational non-negative numbers by definition.

The functions provided here are the fastest implementations available for use in Haskell programs. In fact the functions are implemented in C for maximal efficiency, but this leads to a minor flaw. When we work with Text values in C, they are represented as UTF-16 encoded strings of two-byte values. The algorithms treat the strings as if a character corresponds to one element in such strings, which is true for almost all modern text data. However, there are characters that are represented by two adjoined elements in UTF-16: emoji, historic scripts, less used Chinese ideographs, and some more. If input Text of the functions contains such characters, the functions may return slightly incorrect result. Decide for yourself if this is acceptable for your use case, but chances are you will never run into situations when the functions produce incorrect results.

The current version of the package implements:

TODO list:

Comparison with the `edit-distance` package

There is edit-distance package whose scope overlaps with the scope of this package. The differences are:

edit-distance allows to specify costs for every operation when calculating Levenshtein distance (insertion, deletion, substitution, and transposition). This is rarely needed though in real-world applications, IMO.
edit-distance only provides single Levenshtein distance, text-metrics aims to provide implementations of most string metrics algorithms.
edit-distance works on Strings, while text-metrics works on strict Text values.
As README.md of edit-distance states, “[the algorithms] have been fairly heavily optimized”, which is apparently true, yet the text-metrics is faster for short strings (under 64 characters) and even faster for longer strings (scales better). How much faster? For short strings more than ×3, and about ×4 for longer strings.

Implementation

All “meat” of the algorithms is written in C in a rather straightforward way. Levenshtein variants are based on the “iterative algorithm with two matrix rows” from Wikipedia with additional improvement that we do not copy current row of distances into previous row, but just swap the pointers (which is OK, since the arrays have equal length and current row will be overwritten in the next iteration anyway).

Normalized versions are defined as thin (inlined) Haskell wrappers.

License

Distributed under BSD 3 clause license.

text-metrics: Calculate various string metrics efficiently

Modules

Flags

Manual Flags

Downloads

Maintainer's Corner

Readme for text-metrics-0.2.0

Text Metrics

Comparison with the edit-distance package

Implementation

License

Comparison with the `edit-distance` package