text-metrics-0.2.0: Calculate various string metrics efficiently

Copyright© 2016 Mark Karpov
LicenseBSD 3 clause
MaintainerMark Karpov <markkarpov@openmailbox.org>
Stabilityexperimental
Portabilityportable
Safe HaskellNone
LanguageHaskell2010

Data.Text.Metrics

Contents

Description

The module provides efficient implementations of various strings metrics. It works with strict Text values and returns either Natural numbers (because the metrics cannot be negative), or Ratio Natural values because returned values are rational non-negative numbers by definition.

The functions provided here are the fastest implementations available for use in Haskell programs. In fact the functions are implemented in C for maximal efficiency, but this leads to a minor flaw. When we work with Text values in C, they are represented as UTF-16 encoded strings of two-byte values. The algorithms treat the strings as if a character corresponds to one element in such strings, which is true for almost all modern text data. However, there are characters that are represented by two adjoined elements in UTF-16: emoji, historic scripts, less used Chinese ideographs, and some more. If input Text of the functions contains such characters, the functions may return slightly incorrect result. Decide for yourself if this is acceptable for your use case, but chances are you will never run into situations when the functions produce incorrect results.

Synopsis

Levenshtein variants

levenshtein :: Text -> Text -> Natural Source #

Return Levenshtein distance between two Text values. Classic Levenshtein distance between two strings is minimal number of operations necessary to transform one string into another. For Levenshtein distance allowed operations are: deletion, insertion, and substitution.

See also: https://en.wikipedia.org/wiki/Levenshtein_distance.

levenshteinNorm :: Text -> Text -> Ratio Natural Source #

Return normalized Levenshtein distance between two Text values. Result is a non-negative rational number (represented as Ratio Natural), where 0 signifies no similarity between the strings, while 1 means exact match. The operation is virtually as fast as levenshtein.

See also: https://en.wikipedia.org/wiki/Levenshtein_distance.

damerauLevenshtein :: Text -> Text -> Natural Source #

Return Damerau-Levenshtein distance between two Text values. The function works like levenshtein, but the collection of allowed operations also includes transposition of two adjacent characters. The function is about 20% slower than levenshtein, but still pretty fast.

See also: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance.

damerauLevenshteinNorm :: Text -> Text -> Ratio Natural Source #

Return normalized Damerau-Levenshtein distance between two Text values. Result is a non-negative rational number (represented as Ratio Natural), where 0 signifies no similarity between the strings, while 1 means exact match. The operation is virtually as fast as damerauLevenshtein.

See also: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance.

Other

hamming :: Text -> Text -> Maybe Natural Source #

O(n) Return Hamming distance between two Text values. Hamming distance is defined as number of positions at which the corresponding symbols are different. The input Text values should be of equal length or Nothing will be returned.

See also: https://en.wikipedia.org/wiki/Hamming_distance.

jaro :: Text -> Text -> Ratio Natural Source #

Return Jaro distance between two Text values. Returned value is in range from 0 (no similarity) to 1 (exact match).

While the algorithm is pretty clear for artificial examples (like those from the linked Wikipedia article), for arbitrary strings, it may be hard to decide which of two strings should be considered as one having “reference” order of characters (since order of matching characters in an essential part of the definition of the algorithm). This makes us consider the first string the “reference” string (with correct order of characters). Thus generally,

jaro a b ≠ jaro b a

This asymmetry can be found in all implementations of the algorithm on the internet, AFAIK.

See also: http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Since: 0.2.0

jaroWinkler :: Text -> Text -> Ratio Natural Source #

Return Jaro-Winkler distance between two Text values. Returned value is in range from 0 (no similarity) to 1 (exact match).

See also: http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Since: 0.2.0