Copyright | © 2016 Mark Karpov |
---|---|
License | BSD 3 clause |
Maintainer | Mark Karpov <markkarpov@openmailbox.org> |
Stability | experimental |
Portability | portable |
Safe Haskell | None |
Language | Haskell2010 |
The module provides efficient implementations of various strings metrics.
It works with strict Text
values and returns either Natural
numbers
(because the metrics cannot be negative), or
values
because returned values are rational non-negative numbers by definition.Ratio
Natural
The functions provided here are the fastest implementations available for
use in Haskell programs. In fact the functions are implemented in C for
maximal efficiency, but this leads to a minor flaw. When we work with
Text
values in C, they are represented as UTF-16 encoded strings of
two-byte values. The algorithms treat the strings as if a character
corresponds to one element in such strings, which is true for almost all
modern text data. However, there are characters that are represented by
two adjoined elements in UTF-16: emoji, historic scripts, less used
Chinese ideographs, and some more. If input Text
of the functions
contains such characters, the functions may return slightly incorrect
result. Decide for yourself if this is acceptable for your use case, but
chances are you will never run into situations when the functions produce
incorrect results.
- levenshtein :: Text -> Text -> Natural
- levenshteinNorm :: Text -> Text -> Ratio Natural
- damerauLevenshtein :: Text -> Text -> Natural
- damerauLevenshteinNorm :: Text -> Text -> Ratio Natural
- hamming :: Text -> Text -> Maybe Natural
- jaro :: Text -> Text -> Ratio Natural
- jaroWinkler :: Text -> Text -> Ratio Natural
Levenshtein variants
levenshtein :: Text -> Text -> Natural Source #
Return Levenshtein distance between two Text
values. Classic
Levenshtein distance between two strings is minimal number of operations
necessary to transform one string into another. For Levenshtein distance
allowed operations are: deletion, insertion, and substitution.
See also: https://en.wikipedia.org/wiki/Levenshtein_distance.
levenshteinNorm :: Text -> Text -> Ratio Natural Source #
Return normalized Levenshtein distance between two Text
values.
Result is a non-negative rational number (represented as
), where 0 signifies no similarity between the strings, while 1
means exact match. The operation is virtually as fast as Ratio
Natural
levenshtein
.
See also: https://en.wikipedia.org/wiki/Levenshtein_distance.
damerauLevenshtein :: Text -> Text -> Natural Source #
Return Damerau-Levenshtein distance between two Text
values. The
function works like levenshtein
, but the collection of allowed
operations also includes transposition of two adjacent characters. The
function is about 20% slower than levenshtein
, but still pretty fast.
See also: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance.
damerauLevenshteinNorm :: Text -> Text -> Ratio Natural Source #
Return normalized Damerau-Levenshtein distance between two Text
values. Result is a non-negative rational number (represented as
), where 0 signifies no similarity between the strings, while 1
means exact match. The operation is virtually as fast as
Ratio
Natural
damerauLevenshtein
.
See also: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance.
Other
jaro :: Text -> Text -> Ratio Natural Source #
Return Jaro distance between two Text
values. Returned value is in
range from 0 (no similarity) to 1 (exact match).
While the algorithm is pretty clear for artificial examples (like those from the linked Wikipedia article), for arbitrary strings, it may be hard to decide which of two strings should be considered as one having “reference” order of characters (since order of matching characters in an essential part of the definition of the algorithm). This makes us consider the first string the “reference” string (with correct order of characters). Thus generally,
jaro a b ≠ jaro b a
This asymmetry can be found in all implementations of the algorithm on the internet, AFAIK.
See also: http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
Since: 0.2.0
jaroWinkler :: Text -> Text -> Ratio Natural Source #
Return Jaro-Winkler distance between two Text
values. Returned value
is in range from 0 (no similarity) to 1 (exact match).
See also: http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
Since: 0.2.0