Copyright | © 2016–present Mark Karpov |
---|---|

License | BSD 3 clause |

Maintainer | Mark Karpov <markkarpov92@gmail.com> |

Stability | experimental |

Portability | portable |

Safe Haskell | None |

Language | Haskell2010 |

The module provides efficient implementations of various strings metric
algorithms. It works with strict `Text`

values.

**Note**: before version *0.3.0* the package used C implementations of
the algorithms under the hood. Beginning from version *0.3.0*, the
implementations are written in Haskell while staying almost as fast, see:

## Synopsis

- levenshtein :: Text -> Text -> Int
- levenshteinNorm :: Text -> Text -> Ratio Int
- damerauLevenshtein :: Text -> Text -> Int
- damerauLevenshteinNorm :: Text -> Text -> Ratio Int
- overlap :: Text -> Text -> Ratio Int
- jaccard :: Text -> Text -> Ratio Int
- hamming :: Text -> Text -> Maybe Int
- jaro :: Text -> Text -> Ratio Int
- jaroWinkler :: Text -> Text -> Ratio Int

# Levenshtein variants

levenshtein :: Text -> Text -> Int Source #

Return the Levenshtein distance between two `Text`

values. The
Levenshtein distance between two strings is the minimal number of
operations necessary to transform one string into another. For the
Levenshtein distance allowed operations are: deletion, insertion, and
substitution.

See also: https://en.wikipedia.org/wiki/Levenshtein_distance.

**Heads up**, before version *0.3.0* this function returned
`Natural`

.

levenshteinNorm :: Text -> Text -> Ratio Int Source #

Return the normalized Levenshtein distance between two `Text`

values.
Result is a non-negative rational number (represented as

), where 0 signifies no similarity between the
strings, while 1 means exact match.`Ratio`

`Natural`

See also: https://en.wikipedia.org/wiki/Levenshtein_distance.

**Heads up**, before version *0.3.0* this function returned

.`Ratio`

`Natural`

damerauLevenshtein :: Text -> Text -> Int Source #

Return the Damerau-Levenshtein distance between two `Text`

values. The
function works like `levenshtein`

, but the collection of allowed
operations also includes transposition of two *adjacent* characters.

See also: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance.

**Heads up**, before version *0.3.0* this function returned
`Natural`

.

damerauLevenshteinNorm :: Text -> Text -> Ratio Int Source #

Return the normalized Damerau-Levenshtein distance between two `Text`

values. 0 signifies no similarity between the strings, while 1 means
exact match.

See also: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance.

**Heads up**, before version *0.3.0* this function returned

.`Ratio`

`Natural`

# Treating inputs like sets

overlap :: Text -> Text -> Ratio Int Source #

Return the overlap coefficient for two `Text`

values. Returned value is
in the range from 0 (no similarity) to 1 (exact match). Return 1 if both
`Text`

values are empty.

See also: https://en.wikipedia.org/wiki/Overlap_coefficient.

*Since: 0.3.0*

jaccard :: Text -> Text -> Ratio Int Source #

Return the Jaccard similarity coefficient for two `Text`

values.
Returned value is in the range from 0 (no similarity) to 1 (exact match).
Return 1 if both

See also: https://en.wikipedia.org/wiki/Jaccard_index

*Since: 0.3.0*

# Other

hamming :: Text -> Text -> Maybe Int Source #

*O(n)* Return the Hamming distance between two `Text`

values. Hamming
distance is defined as the number of positions at which the corresponding
symbols are different. The input `Text`

values should be of equal length
or `Nothing`

will be returned.

See also: https://en.wikipedia.org/wiki/Hamming_distance.

**Heads up**, before version *0.3.0* this function returned

.`Maybe`

`Natural`

jaro :: Text -> Text -> Ratio Int Source #

Return the Jaro distance between two `Text`

values. Returned value is
in the range from 0 (no similarity) to 1 (exact match).

While the algorithm is pretty clear for artificial examples (like those
from the linked Wikipedia article), for *arbitrary* strings, it may be
hard to decide which of two strings should be considered as one having
“reference” order of characters (order of matching characters in an
essential part of the definition of the algorithm). This makes us
consider the first string the “reference” string (with correct order of
characters). Thus generally,

jaro a b ≠ jaro b a

This asymmetry can be found in all implementations of the algorithm on the internet, AFAIK.

See also: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

**Heads up**, before version *0.3.0* this function returned

.`Ratio`

`Natural`

*Since: 0.2.0*

jaroWinkler :: Text -> Text -> Ratio Int Source #

Return the Jaro-Winkler distance between two `Text`

values. Returned
value is in range from 0 (no similarity) to 1 (exact match).

See also: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

**Heads up**, before version *0.3.0* this function returned

.`Ratio`

`Natural`

*Since: 0.2.0*