fuzzyset-0.1.1: Fuzzy set for approximate string matching

Safe Haskell: None
Language: Haskell2010

Data.FuzzySet.Internal

Documentation

gramMap

Arguments

:: Text

An input string

-> Size

The gram size n, which must be at least 2

-> HashMap Text Int

A mapping from n-gram keys to the number of occurrences of the key in the list returned by grams (i.e., the list of all n-length substrings of the input enclosed in hyphens).

Normalize the input string, call grams on the normalized input, and then translate the result into a HashMap with the n-grams as keys and Int values corresponding to the number of occurrences of each key in the generated gram list.

>>> gramMap "xxxx" 2
fromList [("-x",1), ("xx",3), ("x-",1)]
>>> Data.HashMap.Strict.lookup "nts" (gramMap "intrent'srestaurantsomeoftrent'saunt'santswantsamtorentsomepants" 3)
Just 8
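
The counting step can be sketched without the library. This is a minimal illustration, not the package's implementation: `gramMapSketch` and `ngrams` are hypothetical names, `String` and `Data.Map.Strict` stand in for `Text` and `HashMap`, and the input is assumed to be already normalized.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical helper: all n-length substrings of the input enclosed in hyphens.
-- Assumes the input has already been normalized; it does not validate n.
ngrams :: String -> Int -> [String]
ngrams normalized n =
  [ take n (drop offset enclosed) | offset <- [0 .. length enclosed - n] ]
  where
    enclosed = "-" ++ normalized ++ "-"

-- Hypothetical stand-in for gramMap: count how often each gram occurs.
gramMapSketch :: String -> Int -> Map.Map String Int
gramMapSketch normalized n =
  Map.fromListWith (+) [ (gram, 1) | gram <- ngrams normalized n ]
```

With these definitions, `Map.toList (gramMapSketch "xxxx" 2)` evaluates to `[("-x",1),("x-",1),("xx",3)]`, the same counts as in the example above. A `Map` lists its keys in sorted order, whereas a `HashMap` gives no ordering guarantee.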

grams

Arguments

:: Text

An input string

-> Size

The gram size n, which must be at least 2

-> [Text]

A k-length list of grams of size n, with \(k = s - n + 3\), where s is the length of the normalized input

Break apart the normalized input string into a list of n-grams. For instance, the string "Destroido Corp." is first normalized into the form "destroido corp", and then enclosed in hyphens, so that it becomes "-destroido corp-". The 3-grams generated from this normalized string are

"-de", "des", "est", "str", "tro", "roi", "oid", "ido", "do ", "o c", " co", "cor", "orp", "rp-"

Given a normalized string of length s, we take all substrings of length n from the hyphen-enclosed string (which has length \(s + 2\)), letting the offset range from \(0\) to \(s + 2 - n\). The number of n-grams for a normalized string of length s is thus \(s + 2 - n + 1 = s - n + 3\), where \(2 \le n \le s + 2\).
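
The offset enumeration described above can be written out directly. This is a minimal sketch, assuming an already normalized `String` rather than `Text`. `gramsSketch` is a hypothetical name, and raising an error for n < 2 is an assumption about how the size constraint might be enforced.

```haskell
-- Hypothetical stand-in for grams, operating on an already normalized String.
gramsSketch :: String -> Int -> [String]
gramsSketch normalized n
  | n < 2     = error "gramsSketch: gram size must be at least 2"  -- assumed behaviour
  | otherwise = [ substringAt offset | offset <- [0 .. s + 2 - n] ]
  where
    s = length normalized
    enclosed = "-" ++ normalized ++ "-"  -- length s + 2
    substringAt offset = take n (drop offset enclosed)
```

For the normalized string "destroido corp" (s = 14) and n = 3, `length (gramsSketch "destroido corp" 3)` is 14, matching \(s - n + 3\). When n exceeds s + 2 the offset range is empty and the sketch returns `[]`.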