fuzzyset-0.2.3: Fuzzy set for approximate string matching
Safe Haskell: None
Language: Haskell2010

Data.FuzzySet.Internal

Documentation

(|>) :: a -> (a -> b) -> b infixl 1

An alternative spelling of the reverse function application operator (&), also known as the pipe operator.
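As a brief illustration of how such an operator reads in practice, here is a minimal sketch. The operator is redefined locally so the snippet stands alone; this local definition and the names sumOfEvens and sumOfEvens' are assumptions made for the example, not the module's source.

```haskell
import Data.Function ((&))

-- Hypothetical stand-in for the module's operator, redefined here so the
-- example is self-contained. Same type and fixity as documented above.
infixl 1 |>
(|>) :: a -> (a -> b) -> b
x |> f = f x

-- Reads left to right: take the list, keep the even numbers, then sum them.
sumOfEvens :: [Int] -> Int
sumOfEvens xs = xs |> filter even |> sum

-- The same pipeline written with (&) from Data.Function.
sumOfEvens' :: [Int] -> Int
sumOfEvens' xs = xs & filter even & sum
```

Because the operator is left-associative with low precedence, each stage receives the result of everything to its left, so no parentheses are needed.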

matches

Arguments

:: FuzzySet

The string set

-> HashMap Text Int

A sparse vector representation of the search string (generated by gramVector)

-> HashMap Int Int

A mapping from item index to the dot product between the corresponding entry of the set and the search string

These dot products are used to compute the cosine similarity, which is the score assigned to entries of the fuzzy set that match the search string.
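The arithmetic behind this can be sketched as follows. This is not the module's implementation: it uses Map String Int from containers in place of HashMap Text Int so that it runs with GHC's bundled packages, and the names GramVec, dot, norm and cosine are hypothetical.

```haskell
import qualified Data.Map.Strict as Map

-- A sparse vector: gram -> number of occurrences.
-- (Map String Int stands in for the module's HashMap Text Int.)
type GramVec = Map.Map String Int

-- Dot product: only grams present in both vectors contribute.
dot :: GramVec -> GramVec -> Int
dot a b = sum (Map.elems (Map.intersectionWith (*) a b))

-- Euclidean norm of a sparse vector.
norm :: GramVec -> Double
norm v = sqrt (fromIntegral (sum [c * c | c <- Map.elems v]))

-- Cosine similarity. Returning 0 when either vector is empty is a choice
-- made for this sketch.
cosine :: GramVec -> GramVec -> Double
cosine a b
  | na == 0 || nb == 0 = 0
  | otherwise = fromIntegral (dot a b) / (na * nb)
  where
    na = norm a
    nb = norm b
```

A vector compared with itself scores (up to rounding) 1, and two vectors with no grams in common score 0.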

getMatches

Arguments

:: FuzzySet

The string set

-> Text

A string to search for

-> Double

Minimum score

-> Int

The gram size n, which must be at least 2

-> [(Double, Text)]

A list of results (score and matched value)

This function performs the actual task of querying a set for matches, supported by the other functions in this module. See Implementation for an explanation.

gramVector

Arguments

:: Text

An input string

-> Int

The gram size n, which must be at least 2

-> HashMap Text Int

A sparse vector mapping each n-gram to the number of times it occurs in the normalized input string

Generate a list of n-grams (character substrings) from the normalized input, then translate this list into a dictionary whose keys are the n-grams and whose values are the number of occurrences of each n-gram in the list.

>>> gramVector "xxxx" 2
fromList [("-x",1), ("xx",3), ("x-",1)]

The substring "xx" appears three times in the normalized string:

>>> grams "xxxx" 2
["-x","xx","xx","xx","x-"]
>>> Data.HashMap.Strict.lookup "nts" (gramVector "intrent'srestaurantsomeoftrent'saunt'santswantsamtorentsomepants" 3)
Just 8
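The counting step on its own can be sketched as below, assuming the list of grams has already been produced. The function name countGrams is hypothetical, and Map String Int is used in place of HashMap Text Int to keep the snippet dependency-free.

```haskell
import qualified Data.Map.Strict as Map

-- Count how often each gram occurs in a list of grams.
-- fromListWith (+) adds up the 1s contributed by repeated keys.
countGrams :: [String] -> Map.Map String Int
countGrams gs = Map.fromListWith (+) [(g, 1) | g <- gs]
```

Applied to the bigram list shown above for "xxxx", this yields the counts 1, 3 and 1 for "-x", "xx" and "x-" respectively.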

grams

Arguments

:: Text

An input string

-> Int

The gram size n, which must be at least 2

-> [Text]

A list of n-grams

Break apart the input string into a list of n-grams. The string is first normalized and enclosed in hyphens. We then take all substrings of length n, letting the offset range from \(0\) to \(s + 2 - n\), where \(s\) is the length of the normalized input.

Example: The string "Destroido Corp." is first normalized to "destroido corp", and then enclosed in hyphens, so that it becomes "-destroido corp-". The trigrams generated from this normalized string are:

[ "-de"
, "des"
, "est"
, "str"
, "tro"
, "roi"
, "oid"
, "ido"
, "do "
, "o c"
, " co"
, "cor"
, "orp"
, "rp-"
]
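The procedure described above can be sketched on plain String values as follows. The names normalize and gramsSketch are hypothetical, and the normalization rule used here (lower-case the input and keep only alphanumeric characters and spaces) is an assumption chosen to be consistent with the examples on this page; the module's actual rule may differ.

```haskell
import Data.Char (isAlphaNum, toLower)
import Data.List (tails)

-- Assumed normalization: lower-case, keep only alphanumerics and spaces.
normalize :: String -> String
normalize = filter keep . map toLower
  where
    keep c = isAlphaNum c || c == ' '

-- All substrings of length n of the hyphen-enclosed, normalized input.
gramsSketch :: String -> Int -> [String]
gramsSketch input n
  -- Rejecting n < 2 with an error is a choice made for this sketch.
  | n < 2 = error "gram size must be at least 2"
  | otherwise = [take n t | t <- take count (tails padded)]
  where
    padded = "-" ++ normalize input ++ "-"
    -- Offsets 0 .. s + 2 - n give (s + 2 - n + 1) windows, where
    -- s + 2 is the length of the padded string. If the padded string
    -- is shorter than n, this sketch returns an empty list.
    count = max 0 (length padded - n + 1)
```

For "Destroido Corp." with n = 3 the padded string has 16 characters, so the offsets run from 0 to 13 and produce the 14 trigrams listed above.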