Safe Haskell	None
Language	Haskell2010

Data.FuzzySet.Internal

Synopsis

(|>) :: a -> (a -> b) -> b
matches :: FuzzySet -> HashMap Text Int -> HashMap Int Int
getMatches :: FuzzySet -> Text -> Double -> Int -> [(Double, Text)]
gramVector :: Text -> Int -> HashMap Text Int
grams :: Text -> Int -> [Text]

Documentation

(|>) :: a -> (a -> b) -> b infixl 1

An alternative spelling of the reverse function application operator (&), also known as the pipe operator.
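
As a small illustration of how such an operator reads in practice, the sketch below defines (|>) locally with the signature and fixity shown above and compares it with (&) from Data.Function. The one-line definition is an assumption based on the type, not a copy of this module's source.

```haskell
import Data.Function ((&))

-- Local definition mirroring the documented signature and fixity
-- (assumed implementation; the type leaves little room for anything else).
infixl 1 |>
(|>) :: a -> (a -> b) -> b
x |> f = f x

main :: IO ()
main = do
  -- Read left to right: start with the list, keep the even numbers,
  -- double them, then sum the result.
  print ([1 .. 10 :: Int] |> filter even |> map (* 2) |> sum)
  -- The same pipeline written with (&) from Data.Function.
  print ([1 .. 10 :: Int] & filter even & map (* 2) & sum)
```

Both lines print the same value, since the two operators differ only in name.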

matches

Arguments

:: FuzzySet	The string set
-> HashMap Text Int	A sparse vector representation of the search string (generated by `gramVector`)
-> HashMap Int Int	A mapping from item index to the dot product between the corresponding entry of the set and the search string

Computes the dot products that are used to obtain the cosine similarity, which is the similarity score assigned to entries of the fuzzy set that match the search string.
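
To make the relationship between dot products and cosine similarity concrete, here is a minimal sketch for two sparse gram vectors. The names SparseVector, dot, norm and cosine are hypothetical, and Data.Map.Strict with String keys stands in for the HashMap Text Int used by this module. Unlike matches, it handles a single pair of vectors and does not return a map keyed by item index.

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)

-- Hypothetical stand-in for the module's HashMap Text Int.
type SparseVector = Map String Int

-- Dot product: only grams present in both vectors contribute.
dot :: SparseVector -> SparseVector -> Int
dot a b = sum (Map.elems (Map.intersectionWith (*) a b))

-- Euclidean norm of a sparse vector.
norm :: SparseVector -> Double
norm v = sqrt (fromIntegral (sum [c * c | c <- Map.elems v]))

-- Cosine similarity: dot product divided by the product of the norms.
cosine :: SparseVector -> SparseVector -> Double
cosine a b
  | denom == 0 = 0
  | otherwise  = fromIntegral (dot a b) / denom
  where
    denom = norm a * norm b

main :: IO ()
main = do
  -- Bigram counts for "xxxx" and "xxx", each enclosed in hyphens.
  let query = Map.fromList [("-x", 1), ("xx", 3), ("x-", 1)]
      entry = Map.fromList [("-x", 1), ("xx", 2), ("x-", 1)]
  print (dot query entry)
  print (cosine query entry)
```

Here the dot product is 1·1 + 3·2 + 1·1 = 8, and the cosine similarity divides that by the product of the two vector norms.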

getMatches

Arguments

:: FuzzySet	The string set
-> Text	A string to search for
-> Double	Minimum score
-> Int	The gram size n, which must be at least 2
-> [(Double, Text)]	A list of results (score and matched value)

This function performs the actual task of querying a set for matches, supported by the other functions in this module. See Implementation for an explanation.

gramVector

Arguments

:: Text	An input string
-> Int	The gram size n, which must be at least 2
-> HashMap Text Int	A sparse vector with the number of times a substring occurs in the normalized input string

Generate a list of n-grams (character substrings) from the normalized input, then translate it into a dictionary whose keys are the n-grams and whose values are the number of occurrences of each n-gram in the list.

>>> gramVector "xxxx" 2
fromList [("-x",1), ("xx",3), ("x-",1)]

The substring "xx" appears three times in the normalized string:

>>> grams "xxxx" 2
["-x","xx","xx","xx","x-"]

>>> Data.HashMap.Strict.lookup "nts" (gramVector "intrent'srestaurantsomeoftrent'saunt'santswantsamtorentsomepants" 3)
Just 8
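
The counting step can be sketched on its own, starting from a list of grams that has already been generated. The name countGrams is hypothetical, and Data.Map.Strict is used here in place of HashMap, so the key order in the printed result may differ from what gramVector shows.

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)
import Data.Text (Text)
import qualified Data.Text as Text

-- Count how often each gram occurs by inserting every gram with a
-- count of 1 and adding the counts of duplicate keys.
countGrams :: [Text] -> Map Text Int
countGrams gs = Map.fromListWith (+) [(g, 1) | g <- gs]

main :: IO ()
main = do
  -- The bigram list for "xxxx" from the example above.
  let bigrams = map Text.pack ["-x", "xx", "xx", "xx", "x-"]
  print (countGrams bigrams)
  print (Map.lookup (Text.pack "xx") (countGrams bigrams))
```

The lookup returns Just 3 for "xx", in agreement with the gramVector example.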

grams

Arguments

:: Text	An input string
-> Int	The gram size n, which must be at least 2
-> [Text]	A list of n-grams

Break the input string into a list of n-grams. The string is first normalized and enclosed in hyphens. We then take all substrings of length n, letting the offset range from \(0\) to \(s + 2 - n\), where s is the length of the normalized input.

Example: The string "Destroido Corp." is first normalized to "destroido corp", and then enclosed in hyphens, so that it becomes "-destroido corp-". The trigrams generated from this normalized string are:

[ "-de"
, "des"
, "est"
, "str"
, "tro"
, "roi"
, "oid"
, "ido"
, "do "
, "o c"
, " co"
, "cor"
, "orp"
, "rp-"
]
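
The procedure described above can be sketched as follows. The names normalize and gramsSketch are hypothetical, and the normalization rule (lowercase, then keep only alphanumeric characters and spaces) is an assumption inferred from the "Destroido Corp." example; the library's actual rule may differ.

```haskell
import Data.Char (isAlphaNum)
import Data.Text (Text)
import qualified Data.Text as Text

-- Assumed normalization: lowercase the input, then drop everything
-- except alphanumeric characters and spaces.
normalize :: Text -> Text
normalize = Text.filter (\c -> isAlphaNum c || c == ' ') . Text.toLower

-- Enclose the normalized string in hyphens and take every substring of
-- length n. The enclosed string has length s + 2, so the offset ranges
-- from 0 to s + 2 - n.
gramsSketch :: Text -> Int -> [Text]
gramsSketch input n =
  [ Text.take n (Text.drop offset enclosed)
  | offset <- [0 .. Text.length enclosed - n]
  ]
  where
    enclosed = Text.cons '-' (Text.snoc (normalize input) '-')

main :: IO ()
main = mapM_ print (gramsSketch (Text.pack "Destroido Corp.") 3)
```

For "Destroido Corp." the enclosed string has 16 characters, so the offsets run from 0 to 13 and yield the 14 trigrams listed above.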