| Safe Haskell | None |
|---|---|
| Language | Haskell2010 |
Data.FuzzySet.Internal
Documentation
(|>) :: a -> (a -> b) -> b infixl 1
Alternative syntax for the reverse function application operator (&),
also known as the pipe operator.
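As an illustration of the fixity and the left-to-right reading order, here is a minimal sketch. The definition below is an assumption that only mirrors the signature shown above, and `pipelineExample` and `sameWithAmpersand` are hypothetical names used for demonstration.

```haskell
import Data.Function ((&))

infixl 1 |>

-- Reverse function application: x |> f is the same as f x.
-- (Sketch only; it mirrors the documented type and fixity.)
(|>) :: a -> (a -> b) -> b
x |> f = f x

-- Hypothetical pipeline: double every element, then add them up.
-- Because the operator is infixl 1, this parses as
-- (([1, 2, 3] |> map (* 2)) |> sum).
pipelineExample :: Int
pipelineExample = [1, 2, 3] |> map (* 2) |> sum

-- The same pipeline written with (&) from Data.Function.
sameWithAmpersand :: Int
sameWithAmpersand = [1, 2, 3] & map (* 2) & sum
```

Both expressions evaluate to 12, since the list is doubled to `[2, 4, 6]` before being summed.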
Arguments
| :: FuzzySet | The string set |
| -> HashMap Text Int | A sparse vector representation of the search string (generated by |
| -> HashMap Int Int | A mapping from item index to the dot product between the corresponding entry of the set and the search string |
Computes the dot products used in the cosine similarity, which is the similarity score assigned to entries of the fuzzy set that match the search string.
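To make the relationship between dot products and cosine similarity concrete, the following sketch computes both for two sparse gram vectors. It is not the library's implementation: it uses `Data.Map.Strict` with `String` keys as a stand-in for `HashMap Text Int`, and the names `SparseVector`, `dotProduct`, `norm` and `cosineSimilarity` are hypothetical.

```haskell
import qualified Data.Map.Strict as Map

-- Stand-in for the sparse vector type (assumption: Map String Int
-- instead of HashMap Text Int, to stay within GHC's bundled packages).
type SparseVector = Map.Map String Int

-- Only keys present in both vectors contribute to the dot product.
dotProduct :: SparseVector -> SparseVector -> Int
dotProduct u v = sum (Map.elems (Map.intersectionWith (*) u v))

-- Euclidean length of a sparse vector.
norm :: SparseVector -> Double
norm = sqrt . fromIntegral . sum . map (^ (2 :: Int)) . Map.elems

-- Cosine similarity: dot product divided by the product of the norms.
-- Returns 0 when either vector is empty, to avoid dividing by zero.
cosineSimilarity :: SparseVector -> SparseVector -> Double
cosineSimilarity u v
  | denominator == 0 = 0
  | otherwise        = fromIntegral (dotProduct u v) / denominator
  where
    denominator = norm u * norm v
```

Under these assumptions, a vector compared with itself scores approximately 1 (up to floating-point rounding), and two vectors with no n-grams in common score 0.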
Arguments
| :: FuzzySet | The string set |
| -> Text | A string to search for |
| -> Double | Minimum score |
| -> Int | The gram size n, which must be at least 2 |
| -> [(Double, Text)] | A list of results (score and matched value) |
This function performs the actual task of querying a set for matches, supported by the other functions in this module. See Implementation for an explanation.
Arguments
| :: Text | An input string |
| -> Int | The gram size n, which must be at least 2 |
| -> HashMap Text Int | A sparse vector with the number of times a substring occurs in the normalized input string |
Generate a list of n-grams (character substrings) from the normalized input and then translate it into a dictionary whose keys are the n-grams and whose values are the number of occurrences of each n-gram in the list.
>>> gramVector "xxxx" 2
fromList [("-x",1), ("xx",3), ("x-",1)]
The substring "xx" appears three times in the normalized string:
>>> grams "xxxx" 2
["-x","xx","xx","xx","x-"]
>>> Data.HashMap.Strict.lookup "nts" (gramVector "intrent'srestaurantsomeoftrent'saunt'santswantsamtorentsomepants" 3)
Just 8
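The counting step described above can be sketched as follows. This is an illustration rather than the library's code: `countGrams` is a hypothetical helper that takes an already generated list of n-grams, and it uses `Data.Map.Strict` with `String` keys in place of `HashMap Text Int`.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical helper: turn a list of n-grams into a sparse vector that
-- maps each distinct n-gram to the number of times it occurs in the list.
countGrams :: [String] -> Map.Map String Int
countGrams gs = Map.fromListWith (+) [ (g, 1) | g <- gs ]
```

Applied to the bigram list `["-x","xx","xx","xx","x-"]`, the sketch maps `"xx"` to 3 and each of `"-x"` and `"x-"` to 1, which agrees with the `gramVector` example.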
Arguments
| :: Text | An input string |
| -> Int | The gram size n, which must be at least 2 |
| -> [Text] | A list of n-grams |
Break apart the input string into a list of n-grams. The string is
first normalized and enclosed in hyphens. We then take
all substrings of length n, letting the offset range from
\(0 \text{ to } s + 2 - n\), where s is the length of the normalized input.
Example:
The string "Destroido Corp." is first normalized to "destroido corp",
and then enclosed in hyphens, so that it becomes "-destroido corp-". The
trigrams generated from this normalized string are:
[ "-de" , "des" , "est" , "str" , "tro" , "roi" , "oid" , "ido" , "do " , "o c" , " co" , "cor" , "orp" , "rp-" ]
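The procedure can be sketched in a few lines. Everything here is an approximation under stated assumptions: `normalizeSketch` is a hypothetical, simplified normalization (lower-casing and keeping only alphanumeric characters and spaces) that may differ from the library's actual rules, and `gramsSketch` works on `String` rather than `Text`.

```haskell
import Data.Char (isAlphaNum, isSpace, toLower)

-- Hypothetical, simplified normalization: lower-case the input and keep
-- only alphanumeric characters and spaces. The real rules may differ.
normalizeSketch :: String -> String
normalizeSketch = filter (\c -> isAlphaNum c || isSpace c) . map toLower

-- Illustrative stand-in for grams: normalize, enclose in hyphens, then
-- take every substring of length n at offsets 0 to s + 2 - n.
gramsSketch :: String -> Int -> [String]
gramsSketch input n
  | n < 2     = error "gramsSketch: the gram size must be at least 2"
  | otherwise = [ take n (drop offset padded) | offset <- [0 .. s + 2 - n] ]
  where
    normalized = normalizeSketch input
    s          = length normalized          -- length before padding
    padded     = "-" ++ normalized ++ "-"   -- length s + 2
```

With this sketch, `gramsSketch "Destroido Corp." 3` yields the 14 trigrams listed in the example, from `"-de"` through `"rp-"`.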