Copyright | (c) 2017-2019 Johannes Hildén |
---|---|

License | BSD3 |

Maintainer | hildenjohannes@gmail.com |

Stability | experimental |

Portability | GHC |

Safe Haskell | None |

Language | Haskell2010 |

A fuzzy string set data structure for approximate string matching. This implementation is based on the Python and JavaScript libraries with similar names; fuzzyset.js, and the original fuzzyset Python library.

## Synopsis

- data FuzzySet
- emptySet :: Int -> Int -> Bool -> FuzzySet
- defaultSet :: FuzzySet
- fromList :: [Text] -> FuzzySet
- add :: FuzzySet -> Text -> FuzzySet
- addToSet :: FuzzySet -> Text -> (FuzzySet, Bool)
- addMany :: FuzzySet -> [Text] -> FuzzySet
- get :: FuzzySet -> Text -> [(Double, Text)]
- getWithMinScore :: Double -> FuzzySet -> Text -> [(Double, Text)]
- getOne :: FuzzySet -> Text -> Maybe Text
- getOneWithMinScore :: Double -> FuzzySet -> Text -> Maybe Text
- size :: FuzzySet -> Int
- isEmpty :: FuzzySet -> Bool
- values :: FuzzySet -> [Text]

# How to use this library

Make sure the `OverloadedStrings`

pragma is enabled. After that, three steps
are typically involved:

- Create a set using one of
`defaultSet`

,`emptySet`

, or`fromList`

. - To add entries, use
`add`

,`addToSet`

, or`addMany`

. - Query the set with
`get`

,`getOne`

,`getWithMinScore`

, or`getOneWithMinScore`

.

`>>>`

Just "Terminator"`defaultSet `add` "Jurassic Park" `add` "Terminator" `add` "The Matrix" `getOne` "percolator"`

`>>>`

[(0.7692307692307693,"Shaggy Rogers"),(0.5,"Fred Jones")]`defaultSet `add` "Shaggy Rogers" `add` "Fred Jones" `add` "Daphne Blake" `add` "Velma Dinkley" `get` "Shaggy Jones"`

There are also a few functions to inspect sets: `size`

, `isEmpty`

, and `values`

.

## More examples

{-# LANGUAGE OverloadedStrings #-} module Main where import Data.FuzzySet states = [ "Alabama" , "Alaska" , "American Samoa" , "Arizona" , "Arkansas" , "California" , "Colorado" , "Connecticut" , "Delaware" , "District of Columbia" , "Florida" , "Georgia" , "Guam" , "Hawaii" , "Idaho" , "Illinois" , "Indiana" , "Iowa" , "Kansas" , "Kentucky" , "Louisiana" , "Maine" , "Maryland" , "Massachusetts" , "Michigan" , "Minnesota" , "Mississippi" , "Missouri" , "Montana" , "Nebraska" , "Nevada" , "New Hampshire" , "New Jersey" , "New Mexico" , "New York" , "North Carolina" , "North Dakota" , "Northern Marianas Islands" , "Ohio" , "Oklahoma" , "Oregon" , "Pennsylvania" , "Puerto Rico" , "Rhode Island" , "South Carolina" , "South Dakota" , "Tennessee" , "Texas" , "Utah" , "Vermont" , "Virginia" , "Virgin Islands" , "Washington" , "West Virginia" , "Wisconsin" , "Wyoming" ] statesSet = fromList states main = mapM_ print (get statesSet "Burger Islands")

The output of this program is:

(0.7142857142857143,"Virgin Islands") (0.5714285714285714,"Rhode Island") (0.44,"Northern Marianas Islands") (0.35714285714285715,"Maryland")

Using the definition of `statesSet`

from previous example:

>>> get statesSet "Why-oh-me-ing" [(0.5384615384615384,"Wyoming")]

>>> get statesSet "Connect a cat" [(0.6923076923076923,"Connecticut")]

>>> get statesSet "Transylvania" [(0.75,"Pennsylvania"),(0.3333333333333333,"California"),(0.3333333333333333,"Arkansas"),(0.3333333333333333,"Kansas")]

>>> get statesSet "CanOfSauce" [(0.4,"Kansas")]

>>> get statesSet "Alaska" [(1.0,"Alaska")]

>>> get statesSet "Alaskanbraskansas" [(0.47058823529411764,"Arkansas"),(0.35294117647058826,"Kansas"),(0.35294117647058826,"Alaska"),(0.35294117647058826,"Alabama"),(0.35294117647058826,"Nebraska")]

# Types

Main fuzzy string set data type. Use `emptySet`

,
`defaultSet`

, or `fromList`

to create sets.

# API

## Initializing

:: Int | Lower bound on gram sizes to use (inclusive) |

-> Int | Upper bound on gram sizes to use (inclusive) |

-> Bool | Whether or not to use the Levenshtein distance to determine the score |

-> FuzzySet | An empty fuzzy string set |

Initialize an empty `FuzzySet`

.

defaultSet :: FuzzySet Source #

An empty `FuzzySet`

with the following defaults:

- Gram size lower:
`2`

- Gram size upper:
`3`

- Use Levenshtein distance:
`True`

fromList :: [Text] -> FuzzySet Source #

Create a set from a list of entries, using the default settings.

fromList = addMany defaultSet

## Adding

Add an entry to the set, or do nothing if a key that matches the string already exists in the set.

:: FuzzySet | Fuzzy string set to add the entry to |

-> Text | The new entry |

-> (FuzzySet, Bool) | The updated set and a boolean, which will be |

Add an entry, unless it is already present in the set. A pair is returned with the new set and a boolean which denotes whether or not anything was inserted.

addMany :: FuzzySet -> [Text] -> FuzzySet Source #

Add a list of entries to the set, in one go.

addMany = foldr (flip add)

## Retrieving

:: FuzzySet | The fuzzy string set to compare the string against |

-> Text | The string to search for |

-> [(Double, Text)] | A list of results (score and matched value) |

Try to match the given string against the entries in the set, using a
minimum score of 0.33. Return a list of results ordered by similarity score,
with the closest match first. Use `getWithMinScore`

to specify a different
threshold value.

:: Double | A minimum score |

-> FuzzySet | The fuzzy string set to compare the string against |

-> Text | The string to search for |

-> [(Double, Text)] | A list of results (score and matched value) |

Try to match a string against the entries in the set, and return a list of all results with a score greater than or equal to the specified minimum score (i.e., the first argument). The results are ordered by similarity score, with the closest match first.

:: FuzzySet | The fuzzy string set to compare the string against |

-> Text | The string to search for |

-> Maybe Text | The closest match, if one is found |

Try to match the given string against the entries in the set, and return
the closest match, if one is found. A minimum score of 0.33 is used. To
specify a different threshold value, instead use `getOneWithMinScore`

.

:: Double | A minimum score |

-> FuzzySet | The fuzzy string set to compare the string against |

-> Text | The string to search for |

-> Maybe Text | The closest match, if one is found |

Try to match the given string against the entries in the set using the specified minimum score and return the closest match, if one is found.

## Inspecting

size :: FuzzySet -> Int Source #

Return the number of entries in the set.

`>>>`

2`size (defaultSet `add` "map" `add` "cap")`

`>>>`

1`size (defaultSet `add` "bork" `add` "bork" `add` "bork")`

isEmpty :: FuzzySet -> Bool Source #

Return a boolean indicating whether the set is empty.

`>>>`

True`isEmpty (fromList [])`

`>>>`

False`isEmpty $ fromList ["Aramis", "Porthos", "Athos"]`

values :: FuzzySet -> [Text] Source #

Return the elements of the set. No particular order is guaranteed.

`>>>`

["space","daze","bass","maze","ace","craze","lace","haze"]`values (fromList ["bass", "craze", "space", "lace", "daze", "haze", "ace", "maze"])`

# Implementation

To determine the similarity between entries of the set and the search string,
the algorithm translates the strings to vectors and then calculates a metric
known as the *cosine similarity* between these. A detailed explanation, with
interactive examples, can be found on the website for the
JavaScript version of this library.
A brief overview follows here.

## Cosine similarity

The cosine similarity of two vectors \(A\) and \(B\) is given by the formula

\[ \frac{A \cdot B}{||A||\ ||B||} \]

where \(A \cdot B\) is the dot product of the two vectors, and \(||A||\)
denotes the euclidean norm, or *magnitude*,
of \(A\). The cosine similarity is a measure of the (cosine of the) angle
between two vectors. Since we will only deal with vectors with non-negative
components, the result of this operation is always in the range \([0, 1]\).

## Gram vectors

The vector we are interested in has as its components the number of times
a gram (substring) occurs in the (normalized version of the) string. The
function `gramVector`

takes an arbitrary string as input and returns this
vector, in dictionary form:

`>>>`

fromList [("pi-",1),("ssi",2),("sis",1),("iss",2),("-mi",1),("mis",1),("sip",1),("ppi",1),("ipp",1)]`gramVector "Mississippi" 3`

This dictionary maps each *n*-gram key to to the number of times it occurs
in the string. The below table makes it more evident that this can be thought
of as a sparse vector.

Gram | `-mi` | `mis` | `iss` | `ssi` | `sis` | `sip` | `ipp` | `ppi` | `pi-` |

Count | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 |

## Lookup

The `FuzzySet`

data structure maintains a dictionary with all *n*-grams that
occur in the entries of the set for different sizes of grams.

"mis" => [ { itemIndex = 4, gramCount = 1 }, { itemIndex = 11, gramCount = 2 } ]

To compute the cosine similarity score, the function `getMatches`

queries the set for the grams that stem from the search string. Here is an
example: Let's say we have a set where the string `"coffee"`

appears, and
want to search for the string `"covfefe"`

. The two strings translate to the
following *bigram* vectors:

`>>>`

fromList [("e-",1),("ff",1),("of",1),("co",1),("ee",1),("fe",1),("-c",1)]`gramVector "coffee" 2`

`>>>`

fromList [("e-",1),("vf",1),("ef",1),("ov",1),("co",1),("fe",2),("-c",1)]`gramVector "covfefe" 2`

coffee | covfefe | ||||||||||||

`-c` | `co` | `of` | `ff` | `fe` | `ee` | `e-` | `-c` | `co` | `ov` | `vf` | `fe` | `e-` | `e-` |

1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 |

The non-zero entries common to both vectors are then:

`-c` | `co` | `fe` | `e-` | |

\(a_i\) | 1 | 1 | 1 | 1 |

\(b_i\) | 1 | 1 | 2 | 1 |

Dotting these we get \(1 \times 1 + 1 \times 1 + 1 \times 2 + 1 \times 1 = 5 \).
The function `matches`

computes these dot products and returns a dictionary
with the matched indices as keys. If the entry appears at item index, say,
3 in the set's internal list, this would yield a key-value pair `(3, 5)`

in
the map.

`>>>`

fromList [(2,2),(3,5)]`matches (defaultSet `add` "tea" `add` "biscuits" `add` "cake" `add` "coffee") (gramVector "covfefe" 2)`

We now have the numerators of the cosine similarity scores. The data structure keeps track of the magnitudes of the set entries, so we just need to look this quantity up using the index:

{vectorMagnitude = 2.6457513110645907, normalizedEntry = "coffee"}

Multiplying this by the magnitude of the search string's vector,

`>>>`

3.1622776601683795`norm $ elems $ gramVector "covfefe" 2`

… we get 8.366600265340756, which is the denominator of the expression. So the similarity score for this entry of the set is

`>>>`

0.5976143046671968`5/(3.1622776601683795 * 2.6457513110645907)`

And indeed, this is the score we get if the `FuzzySet`

is initialized with
the Levenshtein option set to `False`

:

`>>>`

[(0.5976143046671968,"coffee")]`(emptySet 2 3 False `add` "tea" `add` "biscuits" `add` "cake" `add` "coffee") `get` "covfefe"`

Note that the above procedure is repeated for each gram size (starting with the highest) in the selected range, until we either get some results, or all sizes have been exhausted.

Finally, if the set was initialized with the Levenshtein distance option
enabled (e.g., using `defaultSet`

), then only the first 50 results are kept
and a new score is computed based on the Levenshtein `distance`

.