fuzzyset-0.2.3: Fuzzy set for approximate string matching
Copyright(c) 2017-2019 Johannes Hildén
LicenseBSD3
Maintainerhildenjohannes@gmail.com
Stabilityexperimental
PortabilityGHC
Safe HaskellNone
LanguageHaskell2010

Data.FuzzySet

Description

A fuzzy string set data structure for approximate string matching. This implementation is based on the Python and JavaScript libraries with similar names; fuzzyset.js, and the original fuzzyset Python library.

Synopsis

How to use this library

Make sure the OverloadedStrings pragma is enabled. After that, three steps are typically involved:

  1. Create a set using one of defaultSet, emptySet, or fromList.
  2. To add entries, use add, addToSet, or addMany.
  3. Query the set with get, getOne, getWithMinScore, or getOneWithMinScore.
>>> defaultSet `add` "Jurassic Park" `add` "Terminator" `add` "The Matrix" `getOne` "percolator"
Just "Terminator"
>>> defaultSet `add` "Shaggy Rogers" `add` "Fred Jones" `add` "Daphne Blake" `add` "Velma Dinkley" `get` "Shaggy Jones"
[(0.7692307692307693,"Shaggy Rogers"),(0.5,"Fred Jones")]

There are also a few functions to inspect sets: size, isEmpty, and values.

More examples

{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.FuzzySet

states = [ "Alabama"        , "Alaska"         , "American Samoa"            , "Arizona"       , "Arkansas"
         , "California"     , "Colorado"       , "Connecticut"               , "Delaware"      , "District of Columbia"
         , "Florida"        , "Georgia"        , "Guam"                      , "Hawaii"        , "Idaho"
         , "Illinois"       , "Indiana"        , "Iowa"                      , "Kansas"        , "Kentucky"
         , "Louisiana"      , "Maine"          , "Maryland"                  , "Massachusetts" , "Michigan"
         , "Minnesota"      , "Mississippi"    , "Missouri"                  , "Montana"       , "Nebraska"
         , "Nevada"         , "New Hampshire"  , "New Jersey"                , "New Mexico"    , "New York"
         , "North Carolina" , "North Dakota"   , "Northern Marianas Islands" , "Ohio"          , "Oklahoma"
         , "Oregon"         , "Pennsylvania"   , "Puerto Rico"               , "Rhode Island"  , "South Carolina"
         , "South Dakota"   , "Tennessee"      , "Texas"                     , "Utah"          , "Vermont"
         , "Virginia"       , "Virgin Islands" , "Washington"                , "West Virginia" , "Wisconsin"
         , "Wyoming" ]

statesSet = fromList states

main = mapM_ print (get statesSet "Burger Islands")

The output of this program is:

(0.7142857142857143,"Virgin Islands")
(0.5714285714285714,"Rhode Island")
(0.44,"Northern Marianas Islands")
(0.35714285714285715,"Maryland")

Using the definition of statesSet from previous example:

>>> get statesSet "Why-oh-me-ing"
[(0.5384615384615384,"Wyoming")]
>>> get statesSet "Connect a cat"
[(0.6923076923076923,"Connecticut")]
>>> get statesSet "Transylvania"
[(0.75,"Pennsylvania"),(0.3333333333333333,"California"),(0.3333333333333333,"Arkansas"),(0.3333333333333333,"Kansas")]
>>> get statesSet "CanOfSauce"
[(0.4,"Kansas")]
>>> get statesSet "Alaska"
[(1.0,"Alaska")]
>>> get statesSet "Alaskanbraskansas"
[(0.47058823529411764,"Arkansas"),(0.35294117647058826,"Kansas"),(0.35294117647058826,"Alaska"),(0.35294117647058826,"Alabama"),(0.35294117647058826,"Nebraska")]

Types

data FuzzySet Source #

Main fuzzy string set data type. Use emptySet, defaultSet, or fromList to create sets.

Instances

Instances details
Eq FuzzySet Source # 
Instance details

Defined in Data.FuzzySet.Types

Show FuzzySet Source # 
Instance details

Defined in Data.FuzzySet.Types

Default FuzzySet Source #

See defaultSet.

Instance details

Defined in Data.FuzzySet

Methods

def :: FuzzySet #

API

Initializing

emptySet Source #

Arguments

:: Int

Lower bound on gram sizes to use (inclusive)

-> Int

Upper bound on gram sizes to use (inclusive)

-> Bool

Whether or not to use the Levenshtein distance to determine the score

-> FuzzySet

An empty fuzzy string set

Initialize an empty FuzzySet.

defaultSet :: FuzzySet Source #

An empty FuzzySet with the following defaults:

  • Gram size lower: 2
  • Gram size upper: 3
  • Use Levenshtein distance: True

fromList :: [Text] -> FuzzySet Source #

Create a set from a list of entries, using the default settings.

fromList = addMany defaultSet

Adding

add Source #

Arguments

:: FuzzySet

Set to add the string to

-> Text

The new entry

-> FuzzySet

An updated set

Add an entry to the set, or do nothing if a key that matches the string already exists in the set.

addToSet Source #

Arguments

:: FuzzySet

Fuzzy string set to add the entry to

-> Text

The new entry

-> (FuzzySet, Bool)

The updated set and a boolean, which will be True if, and only if, the value was not already in the set

Add an entry, unless it is already present in the set. A pair is returned with the new set and a boolean which denotes whether or not anything was inserted.

addMany :: FuzzySet -> [Text] -> FuzzySet Source #

Add a list of entries to the set, in one go.

addMany = foldr (flip add)

Retrieving

get Source #

Arguments

:: FuzzySet

The fuzzy string set to compare the string against

-> Text

The string to search for

-> [(Double, Text)]

A list of results (score and matched value)

Try to match the given string against the entries in the set, using a minimum score of 0.33. Return a list of results ordered by similarity score, with the closest match first. Use getWithMinScore to specify a different threshold value.

getWithMinScore Source #

Arguments

:: Double

A minimum score

-> FuzzySet

The fuzzy string set to compare the string against

-> Text

The string to search for

-> [(Double, Text)]

A list of results (score and matched value)

Try to match a string against the entries in the set, and return a list of all results with a score greater than or equal to the specified minimum score (i.e., the first argument). The results are ordered by similarity score, with the closest match first.

getOne Source #

Arguments

:: FuzzySet

The fuzzy string set to compare the string against

-> Text

The string to search for

-> Maybe Text

The closest match, if one is found

Try to match the given string against the entries in the set, and return the closest match, if one is found. A minimum score of 0.33 is used. To specify a different threshold value, instead use getOneWithMinScore.

getOneWithMinScore Source #

Arguments

:: Double

A minimum score

-> FuzzySet

The fuzzy string set to compare the string against

-> Text

The string to search for

-> Maybe Text

The closest match, if one is found

Try to match the given string against the entries in the set using the specified minimum score and return the closest match, if one is found.

Inspecting

size :: FuzzySet -> Int Source #

Return the number of entries in the set.

>>> size (defaultSet `add` "map" `add` "cap")
2
>>> size (defaultSet `add` "bork" `add` "bork" `add` "bork")
1

isEmpty :: FuzzySet -> Bool Source #

Return a boolean indicating whether the set is empty.

>>> isEmpty (fromList [])
True
>>> isEmpty $ fromList ["Aramis", "Porthos", "Athos"]
False

values :: FuzzySet -> [Text] Source #

Return the elements of the set. No particular order is guaranteed.

>>> values (fromList ["bass", "craze", "space", "lace", "daze", "haze", "ace", "maze"])
["space","daze","bass","maze","ace","craze","lace","haze"]

Implementation

To determine the similarity between entries of the set and the search string, the algorithm translates the strings to vectors and then calculates a metric known as the cosine similarity between these. A detailed explanation, with interactive examples, can be found on the website for the JavaScript version of this library. A brief overview follows here.

Cosine similarity

The cosine similarity of two vectors \(A\) and \(B\) is given by the formula

\[ \frac{A \cdot B}{||A||\ ||B||} \]

where \(A \cdot B\) is the dot product of the two vectors, and \(||A||\) denotes the euclidean norm, or magnitude, of \(A\). The cosine similarity is a measure of the (cosine of the) angle between two vectors. Since we will only deal with vectors with non-negative components, the result of this operation is always in the range \([0, 1]\).

Gram vectors

The vector we are interested in has as its components the number of times a gram (substring) occurs in the (normalized version of the) string. The function gramVector takes an arbitrary string as input and returns this vector, in dictionary form:

>>> gramVector "Mississippi" 3
fromList [("pi-",1),("ssi",2),("sis",1),("iss",2),("-mi",1),("mis",1),("sip",1),("ppi",1),("ipp",1)]

This dictionary maps each n-gram key to to the number of times it occurs in the string. The below table makes it more evident that this can be thought of as a sparse vector.

Gram-mimisissssisissipippppipi-
Count112211111

Lookup

The FuzzySet data structure maintains a dictionary with all n-grams that occur in the entries of the set for different sizes of grams.

"mis" => [ { itemIndex = 4, gramCount = 1 }, { itemIndex = 11, gramCount = 2 } ]

To compute the cosine similarity score, the function getMatches queries the set for the grams that stem from the search string. Here is an example: Let's say we have a set where the string "coffee" appears, and want to search for the string "covfefe". The two strings translate to the following bigram vectors:

>>> gramVector "coffee" 2
fromList [("e-",1),("ff",1),("of",1),("co",1),("ee",1),("fe",1),("-c",1)]
>>> gramVector "covfefe" 2
fromList [("e-",1),("vf",1),("ef",1),("ov",1),("co",1),("fe",2),("-c",1)]
coffeecovfefe
-ccooffffeeee--ccoovvffee-e-
11111111111211

The non-zero entries common to both vectors are then:

-ccofee-
\(a_i\)1111
\(b_i\)1121

Dotting these we get \(1 \times 1 + 1 \times 1 + 1 \times 2 + 1 \times 1 = 5 \). The function matches computes these dot products and returns a dictionary with the matched indices as keys. If the entry appears at item index, say, 3 in the set's internal list, this would yield a key-value pair (3, 5) in the map.

>>> matches (defaultSet `add` "tea" `add` "biscuits" `add` "cake" `add` "coffee") (gramVector "covfefe" 2)
fromList [(2,2),(3,5)]

We now have the numerators of the cosine similarity scores. The data structure keeps track of the magnitudes of the set entries, so we just need to look this quantity up using the index:

{vectorMagnitude = 2.6457513110645907, normalizedEntry = "coffee"}

Multiplying this by the magnitude of the search string's vector,

>>> norm $ elems $ gramVector "covfefe" 2
3.1622776601683795

… we get 8.366600265340756, which is the denominator of the expression. So the similarity score for this entry of the set is

>>> 5/(3.1622776601683795 * 2.6457513110645907)
0.5976143046671968

And indeed, this is the score we get if the FuzzySet is initialized with the Levenshtein option set to False:

>>> (emptySet 2 3 False `add` "tea" `add` "biscuits" `add` "cake" `add` "coffee") `get` "covfefe"
[(0.5976143046671968,"coffee")]

Note that the above procedure is repeated for each gram size (starting with the highest) in the selected range, until we either get some results, or all sizes have been exhausted.

Finally, if the set was initialized with the Levenshtein distance option enabled (e.g., using defaultSet), then only the first 50 results are kept and a new score is computed based on the Levenshtein distance.

Orphan instances

Default FuzzySet Source #

See defaultSet.

Instance details

Methods

def :: FuzzySet #