Copyright | (c) 2017 Johannes Hildén |
---|---|
License | BSD3 |
Maintainer | hildenjohannes@gmail.com |
Stability | experimental |
Portability | GHC |
Safe Haskell | None |
Language | Haskell2010 |
A fuzzy string set data structure for approximate string matching. This implementation is based on the Python and JavaScript libraries with the same name:
- data FuzzySet
- type Size = Int
- mkSet :: Size -> Size -> Bool -> FuzzySet
- defaultSet :: FuzzySet
- fromList :: [Text] -> FuzzySet
- add :: FuzzySet -> Text -> FuzzySet
- addToSet :: FuzzySet -> Text -> (FuzzySet, Bool)
- addMany :: FuzzySet -> [Text] -> FuzzySet
- get :: FuzzySet -> Text -> [(Double, Text)]
- getWithMinScore :: Double -> FuzzySet -> Text -> [(Double, Text)]
- getOne :: FuzzySet -> Text -> Maybe Text
- size :: FuzzySet -> Int
- isEmpty :: FuzzySet -> Bool
- values :: FuzzySet -> [Text]
How to use
Make sure the OverloadedStrings
pragma is enabled. Then there are just
three steps:
- Create a set using one of
defaultSet
,mkSet
, orfromList
. - To add entries, use
add
,addToSet
, oraddMany
. - Then query the set with
get
,getOne
, orgetWithMinScore
.
>>>
defaultSet `add` "Jurassic Park" `add` "Terminator" `add` "The Matrix" `getOne` "percolator"
Just "Terminator"
>>>
defaultSet `add` "Shaggy Rogers" `add` "Fred Jones" `add` "Daphne Blake" `add` "Velma Dinkley" `get` "Shaggy Jones"
[(0.7692307692307693,"Shaggy Rogers"),(0.5,"Fred Jones")]
There are also a few functions to inspect the set.
More examples
{-# LANGUAGE OverloadedStrings #-} module Main where import Data.FuzzySet states = [ "Alabama" , "Alaska" , "American Samoa" , "Arizona" , "Arkansas" , "California" , "Colorado" , "Connecticut" , "Delaware" , "District of Columbia" , "Florida" , "Georgia" , "Guam" , "Hawaii" , "Idaho" , "Illinois" , "Indiana" , "Iowa" , "Kansas" , "Kentucky" , "Louisiana" , "Maine" , "Maryland" , "Massachusetts" , "Michigan" , "Minnesota" , "Mississippi" , "Missouri" , "Montana" , "Nebraska" , "Nevada" , "New Hampshire" , "New Jersey" , "New Mexico" , "New York" , "North Carolina" , "North Dakota" , "Northern Marianas Islands" , "Ohio" , "Oklahoma" , "Oregon" , "Pennsylvania" , "Puerto Rico" , "Rhode Island" , "South Carolina" , "South Dakota" , "Tennessee" , "Texas" , "Utah" , "Vermont" , "Virginia" , "Virgin Islands" , "Washington" , "West Virginia" , "Wisconsin" , "Wyoming" ] statesSet = fromList states main = mapM_ print (get statesSet "Burger Islands")
The output of this program is:
(0.7142857142857143,"Virgin Islands") (0.5714285714285714,"Rhode Island") (0.44,"Northern Marianas Islands") (0.35714285714285715,"Maryland")
Using the definition of statesSet
from previous example:
>>> get statesSet "Why-oh-me-ing" [(0.5384615384615384,"Wyoming")]
>>> get statesSet "Connect a cat" [(0.6923076923076923,"Connecticut")]
>>> get statesSet "Transylvania" [(0.75,"Pennsylvania"),(0.3333333333333333,"California"),(0.3333333333333333,"Arkansas"),(0.3333333333333333,"Kansas")]
>>> get statesSet "CanOfSauce" [(0.4,"Kansas")]
>>> get statesSet "Alaska" [(1.0,"Alaska")]
>>> get statesSet "Alaskanbraskansas" [(0.47058823529411764,"Arkansas"),(0.35294117647058826,"Kansas"),(0.35294117647058826,"Alaska"),(0.35294117647058826,"Alabama"),(0.35294117647058826,"Nebraska")]
Types
Opaque fuzzy string set data type. Use defaultSet
,
mkSet
, or fromList
to create FuzzySet
s.
API
Initializing
:: Size | The lower bound of gram sizes to use (inclusive) |
-> Size | The upper bound of gram sizes to use (inclusive) |
-> Bool | Whether to use Levenshtein distance to determine the score |
-> FuzzySet | An empty fuzzy string set |
Initialize a FuzzySet
.
A FuzzySet
with the following field values:
{ gramSizeLower = 2 , gramSizeUpper = 3 , useLevenshtein = True , exactSet = ε , matchDict = ε , items = ε }
fromList :: [Text] -> FuzzySet Source
Create a fuzzy string set with entries from the given list.
fromList = addMany defaultSet
Adding
Add an entry to the set, or do nothing if a key identical to the provided value already exists in the set.
:: FuzzySet | Fuzzy string set to add the entry to |
-> Text | The new entry |
-> (FuzzySet, Bool) | The updated set and a boolean, which will be |
Add an entry to the set and return a pair with the new set, and a boolean to indicate if a new entry was inserted, or not.
:: FuzzySet | Fuzzy string set to add the entries to |
-> [Text] | A list of new entries |
-> FuzzySet | A new fuzzy string set |
Add a list of entries to the set, in one go.
addMany = foldr (flip add)
Retrieving
:: FuzzySet | The fuzzy string set to compare the string against |
-> Text | The lookup query |
-> [(Double, Text)] | A list of results (score and matched value pairs) |
Try to match the given string against the entries in the set, using a minimum score of 0.33. Return a list of results ordered by similarity score, with the closest match first.
:: Double | A minimum score |
-> FuzzySet | The fuzzy string set to compare the string against |
-> Text | The lookup query |
-> [(Double, Text)] | A list of results (score and matched value pairs) |
Try to match the given string against the entries in the set, and return a list of all results with a score greater than or equal to the specified minimum score (i.e., the first argument). The results are ordered by similarity score, with the closest match first.
:: FuzzySet | The fuzzy string set to compare the string against |
-> Text | The lookup query |
-> Maybe Text |
|
Try to match the given string against the entries in the set, and return the closest match, if one is found.
Inspecting
size :: FuzzySet -> Int Source
Return the number of entries in the set.
>>>
size (defaultSet `add` "map" `add` "cap")
2