Copyright | (c) 2017 Johannes Hildén |
---|---|
License | BSD3 |
Maintainer | hildenjohannes@gmail.com |
Stability | experimental |
Portability | GHC |
Safe Haskell | None |
Language | Haskell2010 |
A fuzzy string set data structure for approximate string matching. This implementation is based on the Python and JavaScript libraries with the same name:
- data FuzzySet
- type Size = Int
- mkSet :: Size -> Size -> Bool -> FuzzySet
- defaultSet :: FuzzySet
- fromList :: [Text] -> FuzzySet
- add :: FuzzySet -> Text -> FuzzySet
- addToSet :: FuzzySet -> Text -> (FuzzySet, Bool)
- addMany :: FuzzySet -> [Text] -> FuzzySet
- get :: FuzzySet -> Text -> [(Double, Text)]
- getWithMinScore :: Double -> FuzzySet -> Text -> [(Double, Text)]
- getOne :: FuzzySet -> Text -> Maybe Text
- size :: FuzzySet -> Int
- isEmpty :: FuzzySet -> Bool
- values :: FuzzySet -> [Text]
How to use
Make sure the OverloadedStrings
pragma is enabled. Then there are just
three steps:
- Create a set using one of
defaultSet
,mkSet
, orfromList
. - To add entries, use
add
,addToSet
, oraddMany
. - Then query the set with
get
,getOne
, orgetWithMinScore
.
>>>
defaultSet `add` "Jurassic Park" `add` "Terminator" `add` "The Matrix" `getOne` "percolator"
Just "Terminator"
>>>
defaultSet `add` "Shaggy Rogers" `add` "Fred Jones" `add` "Daphne Blake" `add` "Velma Dinkley" `get` "Shaggy Jones"
[(0.7692307692307693,"Shaggy Rogers"),(0.5,"Fred Jones")]
There are also a few functions to inspect the set.
More examples
{-# LANGUAGE OverloadedStrings #-} module Main where import Data.FuzzySet states = [ "Alabama" , "Alaska" , "American Samoa" , "Arizona" , "Arkansas" , "California" , "Colorado" , "Connecticut" , "Delaware" , "District of Columbia" , "Florida" , "Georgia" , "Guam" , "Hawaii" , "Idaho" , "Illinois" , "Indiana" , "Iowa" , "Kansas" , "Kentucky" , "Louisiana" , "Maine" , "Maryland" , "Massachusetts" , "Michigan" , "Minnesota" , "Mississippi" , "Missouri" , "Montana" , "Nebraska" , "Nevada" , "New Hampshire" , "New Jersey" , "New Mexico" , "New York" , "North Carolina" , "North Dakota" , "Northern Marianas Islands" , "Ohio" , "Oklahoma" , "Oregon" , "Pennsylvania" , "Puerto Rico" , "Rhode Island" , "South Carolina" , "South Dakota" , "Tennessee" , "Texas" , "Utah" , "Vermont" , "Virginia" , "Virgin Islands" , "Washington" , "West Virginia" , "Wisconsin" , "Wyoming" ] statesSet = fromList states main = mapM_ print (get statesSet "Burger Islands")
The output of this program is:
(0.7142857142857143,"Virgin Islands") (0.5714285714285714,"Rhode Island") (0.44,"Northern Marianas Islands") (0.35714285714285715,"Maryland")
Using the definition of statesSet
from previous example:
>>> get statesSet "Why-oh-me-ing" [(0.5384615384615384,"Wyoming")]
>>> get statesSet "Connect a cat" [(0.6923076923076923,"Connecticut")]
>>> get statesSet "Transylvania" [(0.75,"Pennsylvania"),(0.3333333333333333,"California"),(0.3333333333333333,"Arkansas"),(0.3333333333333333,"Kansas")]
>>> get statesSet "CanOfSauce" [(0.4,"Kansas")]
>>> get statesSet "Alaska" [(1.0,"Alaska")]
>>> get statesSet "Alaskanbraskansas" [(0.47058823529411764,"Arkansas"),(0.35294117647058826,"Kansas"),(0.35294117647058826,"Alaska"),(0.35294117647058826,"Alabama"),(0.35294117647058826,"Nebraska")]
Types
Opaque fuzzy string set data type. Use defaultSet
,
mkSet
, or fromList
to create FuzzySet
s.
API
Initializing
:: Size | The lower bound of gram sizes to use (inclusive) |
-> Size | The upper bound of gram sizes to use (inclusive) |
-> Bool | Whether to use Levenshtein distance to determine the score |
-> FuzzySet | An empty fuzzy string set |
Initialize a FuzzySet
.
defaultSet :: FuzzySet Source #
A FuzzySet
with the following field values:
{ gramSizeLower = 2 , gramSizeUpper = 3 , useLevenshtein = True , exactSet = ε , matchDict = ε , items = ε }
fromList :: [Text] -> FuzzySet Source #
Create a fuzzy string set with entries from the given list.
fromList = addMany defaultSet
Adding
Add an entry to the set, or do nothing if a key identical to the provided value already exists in the set.
:: FuzzySet | Fuzzy string set to add the entry to |
-> Text | The new entry |
-> (FuzzySet, Bool) | The updated set and a boolean, which will be |
Add an entry to the set and return a pair with the new set, and a boolean to indicate if a new entry was inserted, or not.
:: FuzzySet | Fuzzy string set to add the entries to |
-> [Text] | A list of new entries |
-> FuzzySet | A new fuzzy string set |
Add a list of entries to the set, in one go.
addMany = foldr (flip add)
Retrieving
:: FuzzySet | The fuzzy string set to compare the string against |
-> Text | The lookup query |
-> [(Double, Text)] | A list of results (score and matched value pairs) |
Try to match the given string against the entries in the set, using a minimum score of 0.33. Return a list of results ordered by similarity score, with the closest match first.
:: Double | A minimum score |
-> FuzzySet | The fuzzy string set to compare the string against |
-> Text | The lookup query |
-> [(Double, Text)] | A list of results (score and matched value pairs) |
Try to match the given string against the entries in the set, and return a list of all results with a score greater than or equal to the specified minimum score (i.e., the first argument). The results are ordered by similarity score, with the closest match first.
Try to match the given string against the entries in the set, and return the closest match, if one is found.
Inspecting
size :: FuzzySet -> Int Source #
Return the number of entries in the set.
>>>
size (defaultSet `add` "map" `add` "cap")
2