| Copyright | (c) 2017 Johannes Hildén |
|---|---|
| License | BSD3 |
| Maintainer | hildenjohannes@gmail.com |
| Stability | experimental |
| Portability | GHC |
| Safe Haskell | None |
| Language | Haskell2010 |
Data.FuzzySet
Description
A fuzzy string set data structure for approximate string matching. This implementation is based on the Python and JavaScript libraries with the same name:
- data FuzzySet
- type Size = Int
- mkSet :: Size -> Size -> Bool -> FuzzySet
- defaultSet :: FuzzySet
- fromList :: [Text] -> FuzzySet
- add :: FuzzySet -> Text -> FuzzySet
- addToSet :: FuzzySet -> Text -> (FuzzySet, Bool)
- addMany :: FuzzySet -> [Text] -> FuzzySet
- get :: FuzzySet -> Text -> [(Double, Text)]
- getWithMinScore :: Double -> FuzzySet -> Text -> [(Double, Text)]
- getOne :: FuzzySet -> Text -> Maybe Text
- size :: FuzzySet -> Int
- isEmpty :: FuzzySet -> Bool
- values :: FuzzySet -> [Text]
How to use
Make sure the OverloadedStrings pragma is enabled. Then there are just
three steps:
- Create a set using one of
defaultSet,mkSet, orfromList. - To add entries, use
add,addToSet, oraddMany. - Then query the set with
get,getOne, orgetWithMinScore.
>>>defaultSet `add` "Jurassic Park" `add` "Terminator" `add` "The Matrix" `getOne` "percolator"Just "Terminator"
>>>defaultSet `add` "Shaggy Rogers" `add` "Fred Jones" `add` "Daphne Blake" `add` "Velma Dinkley" `get` "Shaggy Jones"[(0.7692307692307693,"Shaggy Rogers"),(0.5,"Fred Jones")]
There are also a few functions to inspect the set.
More examples
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.FuzzySet
states = [ "Alabama" , "Alaska" , "American Samoa" , "Arizona" , "Arkansas"
, "California" , "Colorado" , "Connecticut" , "Delaware" , "District of Columbia"
, "Florida" , "Georgia" , "Guam" , "Hawaii" , "Idaho"
, "Illinois" , "Indiana" , "Iowa" , "Kansas" , "Kentucky"
, "Louisiana" , "Maine" , "Maryland" , "Massachusetts" , "Michigan"
, "Minnesota" , "Mississippi" , "Missouri" , "Montana" , "Nebraska"
, "Nevada" , "New Hampshire" , "New Jersey" , "New Mexico" , "New York"
, "North Carolina" , "North Dakota" , "Northern Marianas Islands" , "Ohio" , "Oklahoma"
, "Oregon" , "Pennsylvania" , "Puerto Rico" , "Rhode Island" , "South Carolina"
, "South Dakota" , "Tennessee" , "Texas" , "Utah" , "Vermont"
, "Virginia" , "Virgin Islands" , "Washington" , "West Virginia" , "Wisconsin"
, "Wyoming" ]
statesSet = fromList states
main = mapM_ print (get statesSet "Burger Islands")The output of this program is:
(0.7142857142857143,"Virgin Islands") (0.5714285714285714,"Rhode Island") (0.44,"Northern Marianas Islands") (0.35714285714285715,"Maryland")
Using the definition of statesSet from previous example:
>>> get statesSet "Why-oh-me-ing" [(0.5384615384615384,"Wyoming")]
>>> get statesSet "Connect a cat" [(0.6923076923076923,"Connecticut")]
>>> get statesSet "Transylvania" [(0.75,"Pennsylvania"),(0.3333333333333333,"California"),(0.3333333333333333,"Arkansas"),(0.3333333333333333,"Kansas")]
>>> get statesSet "CanOfSauce" [(0.4,"Kansas")]
>>> get statesSet "Alaska" [(1.0,"Alaska")]
>>> get statesSet "Alaskanbraskansas" [(0.47058823529411764,"Arkansas"),(0.35294117647058826,"Kansas"),(0.35294117647058826,"Alaska"),(0.35294117647058826,"Alabama"),(0.35294117647058826,"Nebraska")]
Types
Opaque fuzzy string set data type. Use defaultSet,
mkSet, or fromList to create FuzzySets.
API
Initializing
Arguments
| :: Size | The lower bound of gram sizes to use (inclusive) |
| -> Size | The upper bound of gram sizes to use (inclusive) |
| -> Bool | Whether to use Levenshtein distance to determine the score |
| -> FuzzySet | An empty fuzzy string set |
Initialize a FuzzySet.
A FuzzySet with the following field values:
{ gramSizeLower = 2
, gramSizeUpper = 3
, useLevenshtein = True
, exactSet = ε
, matchDict = ε
, items = ε }fromList :: [Text] -> FuzzySet Source
Create a fuzzy string set with entries from the given list.
fromList = addMany defaultSet
Adding
Arguments
| :: FuzzySet | Fuzzy string set to add the entry to |
| -> Text | The new entry |
| -> FuzzySet | The updated set |
Add an entry to the set, or do nothing if a key identical to the provided value already exists in the set.
Arguments
| :: FuzzySet | Fuzzy string set to add the entry to |
| -> Text | The new entry |
| -> (FuzzySet, Bool) | The updated set and a boolean, which will be |
Add an entry to the set and return a pair with the new set, and a boolean to indicate if a new entry was inserted, or not.
Arguments
| :: FuzzySet | Fuzzy string set to add the entries to |
| -> [Text] | A list of new entries |
| -> FuzzySet | A new fuzzy string set |
Add a list of entries to the set, in one go.
addMany = foldr (flip add)
Retrieving
Arguments
| :: FuzzySet | The fuzzy string set to compare the string against |
| -> Text | The lookup query |
| -> [(Double, Text)] | A list of results (score and matched value pairs) |
Try to match the given string against the entries in the set, using a minimum score of 0.33. Return a list of results ordered by similarity score, with the closest match first.
Arguments
| :: Double | A minimum score |
| -> FuzzySet | The fuzzy string set to compare the string against |
| -> Text | The lookup query |
| -> [(Double, Text)] | A list of results (score and matched value pairs) |
Try to match the given string against the entries in the set, and return a list of all results with a score greater than or equal to the specified minimum score (i.e., the first argument). The results are ordered by similarity score, with the closest match first.
Arguments
| :: FuzzySet | The fuzzy string set to compare the string against |
| -> Text | The lookup query |
| -> Maybe Text |
|
Try to match the given string against the entries in the set, and return the closest match, if one is found.
Inspecting
size :: FuzzySet -> Int Source
Return the number of entries in the set.
>>>size (defaultSet `add` "map" `add` "cap")2