Safe Haskell | None |
---|---|
Language | Haskell2010 |
Random projection trees for approximate nearest neighbor search in high-dimensional vector spaces.
Introduction
Similarity search is a common problem in many fields (imaging, natural language processing, ..), and is often one building block of a larger data processing system.
There are many ways to embed data in a vector space such that similarity search can be recast as a geometrical nearest neighbor lookup.
In turn, the efficiency and effectiveness of querying such a vector database strongly depends on how internally the data index is represented, graphs and trees being two common approaches.
The naive, all-pairs exact search becomes impractical even at moderate data sizes, which motivated research into approximate indexing methods.
Overview
This library provides a tree-based approach to approximate nearest neighbor search. The database is recursively partitioned according to a series of random projections, and this partitioning is logically arranged as a tree which allows for rapid lookup.
Internally, a single random projection vector is sampled per tree level, as proposed in [1]. The projection vectors in turn can be sparse with a tunable sparsity parameter, which can help compressing the database at a small accuracy cost.
Retrieval accuracy can be improved by populating multiple trees (i.e. a random forest), and intersecting the results of the same query against each of them.
Quick Start
1) Build an index with forest
2) Lookup the \(k\) nearest neighbors to a query point with knn
3) The database can be serialised and restored with serialiseRPForest
and deserialiseRPForest
, respectively.
References
1) Hyvonen, V., et al., Fast Nearest Neighbor Search through Sparse Random Projections and Voting, https://www.cs.helsinki.fi/u/ttonteri/pub/bigdata2016.pdf
Synopsis
- tree :: (Monad m, Inner SVector v) => Word64 -> Int -> Int -> Int -> Double -> Int -> ConduitT () (Embed v Double x) m () -> m (RPTree Double () (Vector (Embed v Double x)))
- forest :: (Monad m, Inner SVector v) => Word64 -> Int -> Int -> Int -> Int -> Double -> Int -> ConduitT () (Embed v Double x) m () -> m (RPForest Double (Vector (Embed v Double x)))
- rpTreeCfg :: Integral a => a -> Int -> RPTreeConfig
- data RPTreeConfig = RPCfg {}
- knn :: (Ord p, Inner SVector v, Unbox d, Real d) => (u d -> v d -> p) -> Int -> RPForest d (Vector (Embed u d x)) -> v d -> Vector (p, Embed u d x)
- serialiseRPForest :: (Serialise d, Serialise a, Unbox d) => RPForest d a -> [ByteString]
- deserialiseRPForest :: (Serialise d, Serialise a, Unbox d) => [ByteString] -> Either String (RPForest d a)
- recallWith :: (Inner SVector v, Unbox d, Fractional a1, Ord d, Ord a2, Ord x, Ord (u d), Num d) => (u d -> v d -> a2) -> RPForest d (Vector (Embed u d x)) -> Int -> v d -> a1
- leaves :: RPTree d l a -> [a]
- levels :: RPTree d l a -> Int
- points :: Monoid m => RPTree d l m -> m
- candidates :: (Inner SVector v, Unbox d, Ord d, Num d, Semigroup xs) => RPTree d l xs -> v d -> xs
- treeStats :: RPTree d l a -> RPTreeStats
- treeSize :: Foldable t => RPTree d l (t a) -> Int
- leafSizes :: Foldable t => RPTree d l (t a) -> RPT d l Int
- data RPTreeStats
- data Embed v e a = Embed {}
- data RPTree d l a
- type RPForest d a = IntMap (RPTree d () a)
- data SVector a
- fromListSv :: Unbox a => Int -> [(Int, a)] -> SVector a
- fromVectorSv :: Int -> Vector (Int, a) -> SVector a
- data DVector a
- fromListDv :: Unbox a => [a] -> DVector a
- fromVectorDv :: Vector a -> DVector a
- class (Scale u, Scale v) => Inner u v where
- class Scale v where
- innerSS :: (Vector u (Int, a), Vector v (Int, a), Unbox a, Num a) => u (Int, a) -> v (Int, a) -> a
- innerSD :: (Num a, Vector u (Int, a), Vector v a, Unbox a) => u (Int, a) -> v a -> a
- innerDD :: (Vector v a, Num a) => v a -> v a -> a
- metricSSL2 :: (Floating a, Vector u a, Unbox a, Vector u (Int, a), Vector v (Int, a)) => u (Int, a) -> v (Int, a) -> a
- metricSDL2 :: (Floating a, Unbox a, Vector u (Int, a), Vector v a) => u (Int, a) -> v a -> a
- scaleS :: (Vector v (a, b), Num b) => b -> v (a, b) -> v (a, b)
- scaleD :: (Vector v b, Num b) => b -> v b -> v b
- writeCsv :: (Foldable t, Unbox a, Show a, Show b) => FilePath -> t (Vector (DVector a, b)) -> IO ()
- writeDot :: Ord t => (t -> String) -> FilePath -> String -> RPTree d x t -> IO ()
- data BenchConfig = BenchConfig {
- bcDescription :: String
- bcMaxTreeDepth :: Int
- bcMinLeafSize :: Int
- bcNumTrees :: Int
- bcChunkSize :: Int
- bcNZDensity :: Double
- bcVectorDim :: Int
- bcDataSize :: Int
- bcNumQueryPoints :: Int
- normalSparse2 :: Monad m => Double -> Int -> GenT m (SVector Double)
- liftC :: (Monad m, MonadTrans t) => ConduitT i o m r -> ConduitT i o (t m) r
- randSeed :: MonadIO m => m Word64
- dataSource :: Monad m => Int -> GenT m a -> ConduitT i a (GenT m) ()
- datS :: Monad m => Int -> Int -> Double -> ConduitT i (SVector Double) (GenT m) ()
- datD :: Monad m => Int -> Int -> ConduitT i (DVector Double) (GenT m) ()
- sparse :: (Monad m, Unbox a) => Double -> Int -> GenT m a -> GenT m (SVector a)
- dense :: (Monad m, Vector Vector a) => Int -> GenT m a -> GenT m (DVector a)
- normal2 :: Monad m => GenT m (DVector Double)
- circle2d :: Monad m => Double -> GenT m (DVector Double)
Construction
:: (Monad m, Inner SVector v) | |
=> Word64 | random seed |
-> Int | max tree depth |
-> Int | min leaf size |
-> Int | data chunk size |
-> Double | nonzero density of projection vectors |
-> Int | dimension of projection vectors |
-> ConduitT () (Embed v Double x) m () | data source |
-> m (RPTree Double () (Vector (Embed v Double x))) |
Populate a tree from a data stream
Assumptions on the data source:
- non-empty : contains at least one value
- stationary : each chunk is representative of the whole dataset
- bounded : we wait until the end of the stream to produce a result
:: (Monad m, Inner SVector v) | |
=> Word64 | random seed |
-> Int | max tree depth, \(l > 1\) |
-> Int | min leaf size, \(m_{leaf} > 1\) |
-> Int | number of trees, \(n_t > 1\) |
-> Int | data chunk size, \(n_{chunk} > 3\) |
-> Double | nonzero density of projection vectors, \(p_{nz} \in (0, 1)\) |
-> Int | dimension of projection vectors, \(d > 1\) |
-> ConduitT () (Embed v Double x) m () | data source |
-> m (RPForest Double (Vector (Embed v Double x))) |
Populate a forest from a data stream
Assumptions on the data source:
- non-empty : contains at least one value
- stationary : each chunk is representative of the whole dataset
- bounded : we wait until the end of the stream to produce a result
Parameters
:: Integral a | |
=> a | data size |
-> Int | vector dimension |
-> RPTreeConfig |
Configure the rp-tree forest construction process with some natural defaults
data RPTreeConfig Source #
RPCfg | |
|
Instances
Show RPTreeConfig Source # | |
Defined in Data.RPTree.Conduit showsPrec :: Int -> RPTreeConfig -> ShowS # show :: RPTreeConfig -> String # showList :: [RPTreeConfig] -> ShowS # |
Query
:: (Ord p, Inner SVector v, Unbox d, Real d) | |
=> (u d -> v d -> p) | distance function |
-> Int | k neighbors |
-> RPForest d (Vector (Embed u d x)) | random projection forest |
-> v d | query point |
-> Vector (p, Embed u d x) | ordered in increasing distance order to the query point |
Look up the \(k\) nearest neighbors to a query point
The supplied distance function d
must satisfy the definition of a metric, i.e.
- identity of indiscernible elements : \( d(x, y) = 0 \leftrightarrow x \equiv y \)
- symmetry : \( d(x, y) = d(y, x) \)
- triangle inequality : \( d(x, y) + d(y, z) \geq d(x, z) \)
I/O
:: (Serialise d, Serialise a, Unbox d) | |
=> RPForest d a | |
-> [ByteString] | the order is undefined |
Serialise each tree in the RPForest
as a separate bytestring
:: (Serialise d, Serialise a, Unbox d) | |
=> [ByteString] | |
-> Either String (RPForest d a) | the order is undefined |
Deserialise each tree in the RPForest
from a separate bytestring and reconstruct
Statistics
:: (Inner SVector v, Unbox d, Fractional a1, Ord d, Ord a2, Ord x, Ord (u d), Num d) | |
=> (u d -> v d -> a2) | |
-> RPForest d (Vector (Embed u d x)) | |
-> Int | k : number of nearest neighbors to consider |
-> v d | query point |
-> a1 |
average recall-at-k, computed over a set of trees
Access
Retrieve points nearest to the query
in case of a narrow margin, collect both branches of the tree
Validation
treeStats :: RPTree d l a -> RPTreeStats Source #
treeSize :: Foldable t => RPTree d l (t a) -> Int Source #
How many data items are stored in the RPTree
leafSizes :: Foldable t => RPTree d l (t a) -> RPT d l Int Source #
How many data items are stored in each leaf of the RPTree
data RPTreeStats Source #
Instances
Eq RPTreeStats Source # | |
Defined in Data.RPTree (==) :: RPTreeStats -> RPTreeStats -> Bool # (/=) :: RPTreeStats -> RPTreeStats -> Bool # | |
Show RPTreeStats Source # | |
Defined in Data.RPTree showsPrec :: Int -> RPTreeStats -> ShowS # show :: RPTreeStats -> String # showList :: [RPTreeStats] -> ShowS # |
Types
Pairing of a data item with its vector embedding
The vector is used internally for indexing
Instances
Functor (Embed v e) Source # | |
(Eq a, Eq (v e)) => Eq (Embed v e a) Source # | |
(Ord a, Ord (v e)) => Ord (Embed v e a) Source # | |
Defined in Data.RPTree.Internal | |
(Show (v e), Show e, Show a) => Show (Embed v e a) Source # | |
Generic (Embed v e a) Source # | |
(NFData (v e), NFData a) => NFData (Embed v e a) Source # | |
Defined in Data.RPTree.Internal | |
(Serialise (v e), Serialise a) => Serialise (Embed v e a) Source # | |
type Rep (Embed v e a) Source # | |
Defined in Data.RPTree.Internal type Rep (Embed v e a) = D1 ('MetaData "Embed" "Data.RPTree.Internal" "rp-tree-0.4-252TfFRJZSoFJQVWZvE0KV" 'False) (C1 ('MetaCons "Embed" 'PrefixI 'True) (S1 ('MetaSel ('Just "eEmbed") 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 (v e)) :*: S1 ('MetaSel ('Just "eData") 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 a))) |
RPTree
Random projection trees
The first type parameter corresponds to a floating point scalar value, the second labels every tree part (e.g. for visual rendering) and the third is the type of the data collected at the leaves of the tree (e.g. lists of vectors).
We keep them separate to leverage the (Bi-)Functor instance for postprocessing and visualization.
This implementation uses one projection vector per tree level (as suggested in https://www.cs.helsinki.fi/u/ttonteri/pub/bigdata2016.pdf ).
Instances
Functor (RPTree d l) Source # | |
Foldable (RPTree d l) Source # | |
Defined in Data.RPTree.Internal fold :: Monoid m => RPTree d l m -> m # foldMap :: Monoid m => (a -> m) -> RPTree d l a -> m # foldMap' :: Monoid m => (a -> m) -> RPTree d l a -> m # foldr :: (a -> b -> b) -> b -> RPTree d l a -> b # foldr' :: (a -> b -> b) -> b -> RPTree d l a -> b # foldl :: (b -> a -> b) -> b -> RPTree d l a -> b # foldl' :: (b -> a -> b) -> b -> RPTree d l a -> b # foldr1 :: (a -> a -> a) -> RPTree d l a -> a # foldl1 :: (a -> a -> a) -> RPTree d l a -> a # toList :: RPTree d l a -> [a] # null :: RPTree d l a -> Bool # length :: RPTree d l a -> Int # elem :: Eq a => a -> RPTree d l a -> Bool # maximum :: Ord a => RPTree d l a -> a # minimum :: Ord a => RPTree d l a -> a # | |
Traversable (RPTree d l) Source # | |
Defined in Data.RPTree.Internal | |
(Unbox d, Eq d, Eq l, Eq a) => Eq (RPTree d l a) Source # | |
(Unbox d, Show d, Show l, Show a) => Show (RPTree d l a) Source # | |
Generic (RPTree d l a) Source # | |
(NFData a, NFData l, NFData d) => NFData (RPTree d l a) Source # | |
Defined in Data.RPTree.Internal | |
(Serialise d, Serialise l, Serialise a, Unbox d) => Serialise (RPTree d l a) Source # | |
type Rep (RPTree d l a) Source # | |
Defined in Data.RPTree.Internal |
type RPForest d a = IntMap (RPTree d () a) Source #
A random projection forest is an ordered set of RPTree
s
This supports efficient updates of the ensemble in the streaming/online setting.
Vector types
Sparse
Sparse vectors with unboxed components
Instances
Scale SVector Source # | |
Inner SVector Vector Source # | |
Defined in Data.RPTree.Internal | |
Inner SVector DVector Source # | |
Defined in Data.RPTree.Internal | |
Inner SVector SVector Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Eq a) => Eq (SVector a) Source # | |
(Unbox a, Ord a) => Ord (SVector a) Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Show a) => Show (SVector a) Source # | |
Generic (SVector a) Source # | |
NFData (SVector a) Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Serialise a) => Serialise (SVector a) Source # | |
type Rep (SVector a) Source # | |
Defined in Data.RPTree.Internal type Rep (SVector a) = D1 ('MetaData "SVector" "Data.RPTree.Internal" "rp-tree-0.4-252TfFRJZSoFJQVWZvE0KV" 'False) (C1 ('MetaCons "SV" 'PrefixI 'True) (S1 ('MetaSel ('Just "svDim") 'SourceUnpack 'SourceStrict 'DecidedStrict) (Rec0 Int) :*: S1 ('MetaSel ('Just "svVec") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 (Vector (Int, a))))) |
fromListSv :: Unbox a => Int -> [(Int, a)] -> SVector a Source #
(Unsafe) Pack a SVector
from its vector dimension and components
Note : the relevant invariants are not checked :
- vector components are _assumed_ to be in increasing order
- vector dimension is larger than any component index
(Unsafe) Pack a SVector
from its vector dimension and components
Note : the relevant invariants are not checked :
- vector components are _assumed_ to be in increasing order
- vector dimension is larger than any component index
Dense
Dense vectors with unboxed components
Instances
Scale DVector Source # | |
Inner DVector DVector Source # | |
Defined in Data.RPTree.Internal | |
Inner SVector DVector Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Eq a) => Eq (DVector a) Source # | |
(Unbox a, Ord a) => Ord (DVector a) Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Show a) => Show (DVector a) Source # | |
Generic (DVector a) Source # | |
NFData (DVector a) Source # | |
Defined in Data.RPTree.Internal | |
(Unbox a, Serialise a) => Serialise (DVector a) Source # | |
type Rep (DVector a) Source # | |
Defined in Data.RPTree.Internal |
fromListDv :: Unbox a => [a] -> DVector a Source #
fromVectorDv :: Vector a -> DVector a Source #
Vector space typeclasses
class (Scale u, Scale v) => Inner u v where Source #
Inner product spaces
This typeclass is provided as a convenience for library users to interface their own vector types.
inner :: (Unbox a, Num a) => u a -> v a -> a Source #
metricL2 :: (Unbox a, Floating a) => u a -> v a -> a Source #
Helpers for implementing Inner
instances
Inner product
innerSS :: (Vector u (Int, a), Vector v (Int, a), Unbox a, Num a) => u (Int, a) -> v (Int, a) -> a Source #
sparse-sparse inner product
innerSD :: (Num a, Vector u (Int, a), Vector v a, Unbox a) => u (Int, a) -> v a -> a Source #
sparse-dense inner product
L2 distance
metricSSL2 :: (Floating a, Vector u a, Unbox a, Vector u (Int, a), Vector v (Int, a)) => u (Int, a) -> v (Int, a) -> a Source #
Vector distance induced by the L2 norm (sparse-sparse)
metricSDL2 :: (Floating a, Unbox a, Vector u (Int, a), Vector v a) => u (Int, a) -> v a -> a Source #
Vector distance induced by the L2 norm (sparse-dense)
Scale
Rendering
CSV
:: (Foldable t, Unbox a, Show a, Show b) | |
=> FilePath | path of output file |
-> t (Vector (DVector a, b)) | data point, label |
-> IO () |
Encode dataset as CSV and save into file
GraphViz dot
:: Ord t | |
=> (t -> String) | how to render the node content |
-> FilePath | path of output file |
-> String | graph name |
-> RPTree d x t | |
-> IO () |
tree to graphviz dot format
Testing
data BenchConfig Source #
BenchConfig | |
|
Instances
Show BenchConfig Source # | |
Defined in Data.RPTree.Internal.Testing showsPrec :: Int -> BenchConfig -> ShowS # show :: BenchConfig -> String # showList :: [BenchConfig] -> ShowS # |
Random generation
Conduit
:: Monad m | |
=> Int | number of vectors to generate |
-> GenT m a | random generator for the vector components |
-> ConduitT i a (GenT m) () |
Source of random data points
:: Monad m | |
=> Int | number of data points |
-> Int | vector dimension |
-> Double | nonzero density |
-> ConduitT i (SVector Double) (GenT m) () |
binary mixture of isotropic Gaussian rvs with sparse components
:: Monad m | |
=> Int | number of data points |
-> Int | vector dimension |
-> ConduitT i (DVector Double) (GenT m) () |
binary mixture of isotropic Gaussian rvs
Vector data
:: (Monad m, Unbox a) | |
=> Double | nonzero density |
-> Int | vector dimension |
-> GenT m a | random generator of vector components |
-> GenT m (SVector a) |
Generate a sparse random vector with a given nonzero density and components sampled from the supplied random generator