Portability | portable |
---|---|
Stability | experimental |
Maintainer | libraries@haskell.org |
Hieraclus is a library that supports clustering of arbitrary elements in haskell. The difference to the already existing cluster library hierarchical-clustering is the ability to work with abort criterias which allow an "intelligent" clustering. With the help of abort criterias the user can specify conditions that must be fulfilled in order to stop the clustering process.
Another motivation of creating this library was to make the cluster process run in O(n^2). However, the current implementation runs in O(n^2 * log n). It has to be mentioned that the real runtime complexity tends to grow faster due to memory management, I guess. Some profiling showed that there is quite a big amount of memory spent managing the maps. The principle idea was not to work with a matrix, but with two maps instead. The first map holds the mappings from cluster pairs to distances, the second map vice versa, thus allowing to find the minimal distance in O(log n) and not in O(n^2). Two make things more efficient the data to be clustered initially is transformed to vector space, as all clutering operations work in vector space. The actual clustering thus is done with the vector representations of the input data, which finally are transformed back.
The above mentioned information for the abort criterias, the maps and the element-mappings are carried through
the cluster process in a cluster state. So the actual cluster process takes place within the state monad.
However, the library offers a function cluster
that is purely functional as it returns a tuple.
First element of the tuple is the cluster result - simply implemented as list of list.
The second element of the tuple holds the cluster information used by the abort criterias.
- data ClusterState a b = CS {
- minmap :: MinimumMap a
- combis :: CombinationMap a
- cinfo :: ClusterInfo a b
- data ClusterInfo a b = CI {}
- type ClusterResult a = [[a]]
- newtype Cluster a = Cluster {}
- type ClusterMap a = IntMap (Cluster a)
- type ID = Key
- singleton :: Maybe (Vector a) -> Cluster a
- fromList :: [Vector a] -> ClusterMap a
- getCluster :: ClusterMap a -> ID -> Maybe (Cluster a)
- getClusterUnsafe :: ClusterMap a -> ID -> Cluster a
- mergeClusters :: ID -> ID -> ClusterMap a -> State (ClusterState a b) (Cluster a, ClusterMap a, ClusterMap a)
- extractClusterElements :: Ord a => ClusterMap a -> State (ClusterState a b) [[b]]
- type MinimumMap a = MultiSet (a, Pair ID)
- type CombinationMap a = Map (Pair ID) a
- type Pair a = (a, a)
- noAbort :: AbortCriterium a b
- maxTotal :: Ord a => a -> AbortCriterium a b
- nCluster :: Int -> AbortCriterium a b
- nSteps :: Int -> AbortCriterium a b
- calinski :: (Ord a, Floating a) => a -> AbortCriterium a b
- ellbow :: (Ord a, Num a, Floating a) => Int -> a -> AbortCriterium a b
- type DistanceFunction a = Vector a -> Vector a -> a
- type SimilarityFunction a = [Vector a] -> a
- singleLinkage :: (Ord a, Eq a) => DistanceFunction a -> ClusterFunction a
- completeLinkage :: (Ord a, Eq a) => DistanceFunction a -> ClusterFunction a
- averageLinkage :: (Ord a, Floating a) => DistanceFunction a -> ClusterFunction a
- wardLinkage :: Ord a => SimilarityFunction a -> ClusterFunction a
- pairwise :: Ord a => DistanceFunction a -> Cluster a -> Cluster a -> [a]
- clusterwise :: SimilarityFunction a -> ClusterFunction a
- addition :: Num a => CostFunction a
- varianceSum :: Floating a => CostFunction a
- type Transformation a b = a -> Vector b
- cluster :: (Ord a, Num a) => Transformation b a -> ClusterFunction a -> CostFunction a -> [AbortCriterium a b] -> [b] -> (ClusterResult b, ClusterInfo a b)
- runCluster :: (Ord a, Num a) => (b -> Vector a) -> ClusterFunction a -> CostFunction a -> [AbortCriterium a b] -> [b] -> State (ClusterState a b) (ClusterMap a)
Cluster State
data ClusterState a b Source
the cluster state contains information about all relevant maps that are needed for the clustering and information about the clustering process. The ClusterState is passed around withing the state monad
CS | |
|
(Show a, Show b) => Show (ClusterState a b) |
data ClusterInfo a b Source
the cluster process produces information about the clustering after each step. these information are given to functions that decide if the cluster process may continue or stop and return the results
CI | |
|
(Show a, Show b) => Show (ClusterInfo a b) |
type ClusterResult a = [[a]]Source
the resulting clusters are represented as a lists
Cluster Map
a Cluster is represented as a list of Vectors
type ClusterMap a = IntMap (Cluster a)Source
the Cluster map serves to represent unions of elements. Therefore it maps IDs to clusters.
fromList :: [Vector a] -> ClusterMap aSource
O(n) creates clusters by a given map
getCluster :: ClusterMap a -> ID -> Maybe (Cluster a)Source
getClusterUnsafe :: ClusterMap a -> ID -> Cluster aSource
mergeClusters :: ID -> ID -> ClusterMap a -> State (ClusterState a b) (Cluster a, ClusterMap a, ClusterMap a)Source
merge two clusters given by their ids and return a tuple. The first element of the tuple is the new created cluster. The second element is the new resulting cluster structure
extractClusterElements :: Ord a => ClusterMap a -> State (ClusterState a b) [[b]]Source
extracts the original values from the cluster map. It runs in the state monad as it needs the mapping of vectors to original values.
Minimum and Combination Map
type MinimumMap a = MultiSet (a, Pair ID)Source
the minimum map saves the distance matrix as a multi set, because a distance can occur more than one times. The set allows to find a distance pair by its ids and is used to find the minimum distance in O(log n) Note: Alternatively one could use kind of a binary heap to find the minimum distance in O(1) Storage complexity is O(n^2)
type CombinationMap a = Map (Pair ID) aSource
Like the minimum map but with the pairs as the keys, thus allowing to find the distance of a given pair in O(log n). Storage complexity is O(n^2)
a pair of ID is used for mappings from and to distances between two clusters.
Abort Criterias
noAbort :: AbortCriterium a bSource
no abortion means that the cluster process is only limited by its maximum number of possible steps that is: n - 1 where n is the number of elements to be clustered
maxTotal :: Ord a => a -> AbortCriterium a bSource
defines the max. "costs" of a further combining of two clusters. This can be the increase of the euclidean distance e.g. as well as the varianceSum
calinski :: (Ord a, Floating a) => a -> AbortCriterium a bSource
defines a tolerance for the homogeneity of the clusters that is the relation of the inner varianceSum of the recently created cluster and the outer varianceSum of all other clusters Developed by Calinski and Habarasz, see:
ellbow :: (Ord a, Num a, Floating a) => Int -> a -> AbortCriterium a bSource
calculates the ellbow criterium that is to find a cluster steps which costs are above average. The first parameter gives a number of steps that are tolerated as a kind of stabilization phase. So if minSteps is set to k than ellbow criterium starts calculation average at step k+1. The second parameter gives the max. allowed multiple of average inclination
Cluster Methods
type DistanceFunction a = Vector a -> Vector a -> aSource
a distance function determines how to calculate the distance between two vectors
type SimilarityFunction a = [Vector a] -> aSource
calculates the difference of two clusters by comparing them as a whole, e.g. the sum of variances of the clusters can be used
singleLinkage :: (Ord a, Eq a) => DistanceFunction a -> ClusterFunction aSource
O(n^2 log n). Uses the single linkage method for clustering
completeLinkage :: (Ord a, Eq a) => DistanceFunction a -> ClusterFunction aSource
O(n^2 log n). Uses the complete linkage method for clustering
averageLinkage :: (Ord a, Floating a) => DistanceFunction a -> ClusterFunction aSource
O(n^2 log n). Uses the average linkage method for clustering
wardLinkage :: Ord a => SimilarityFunction a -> ClusterFunction aSource
O(n^2 log n). Uses the ward linkage method for clustering
Cluster Method Construction
clusterwise :: SimilarityFunction a -> ClusterFunction aSource
Cost Functions
varianceSum :: Floating a => CostFunction aSource
Clustering Process
type Transformation a b = a -> Vector bSource
transforms the input data into a vector representation
cluster :: (Ord a, Num a) => Transformation b a -> ClusterFunction a -> CostFunction a -> [AbortCriterium a b] -> [b] -> (ClusterResult b, ClusterInfo a b)Source
runCluster :: (Ord a, Num a) => (b -> Vector a) -> ClusterFunction a -> CostFunction a -> [AbortCriterium a b] -> [b] -> State (ClusterState a b) (ClusterMap a)Source
a wrapper for the acutal clustering function running in the state monad receiving the needed parameters to transform them for it