hieraclus- Automated clustering of arbitrary elements in Haskell.





Hieraclus is a library that supports clustering of arbitrary elements in haskell. The difference to the already existing cluster library hierarchical-clustering is the ability to work with abort criterias which allow an "intelligent" clustering. With the help of abort criterias the user can specify conditions that must be fulfilled in order to stop the clustering process.

Another motivation of creating this library was to make the cluster process run in O(n^2). However, the current implementation runs in O(n^2 * log n). It has to be mentioned that the real runtime complexity tends to grow faster due to memory management, I guess. Some profiling showed that there is quite a big amount of memory spent managing the maps. The principle idea was not to work with a matrix, but with two maps instead. The first map holds the mappings from cluster pairs to distances, the second map vice versa, thus allowing to find the minimal distance in O(log n) and not in O(n^2). Two make things more efficient the data to be clustered initially is transformed to vector space, as all clutering operations work in vector space. The actual clustering thus is done with the vector representations of the input data, which finally are transformed back.

The above mentioned information for the abort criterias, the maps and the element-mappings are carried through the cluster process in a cluster state. So the actual cluster process takes place within the state monad. However, the library offers a function cluster that is purely functional as it returns a tuple. First element of the tuple is the cluster result - simply implemented as list of list. The second element of the tuple holds the cluster information used by the abort criterias.


Cluster State

data ClusterState a b Source

the cluster state contains information about all relevant maps that are needed for the clustering and information about the clustering process. The ClusterState is passed around withing the state monad




minmap :: MinimumMap a

holds the mappings from distances to pairs

combis :: CombinationMap a

holds the mappings from pairs to distances

cinfo :: ClusterInfo a b

holds information of the clustering process that is needed by the Abort Criterias


(Show a, Show b) => Show (ClusterState a b) 

data ClusterInfo a b Source

the cluster process produces information about the clustering after each step. these information are given to functions that decide if the cluster process may continue or stop and return the results




idents :: Map (Vector a) b

holds the mapping from the representation vectors to its actual objects

nElems :: Int

the number of elements to be clustered

cNew :: (Cluster a, [Cluster a])

the new created cluster and the all other clusters

costs :: a

a quality factor of the current combining that indicates the "costs" of cNew

total :: a

the accmulated costs

cStep :: Int

the current clustering step

cHistory :: [a]

holds a history of all costs


(Show a, Show b) => Show (ClusterInfo a b) 

type ClusterResult a = [[a]]Source

the resulting clusters are represented as a lists

Cluster Map

newtype Cluster a Source

a Cluster is represented as a list of Vectors




vals :: [Vector a]


Show a => Show (Cluster a) 

type ClusterMap a = IntMap (Cluster a)Source

the Cluster map serves to represent unions of elements. Therefore it maps IDs to clusters.

type ID = KeySource

Unique ID for a cluster

singleton :: Maybe (Vector a) -> Cluster aSource

O(1) creates a cluster with only one element

fromList :: [Vector a] -> ClusterMap aSource

O(n) creates clusters by a given map

mergeClusters :: ID -> ID -> ClusterMap a -> State (ClusterState a b) (Cluster a, ClusterMap a, ClusterMap a)Source

merge two clusters given by their ids and return a tuple. The first element of the tuple is the new created cluster. The second element is the new resulting cluster structure

extractClusterElements :: Ord a => ClusterMap a -> State (ClusterState a b) [[b]]Source

extracts the original values from the cluster map. It runs in the state monad as it needs the mapping of vectors to original values.

Minimum and Combination Map

type MinimumMap a = MultiSet (a, Pair ID)Source

the minimum map saves the distance matrix as a multi set, because a distance can occur more than one times. The set allows to find a distance pair by its ids and is used to find the minimum distance in O(log n) Note: Alternatively one could use kind of a binary heap to find the minimum distance in O(1) Storage complexity is O(n^2)

type CombinationMap a = Map (Pair ID) aSource

Like the minimum map but with the pairs as the keys, thus allowing to find the distance of a given pair in O(log n). Storage complexity is O(n^2)

type Pair a = (a, a)Source

a pair of ID is used for mappings from and to distances between two clusters.

Abort Criterias

noAbort :: AbortCriterium a bSource

no abortion means that the cluster process is only limited by its maximum number of possible steps that is: n - 1 where n is the number of elements to be clustered

maxTotal :: Ord a => a -> AbortCriterium a bSource

defines the max. "costs" of a further combining of two clusters. This can be the increase of the euclidean distance e.g. as well as the varianceSum

nCluster :: Int -> AbortCriterium a bSource

sets a max. number of clusters

nSteps :: Int -> AbortCriterium a bSource

sets a number of steps that has to be done

calinski :: (Ord a, Floating a) => a -> AbortCriterium a bSource

defines a tolerance for the homogeneity of the clusters that is the relation of the inner varianceSum of the recently created cluster and the outer varianceSum of all other clusters Developed by Calinski and Habarasz, see:

ellbow :: (Ord a, Num a, Floating a) => Int -> a -> AbortCriterium a bSource

calculates the ellbow criterium that is to find a cluster steps which costs are above average. The first parameter gives a number of steps that are tolerated as a kind of stabilization phase. So if minSteps is set to k than ellbow criterium starts calculation average at step k+1. The second parameter gives the max. allowed multiple of average inclination

Cluster Methods

type DistanceFunction a = Vector a -> Vector a -> aSource

a distance function determines how to calculate the distance between two vectors

type SimilarityFunction a = [Vector a] -> aSource

calculates the difference of two clusters by comparing them as a whole, e.g. the sum of variances of the clusters can be used

singleLinkage :: (Ord a, Eq a) => DistanceFunction a -> ClusterFunction aSource

O(n^2 log n). Uses the single linkage method for clustering

completeLinkage :: (Ord a, Eq a) => DistanceFunction a -> ClusterFunction aSource

O(n^2 log n). Uses the complete linkage method for clustering

averageLinkage :: (Ord a, Floating a) => DistanceFunction a -> ClusterFunction aSource

O(n^2 log n). Uses the average linkage method for clustering

wardLinkage :: Ord a => SimilarityFunction a -> ClusterFunction aSource

O(n^2 log n). Uses the ward linkage method for clustering

Cluster Method Construction

clusterwise :: SimilarityFunction a -> ClusterFunction aSource

Cost Functions

addition :: Num a => CostFunction aSource

varianceSum :: Floating a => CostFunction aSource

Clustering Process

type Transformation a b = a -> Vector bSource

transforms the input data into a vector representation

cluster :: (Ord a, Num a) => Transformation b a -> ClusterFunction a -> CostFunction a -> [AbortCriterium a b] -> [b] -> (ClusterResult b, ClusterInfo a b)Source

runCluster :: (Ord a, Num a) => (b -> Vector a) -> ClusterFunction a -> CostFunction a -> [AbortCriterium a b] -> [b] -> State (ClusterState a b) (ClusterMap a)Source

a wrapper for the acutal clustering function running in the state monad receiving the needed parameters to transform them for it