kmeans-vector-0.3.2: An implementation of the kmeans clustering algorithm based on the vector package

Copyright (c) Alp Mestanogullari, Ville Tirronen, 2011-2015 BSD3 Alp Mestanogullari experimental None Haskell98

Math.KMeans

Description

An implementation of the k-means clustering algorithm based on the vector package.

The core functions of this module are kmeans and kmeansWith. See some examples on github.

Synopsis

# The meat of this package: kmeans

Arguments

 :: (a -> Vector Double) feature extraction -> Distance distance function -> Int the k to run k-means with (i.e number of desired clusters) -> [a] input list of points -> Clusters a result, hopefully k clusters of points

Run the kmeans clustering algorithm.

kmeans f distance k points

will run the algorithm using f to extract features from your type, using distance to measure the distance between vectors, trying to separate points in k clusters.

Extracting features just means getting a Vector with Double coordinates that will represent your type in the space in which kmeans will run.

Arguments

 :: Monad m => (Int -> [a] -> m (Clusters a)) how should we partition the points? -> (a -> Vector Double) get the coordinates of a "point" -> Distance what distance do we use -> Int number of desired clusters -> [a] list of points -> m (Clusters a) resulting clustering

Same as kmeans, except that instead of using partition, you supply your own function for choosing the initial clustering. Two important things to note:

• If you don't need any kind of effect and just have a partition-like function you want to use, m will can just be Identity here. If that's too obnoxious to work with, please let me know and I may just provide a separate kmeansWith function with no there. But most of the time, you'll probably just be interested in the following scenario.
• Most likely, you want to have something smarter than our simple partition function. A couple of papers I have read claim very decent results by using some precise probabilistic schemas for the initial partitionning. In this case, your m would probably be IO or ST (e.g using my probable package) and you could fine-tune the way the initial clusters are picked so that the algorithm may give better results. Of course, if your initialization is monadic, so is the result.

# Types

A distance on vectors

type Clusters a = Vector (Cluster a) Source

This is what kmeans hands you back. It's just a Vector of clusters that will hopefully be of length k.

newtype Cluster a Source

A Cluster of points is just a list of points

Constructors

 Cluster Fieldselements :: [a]elements that belong to that cluster

Instances

 Functor Cluster Foldable Cluster Traversable Cluster Eq a => Eq (Cluster a) Show a => Show (Cluster a) Monoid (Cluster a)

This type is used internally by kmeans. It represents our (hopefully) k centroids, obtained by computing the new centroids of a Cluster

# Misc.

partition :: Int -> [a] -> Clusters a Source

This is the current partitionning strategy used by kmeans. If we want k clusters, we just try to regroup consecutive elements in k buckets

The euclidean distance without taking the final square root This would waste cycles without changing the behavior of the algorithm

L1 distance of two vectors: d(v1, v2) = sum on i of |v1_i - v2_i|

L-inf distance of two vectors: d(v1, v2) = max |v1_i - v2_i]