kmeans-vector-0.3.2: An implementation of the kmeans clustering algorithm based on the vector package

Copyright: (c) Alp Mestanogullari, Ville Tirronen, 2011-2015
License: BSD3
Maintainer: Alp Mestanogullari <alpmestan@gmail.com>
Stability: experimental
Safe Haskell: None
Language: Haskell98

Math.KMeans


Description

An implementation of the k-means clustering algorithm based on the vector package.

The core functions of this module are kmeans and kmeansWith. See some examples on GitHub.


The meat of this package: kmeans

kmeans
  :: (a -> Vector Double)  -- feature extraction
  -> Distance              -- distance function
  -> Int                   -- the k to run k-means with (i.e. number of desired clusters)
  -> [a]                   -- input list of points
  -> Clusters a            -- result, hopefully k clusters of points

Run the kmeans clustering algorithm.

kmeans f distance k points

will run the algorithm, using f to extract features from your type and distance to measure the distance between vectors, trying to separate the points into k clusters.

Extracting features just means producing a Vector of Double coordinates that represents your value in the space in which k-means will run.
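
As a quick illustration, here is a minimal sketch of a call to kmeans on points that are already vectors, so feature extraction is just id. The sample data is made up, and the imports assume unboxed coordinate vectors and a boxed Vector of clusters; adjust them to whichever Vector types the module actually uses.

  import Math.KMeans
  import qualified Data.Vector as V
  import qualified Data.Vector.Unboxed as U

  -- Four made-up 2-D points: two near (1,1), two near (8,8).
  points :: [U.Vector Double]
  points = map U.fromList [[1, 1], [1.5, 2], [8, 8], [8.5, 9]]

  main :: IO ()
  main = do
    -- The points are already vectors, so 'id' is the feature extraction.
    let cs = kmeans id euclidSq 2 points
    -- Print the members of each of the (hopefully 2) clusters.
    mapM_ (print . elements) (V.toList cs)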

kmeansWith
  :: Monad m
  => (Int -> [a] -> m (Clusters a))  -- how should we partition the points?
  -> (a -> Vector Double)            -- get the coordinates of a "point"
  -> Distance                        -- what distance do we use
  -> Int                             -- number of desired clusters
  -> [a]                             -- list of points
  -> m (Clusters a)                  -- resulting clustering

Same as kmeans, except that instead of using partition, you supply your own function for choosing the initial clustering. Two important things to note:

  • If you don't need any kind of effect and just have a partition-like function you want to use, m can just be Identity here (see the sketch after this list). If that's too obnoxious to work with, please let me know and I may just provide a separate kmeansWith function with no m there. But most of the time, you'll probably just be interested in the following scenario.
  • Most likely, you want something smarter than our simple partition function. A couple of papers I have read claim very decent results from using precise probabilistic schemes for the initial partitioning. In that case, your m would probably be IO or ST (e.g. using my probable package) and you could fine-tune the way the initial clusters are picked so that the algorithm gives better results. Of course, if your initialization is monadic, so is the result.
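
As a minimal sketch of the first point above, you can recover a pure result by instantiating m at Identity, here simply reusing the built-in partition strategy (the signature below assumes unboxed coordinate vectors):

  import Math.KMeans
  import Data.Functor.Identity (Identity (..))
  import qualified Data.Vector.Unboxed as U

  -- Run kmeansWith purely: wrap the plain 'partition' strategy in Identity,
  -- then unwrap the Identity around the resulting clustering.
  clustersPure :: [U.Vector Double] -> Clusters (U.Vector Double)
  clustersPure pts =
    runIdentity (kmeansWith (\k xs -> Identity (partition k xs)) id euclidSq 2 pts)

For the second point, you would instead pass an IO (or ST) action that picks the initial clusters, and the result of kmeansWith would then live in IO (or ST) as well.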

Types

type Distance = Vector Double -> Vector Double -> Double

A distance on vectors
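
Since a Distance is just an ordinary function on two vectors, you can supply your own. As an illustration, here is a sketch of a hypothetical per-coordinate weighted squared distance; weightedSq and its weights vector are not part of the package, and the unboxed-vector import is an assumption:

  import Math.KMeans (Distance)
  import qualified Data.Vector.Unboxed as U

  -- Hypothetical: a squared euclidean distance that weights each coordinate.
  weightedSq :: U.Vector Double -> Distance
  weightedSq ws v1 v2 =
    U.sum (U.zipWith3 (\w a b -> w * (a - b) ^ (2 :: Int)) ws v1 v2)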

type Clusters a = Vector (Cluster a)

This is what kmeans hands you back. It's just a Vector of clusters that will hopefully be of length k.

newtype Cluster a = Cluster { elements :: [a] }

A Cluster of points is just a list of points; the elements field holds the elements that belong to that cluster.
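
To consume a result you typically unwrap the outer Vector and then the Cluster newtype; a small sketch, assuming the Vector in Clusters is the boxed one from Data.Vector:

  import Math.KMeans (Clusters, Cluster (..))
  import qualified Data.Vector as V

  -- Turn a clustering result back into plain lists of points.
  clusterLists :: Clusters a -> [[a]]
  clusterLists = map elements . V.toList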

type Centroids = Vector (Vector Double)

This type is used internally by kmeans. It represents our (hopefully) k centroids, each obtained by computing the centroid of a Cluster.

Misc.

partition :: Int -> [a] -> Clusters a

This is the current partitioning strategy used by kmeans. If we want k clusters, we just group consecutive elements of the input into k buckets.
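
In other words, it just chops the input list into k consecutive chunks. The following is only a rough sketch of that idea, not the actual implementation (chunk sizes and edge cases may differ):

  import Math.KMeans (Clusters, Cluster (..))
  import qualified Data.Vector as V

  -- Roughly: split the list into (at most) k consecutive buckets of
  -- approximately equal size and wrap each bucket in a Cluster.
  partitionSketch :: Int -> [a] -> Clusters a
  partitionSketch k xs = V.fromList (map Cluster (go xs))
    where
      n = max 1 (ceiling (fromIntegral (length xs) / fromIntegral k :: Double))
      go [] = []
      go ys = let (chunk, rest) = splitAt n ys in chunk : go rest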

euclidSq :: Distance

The euclidean distance, without taking the final square root: taking the root would only waste cycles without changing the behavior of the algorithm.
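
Concretely, this amounts to d(v1, v2) = sum over i of (v1_i - v2_i)^2. A sketch of such a distance, assuming unboxed vectors (the real definition may differ in details):

  import qualified Data.Vector.Unboxed as U

  -- Squared euclidean distance: no final sqrt, since only the ordering of
  -- distances matters to the clustering step.
  euclidSqSketch :: U.Vector Double -> U.Vector Double -> Double
  euclidSqSketch v1 v2 = U.sum (U.zipWith (\a b -> (a - b) ^ (2 :: Int)) v1 v2)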

l1dist :: Distance

L1 distance of two vectors: d(v1, v2) = sum over i of |v1_i - v2_i|

linfdist :: Distance

L-inf distance of two vectors: d(v1, v2) = max over i of |v1_i - v2_i|