kmeans-vector-0.3.2: An implementation of the kmeans clustering algorithm based on the vector package

Copyright (c) Alp Mestanogullari, Ville Tirronen, 2011-2015 BSD3 Alp Mestanogullari experimental None Haskell98

Math.KMeans

Description

An implementation of the k-means clustering algorithm based on the vector package.

The core functions of this module are `kmeans` and `kmeansWith`. See some examples on github.

Synopsis

# The meat of this package: `kmeans`

Arguments

 :: (a -> Vector Double) feature extraction -> Distance distance function -> Int the `k` to run `k`-means with (i.e number of desired clusters) -> [a] input list of `points` -> Clusters a result, hopefully `k` clusters of points

Run the kmeans clustering algorithm.

`kmeans f distance k points`

will run the algorithm using `f` to extract features from your type, using `distance` to measure the distance between vectors, trying to separate `points` in `k` clusters.

Extracting features just means getting a `Vector` with `Double` coordinates that will represent your type in the space in which `kmeans` will run.

Arguments

 :: Monad m => (Int -> [a] -> m (Clusters a)) how should we partition the points? -> (a -> Vector Double) get the coordinates of a "point" -> Distance what distance do we use -> Int number of desired clusters -> [a] list of points -> m (Clusters a) resulting clustering

Same as `kmeans`, except that instead of using `partition`, you supply your own function for choosing the initial clustering. Two important things to note:

• If you don't need any kind of effect and just have a `partition`-like function you want to use, `m` will can just be `Identity` here. If that's too obnoxious to work with, please let me know and I may just provide a separate `kmeansWith` function with no there. But most of the time, you'll probably just be interested in the following scenario.
• Most likely, you want to have something smarter than our simple `partition` function. A couple of papers I have read claim very decent results by using some precise probabilistic schemas for the initial partitionning. In this case, your `m` would probably be `IO` or `ST` (e.g using my probable package) and you could fine-tune the way the initial clusters are picked so that the algorithm may give better results. Of course, if your initialization is monadic, so is the result.

# Types

A distance on vectors

type Clusters a = Vector (Cluster a) Source

This is what `kmeans` hands you back. It's just a `Vector` of clusters that will hopefully be of length `k`.

newtype Cluster a Source

A `Cluster` of points is just a list of points

Constructors

 Cluster Fieldselements :: [a]elements that belong to that cluster

Instances

 Functor Cluster Foldable Cluster Traversable Cluster Eq a => Eq (Cluster a) Show a => Show (Cluster a) Monoid (Cluster a)

This type is used internally by `kmeans`. It represents our (hopefully) `k` centroids, obtained by computing the new centroids of a `Cluster`

# Misc.

partition :: Int -> [a] -> Clusters a Source

This is the current partitionning strategy used by `kmeans`. If we want `k` clusters, we just try to regroup consecutive elements in `k` buckets

The euclidean distance without taking the final square root This would waste cycles without changing the behavior of the algorithm

L1 distance of two vectors: d(v1, v2) = sum on i of |v1_i - v2_i|

L-inf distance of two vectors: d(v1, v2) = max |v1_i - v2_i]