kmeans-vector-0.3.2: An implementation of the kmeans clustering algorithm based on the vector package

Copyright: (c) Alp Mestanogullari, Ville Tirronen, 2011-2015
License: BSD3
Maintainer: Alp Mestanogullari <alpmestan@gmail.com>
Stability: experimental
Safe Haskell: None
Language: Haskell98

Math.KMeans


Description

An implementation of the k-means clustering algorithm based on the vector package.

The core functions of this module are kmeans and kmeansWith. See some examples on GitHub.


The meat of this package: kmeans

kmeans
  :: (a -> Vector Double)  -- feature extraction
  -> Distance              -- distance function
  -> Int                   -- the k to run k-means with (i.e. number of desired clusters)
  -> [a]                   -- input list of points
  -> Clusters a            -- result, hopefully k clusters of points

Run the kmeans clustering algorithm.

kmeans f distance k points

will run the algorithm, using f to extract features from your type and distance to measure the distance between vectors, trying to separate the points into k clusters.

Extracting features just means producing a Vector of Double coordinates that represents your value in the space in which k-means will run.
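
As a quick illustration, here is a minimal sketch of a call to kmeans on points that are already vectors, so feature extraction is just id. The sample data is made up, and the imports assume unboxed coordinate vectors and a boxed Vector of clusters; adjust them to whichever Vector types the module actually uses.

  import Math.KMeans
  import qualified Data.Vector as V
  import qualified Data.Vector.Unboxed as U

  -- Four made-up 2-D points: two near (1,1), two near (8,8).
  points :: [U.Vector Double]
  points = map U.fromList [[1, 1], [1.5, 2], [8, 8], [8.5, 9]]

  main :: IO ()
  main = do
    -- The points are already vectors, so 'id' is the feature extraction.
    let cs = kmeans id euclidSq 2 points
    -- Print the members of each of the (hopefully 2) clusters.
    mapM_ (print . elements) (V.toList cs)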

kmeansWith
  :: Monad m
  => (Int -> [a] -> m (Clusters a))  -- how should we partition the points?
  -> (a -> Vector Double)            -- get the coordinates of a "point"
  -> Distance                        -- what distance do we use
  -> Int                             -- number of desired clusters
  -> [a]                             -- list of points
  -> m (Clusters a)                  -- resulting clustering

Same as kmeans, except that instead of using partition, you supply your own function for choosing the initial clustering. Two important things to note:

  • If you don't need any kind of effect and just have a partition-like function you want to use, m can just be Identity here (see the sketch after this list). If that's too obnoxious to work with, please let me know and I may just provide a separate kmeansWith function with no m there. But most of the time, you'll probably just be interested in the following scenario.
  • Most likely, you want something smarter than our simple partition function. A couple of papers I have read claim very decent results from using precise probabilistic schemes for the initial partitioning. In that case, your m would probably be IO or ST (e.g. using my probable package) and you could fine-tune the way the initial clusters are picked so that the algorithm gives better results. Of course, if your initialization is monadic, so is the result.
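
As a minimal sketch of the first point above, you can recover a pure result by instantiating m at Identity, here simply reusing the built-in partition strategy (the signature below assumes unboxed coordinate vectors):

  import Math.KMeans
  import Data.Functor.Identity (Identity (..))
  import qualified Data.Vector.Unboxed as U

  -- Run kmeansWith purely: wrap the plain 'partition' strategy in Identity,
  -- then unwrap the Identity around the resulting clustering.
  clustersPure :: [U.Vector Double] -> Clusters (U.Vector Double)
  clustersPure pts =
    runIdentity (kmeansWith (\k xs -> Identity (partition k xs)) id euclidSq 2 pts)

For the second point, you would instead pass an IO (or ST) action that picks the initial clusters, and the result of kmeansWith would then live in IO (or ST) as well.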

Types

type Distance = Vector Double -> Vector Double -> Double

A distance on vectors
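
Since a Distance is just an ordinary function on two vectors, you can supply your own. As an illustration, here is a sketch of a hypothetical per-coordinate weighted squared distance; weightedSq and its weights vector are not part of the package, and the unboxed-vector import is an assumption:

  import Math.KMeans (Distance)
  import qualified Data.Vector.Unboxed as U

  -- Hypothetical: a squared euclidean distance that weights each coordinate.
  weightedSq :: U.Vector Double -> Distance
  weightedSq ws v1 v2 =
    U.sum (U.zipWith3 (\w a b -> w * (a - b) ^ (2 :: Int)) ws v1 v2)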

type Clusters a = Vector (Cluster a)

This is what kmeans hands you back. It's just a Vector of clusters that will hopefully be of length k.

newtype Cluster a = Cluster { elements :: [a] }

A Cluster of points is just a list of points; the elements field holds the elements that belong to that cluster.
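
To consume a result you typically unwrap the outer Vector and then the Cluster newtype; a small sketch, assuming the Vector in Clusters is the boxed one from Data.Vector:

  import Math.KMeans (Clusters, Cluster (..))
  import qualified Data.Vector as V

  -- Turn a clustering result back into plain lists of points.
  clusterLists :: Clusters a -> [[a]]
  clusterLists = map elements . V.toList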

type Centroids = Vector (Vector Double)

This type is used internally by kmeans. It represents our (hopefully) k centroids, each obtained by computing the centroid of a Cluster.

Misc.

partition :: Int -> [a] -> Clusters a

This is the current partitioning strategy used by kmeans. If we want k clusters, we just group consecutive elements of the input into k buckets.
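
In other words, it just chops the input list into k consecutive chunks. The following is only a rough sketch of that idea, not the actual implementation (chunk sizes and edge cases may differ):

  import Math.KMeans (Clusters, Cluster (..))
  import qualified Data.Vector as V

  -- Roughly: split the list into (at most) k consecutive buckets of
  -- approximately equal size and wrap each bucket in a Cluster.
  partitionSketch :: Int -> [a] -> Clusters a
  partitionSketch k xs = V.fromList (map Cluster (go xs))
    where
      n = max 1 (ceiling (fromIntegral (length xs) / fromIntegral k :: Double))
      go [] = []
      go ys = let (chunk, rest) = splitAt n ys in chunk : go rest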

euclidSq :: Distance

The euclidean distance, without taking the final square root: taking the root would only waste cycles without changing the behavior of the algorithm.
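
Concretely, this amounts to d(v1, v2) = sum over i of (v1_i - v2_i)^2. A sketch of such a distance, assuming unboxed vectors (the real definition may differ in details):

  import qualified Data.Vector.Unboxed as U

  -- Squared euclidean distance: no final sqrt, since only the ordering of
  -- distances matters to the clustering step.
  euclidSqSketch :: U.Vector Double -> U.Vector Double -> Double
  euclidSqSketch v1 v2 = U.sum (U.zipWith (\a b -> (a - b) ^ (2 :: Int)) v1 v2)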

l1dist :: Distance

L1 distance of two vectors: d(v1, v2) = sum over i of |v1_i - v2_i|

linfdist :: Distance

L-inf distance of two vectors: d(v1, v2) = max over i of |v1_i - v2_i|