Copyright | (c) Alp Mestanogullari, Ville Tirronen, 2011-2015 |
---|---|

License | BSD3 |

Maintainer | Alp Mestanogullari <alpmestan@gmail.com> |

Stability | experimental |

Safe Haskell | None |

Language | Haskell98 |

An implementation of the k-means clustering algorithm based on the vector package.

The core functions of this module are `kmeans`

and `kmeansWith`

. See some examples
on github.

- kmeans :: (a -> Vector Double) -> Distance -> Int -> [a] -> Clusters a
- kmeansWith :: Monad m => (Int -> [a] -> m (Clusters a)) -> (a -> Vector Double) -> Distance -> Int -> [a] -> m (Clusters a)
- type Distance = Vector Double -> Vector Double -> Double
- type Clusters a = Vector (Cluster a)
- newtype Cluster a = Cluster {
- elements :: [a]

- type Centroids = Vector (Vector Double)
- partition :: Int -> [a] -> Clusters a
- euclidSq :: Distance
- l1dist :: Distance
- linfdist :: Distance

# The meat of this package: `kmeans`

:: (a -> Vector Double) | feature extraction |

-> Distance | distance function |

-> Int | the |

-> [a] | input list of |

-> Clusters a | result, hopefully |

Run the kmeans clustering algorithm.

kmeans f distance k points

will run the algorithm using `f`

to extract features from your type,
using `distance`

to measure the distance between vectors,
trying to separate `points`

in `k`

clusters.

Extracting features just means getting a `Vector`

with `Double`

coordinates that will represent your type
in the space in which `kmeans`

will run.

:: Monad m | |

=> (Int -> [a] -> m (Clusters a)) | how should we partition the points? |

-> (a -> Vector Double) | get the coordinates of a "point" |

-> Distance | what distance do we use |

-> Int | number of desired clusters |

-> [a] | list of points |

-> m (Clusters a) | resulting clustering |

Same as `kmeans`

, except that instead of using `partition`

, you supply your own
function for choosing the initial clustering. Two important things to note:

- If you don't need any kind of effect and just have a
`partition`

-like function you want to use,`m`

will can just be`Identity`

here. If that's too obnoxious to work with, please let me know and I may just provide a separate`kmeansWith`

function with no there. But most of the time, you'll probably just be interested in the following scenario. - Most likely, you want to have something smarter than our simple
`partition`

function. A couple of papers I have read claim very decent results by using some precise probabilistic schemas for the initial partitionning. In this case, your`m`

would probably be`IO`

or`ST`

(e.g using my probable package) and you could fine-tune the way the initial clusters are picked so that the algorithm may give better results. Of course, if your initialization is monadic, so is the result.

# Types

A `Cluster`

of points is just a list of points

# Misc.

partition :: Int -> [a] -> Clusters a Source

This is the current partitionning strategy used
by `kmeans`

. If we want `k`

clusters, we just
try to regroup consecutive elements in `k`

buckets

The euclidean distance without taking the final square root This would waste cycles without changing the behavior of the algorithm