hsgsom-0.2.0: An implementation of the GSOM clustering algorithm.

Portability: non-portable (requires STM)
Stability: experimental
Maintainer: gnn.github@gmail.com

Data.Datamining.Clustering.Gsom

Description

This module should contain everything you need to run the GSOM clustering algorithm. It collects and re-exports all important and needed functions from modules lower in the hierarchy.

Ideally you should never need to look at those modules. If you do need to do this, it is a design failure and I would appreciate it if you would drop me a note.

Synopsis

# Algorithm Input

The GSOM algorithm expects a list of input vectors as its input. These input vectors are just lists of `Double`s.

type Input = [Double]

Input vectors are represented as lists of Doubles.

## Input processing

Calculating the dimension of a given set of inputs just means finding the length of the longest input vector.

type Bounds = [(Double, Double)]

The bounds of a list of inputs. Having the tuple `(a,b)` at index `i` in `bounds` means that the value at index `i` of each of the input vectors which were used to calculate `bounds` is from the interval `[a,b]`.

Calculates the bounds of the input vector components.

Normalizes input vectors. `normalize inputs` takes the given list of input vectors `inputs` and returns a list of input vectors where each component is in `[0,1]`. If you want to unnormalize the input vectors use `bounds` and `unnormalize`.

Unnormalizes the given input vectors `inputs`, assuming that their bounds previously were `bounds`.
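
The input-processing functions above can be sketched as plain list operations. This is an illustrative sketch only: the names `boundsOf`, `normalizeWith` and `unnormalizeWith` are placeholders rather than the library's exported names, and all input vectors are assumed to have equal length.

```haskell
type Input'  = [Double]
type Bounds' = [(Double, Double)]

-- Component-wise bounds of a list of input vectors.
boundsOf :: [Input'] -> Bounds'
boundsOf = foldr1 (zipWith merge) . map (map (\x -> (x, x)))
  where merge (a, b) (c, d) = (min a c, max b d)

-- Map each component from its interval [a,b] onto [0,1].
normalizeWith :: Bounds' -> Input' -> Input'
normalizeWith bs = zipWith f bs
  where f (a, b) x
          | b == a    = 0.5                -- degenerate interval [a,a]
          | otherwise = (x - a) / (b - a)

-- Inverse of normalizeWith for the same bounds.
unnormalizeWith :: Bounds' -> Input' -> Input'
unnormalizeWith bs = zipWith (\(a, b) x -> a + x * (b - a)) bs
```

For example, `boundsOf [[1,2],[3,0]]` yields `[(1,3),(0,2)]`, and unnormalizing a normalized vector with the same bounds recovers the original.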

## Operations between input vectors

`distance i1 i2` calculates the euclidean distance between `i1` and `i2`. If `i1` and `i2` have different lengths, excess values are ignored.

Multiplication of an input vector with a scalar.

Subtraction and addition of vectors between each other.
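
The vector operations described above amount to a few lines of list code. The operator names below are illustrative, not the library's:

```haskell
-- Euclidean distance; zipWith stops at the shorter list,
-- so excess values are ignored, as documented.
distance' :: [Double] -> [Double] -> Double
distance' i1 i2 = sqrt . sum $ zipWith (\a b -> (a - b) ^ 2) i1 i2

-- Scalar multiplication of an input vector.
(.*) :: Double -> [Double] -> [Double]
c .* v = map (c *) v

-- Component-wise addition and subtraction.
(<+>), (<->) :: [Double] -> [Double] -> [Double]
(<+>) = zipWith (+)
(<->) = zipWith (-)
```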

# The map created by GSOM

The growing self organizing map builds, as its name suggests, a map representing the input vectors. This map is of type `Lattice` and is a network of `Node`s.

## The `Node`s of the map

data Node

The type of nodes of a GSOM.

Constructors

 Leaf
   Leafs signal the neighbours of boundary nodes.
 Node
   Actual nodes with a few associated values and a list of neighbouring nodes.
   Fields:
     location :: Coordinates
       Used to uniquely identify nodes. This is also the actual location of the node if the lattice it belongs to is laid out in the two-dimensional plane, and it is used to store the node in the map comprising the lattice.
     quantizationError :: TVar Double
       The quantization error the node has accumulated so far.
     weights :: TVar Input
       The node's weight vector. This is the center of the Voronoi cell the node represents.
     neighbours :: Neighbours
       The node's neighbours.

Instances

 Eq Node

type Nodes = [Node]

type Neighbours = [TVar Node]

A node's neighbours are stored in fields of type `Neighbours`.

type Neighbourhood = [(Int, Node)]

The type of neighbourhoods. Wherever a neighbourhood of a node is needed, this type should be used. A `Neighbourhood` consists of a list of pairs of nodes and their discrete grid distance from the source of the neighbourhood. The source node is the only one with distance `0`, while immediate neighbours get distance `1` and so on.

### Node Creation

`node id weights neighbours` creates a node with the specified parameters and initial quantization error of `0`.

### Modification

`propagate node` propagates the accumulated error of the given `node` to its neighbours.

update :: Input -> Double -> (Int -> Double) -> (Int, Node) -> STM ()

`update input learning_rate kernel neighbour` updates the weights of the `neighbour` according to the formula:

• `weight -> weight + learning_rate * (kernel d) * (input - weight)`

`updateError node input` updates the `quantizationError` of `node`. The new error is just the old error plus the distance of the `node`'s weight vector from `input`.
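
A pure sketch of the update formula above (the library's actual `update` mutates `TVar`s inside STM; the name `updateWeight` and the plain-list representation are illustrative assumptions):

```haskell
-- Move each weight component toward the input, scaled by the
-- learning rate and the kernel factor for grid distance d.
updateWeight :: Double          -- learning rate
             -> (Int -> Double) -- kernel, applied to grid distance
             -> Int             -- grid distance d of this neighbour
             -> [Double]        -- input vector
             -> [Double]        -- current weight vector
             -> [Double]        -- updated weight vector
updateWeight lr kernel d input weight =
  zipWith (\i w -> w + lr * kernel d * (i - w)) input weight
```

With `lr = 1` and a kernel that is constantly `1`, the weight vector jumps straight onto the input; smaller rates move it only a fraction of the way.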

### Querying

`isLeaf node` returns `True` if the given node is a `Leaf` and `False` otherwise.

`isNode node` returns `False` if the given node is a `Leaf` and `True` otherwise.

Calculates the neighbourhood of the given size of the given node. A neighbourhood size of `0` means that only the given node will be an element of the returned set, while a size of `1` will return the given node and its immediate neighbours, and so on. It's not very efficient, so you shouldn't try big neighbourhood sizes. The returned neighbourhood always includes `node`.

`unwrappedNeighbours node` returns the list of neighbours of the given `node`. Note that neighbours is unwrapped, i.e. the returned list has type `Nodes`, not `TVar Nodes`.

## The map type

The lattice type. Since global access to nodes is needed they're stored in a `Data.Map` indexed by their coordinates.

### Creating initial lattices

`newNormalized dimension` creates a new minimal lattice where weights are initialized with all components having the value `0.5` and with the weight vectors having length `dimension`.

newRandom :: RandomGen g => g -> Int -> IO Lattice

`newRandom g dimension` creates a new minimal lattice where weights are randomly initialized with values between 0 and 1 using the random number generator `g` and with the weight vectors having the specified `dimension`.

### Querying a lattice

`bmu input lattice` returns the best matching unit, i.e. the node with minimal distance to the given input vector.

Returns the nodes stored in lattice.
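
Best-matching-unit lookup is just a minimum over distances. The following sketch works on plain coordinate/weight pairs rather than the library's `TVar`-wrapped nodes, and the name `bmuOf` is illustrative:

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Return the entry whose weight vector is closest to the input.
-- Squared distance suffices for comparison, so sqrt is skipped.
bmuOf :: [Double] -> [(coord, [Double])] -> (coord, [Double])
bmuOf input = minimumBy (comparing (dist . snd))
  where dist w = sum $ zipWith (\a b -> (a - b) ^ 2) input w
```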

### Debugging and Output

Dumps the given input vectors to a string which can be fed to gnuplot. Just write the string to a file and write `plot "file"` in gnuplot.

`renderScript c path` expects to be given a `Clustering` `c` having 2-dimensional `center`s and will call `error` if that's not the case. On success it will save a Python script to `path`.py. If this Python script is run, it will in turn save a PDF image to `path`.pdf. The image will contain the graph induced by `c`, with each node (cluster center) positioned according to the cluster's `center` (weight vector). The Python script depends on the `networkx` and `matplotlib` Python packages being installed. I know that this is relatively clunky, but since I haven't found a better way of creating an image of a graph with known node positions, this is the way I chose to go.

# Running GSOM

The GSOM algorithm builds the map by sequentially `run`ning a given number of `Phases` over the input.

## The Phase type

data Phase

This datatype encapsulates all the parameters needed to run one phase of the GSOM algorithm.

Constructors

 Phase
   Fields:
     passes :: Int
       The number of passes which are to be made over the input vectors. Since each step of the phase consumes one input vector, the overall number of steps the phase will have is: `steps = passes * length inputs`
     neighbourhoodSize :: Int
       The initial size of the neighbourhood affected by weight adaption. This decreases linearly while the phase is executed.
     learningRate :: LearningRate
       The function used to calculate the learning rate in each of the phase's steps. During each step, `learningRate currentStep maxSteps` is fed the number (starting from zero) of the current step as the first argument and the total number of steps the phase will have as the second argument, to calculate the learning rate in effect for that step.
     kernel :: Kernel
       The kernel function. It is used in conjunction with the learning rate to adjust weight adaption. `kernel currentNeighbourhoodSize gridDistance` takes the neighbourhood size in effect during the current step and a node's grid distance from the winning node. The neighbourhood size will be a real number due to the linear decrease.
     grow :: Bool
       The `grow` flag determines whether this is a growing phase. If it is `False`, no new nodes will be grown.
     spreadFactor :: Double
       The spread factor is used to calculate the growth threshold according to the formula: `GT = - sqrt(d) * ln(spreadFactor)`, where `d` is the input dimension.
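
The growth-threshold formula from the `spreadFactor` field can be written out directly. A minimal sketch; the name `growthThreshold` is not the library's:

```haskell
-- GT = - sqrt(d) * ln(spreadFactor), with d the input dimension.
-- A spread factor in (0,1) makes ln negative, so GT is positive;
-- smaller spread factors give a larger threshold and less growth.
growthThreshold :: Double -> Int -> Double
growthThreshold spreadFactor d =
  negate $ sqrt (fromIntegral d) * log spreadFactor
```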


data Kernel

Constructors

 Bubble
   The bubble kernel is essentially the identity, i.e. it has no effect.
 Gaussian
   Let `s` be the neighbourhood size currently in effect and `d` be the grid distance of the current node to the winning node; then this kernel calculates a factor to control weight adaption with the following formula: `gaussian s d = exp(-d^2/(2*s^2))`
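
The Gaussian kernel, written out as a plain function. This sketch assumes the conventional negative exponent (so the factor decays with grid distance) and takes `d` as a `Double` for simplicity:

```haskell
-- Gaussian neighbourhood kernel: 1 at the winning node (d = 0),
-- falling off with the squared grid distance.
gaussian :: Double -> Double -> Double
gaussian s d = exp (negate (d ^ 2) / (2 * s ^ 2))
```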

Instances

 Enum Kernel, Eq Kernel, Ord Kernel, Read Kernel, Show Kernel

data LearningRate

Constructors

 Linear Double
   The linear learning-rate reduction function. If you supply it with the initial learning rate `lr`, it uses the following formula, where `step` is the current step the phase is in and `steps` is the overall number of steps the phase will take: `linear lr step steps = lr * (1 - step/steps)`
 InverseAge Double
   The inverse-time learning-rate reduction function. Given an initial learning rate of `lr`, a maximum number of steps of `steps` and the current step number being `step`, the formula is: `inverseAge lr step steps = lr * steps / (steps + 100 * step)`
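
Both schedules are one-liners when written as plain functions of the current step. A sketch, not the library's actual implementation:

```haskell
-- Linear decay: full rate at step 0, zero at the final step.
linear :: Double -> Int -> Int -> Double
linear lr step steps = lr * (1 - fromIntegral step / fromIntegral steps)

-- Inverse-time decay: full rate at step 0, then falling off
-- as steps / (steps + 100 * step).
inverseAge :: Double -> Int -> Int -> Double
inverseAge lr step steps =
  lr * fromIntegral steps / fromIntegral (steps + 100 * step)
```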


### Predefined phases

The default first phase is the only growing phase. It makes 5 passes over the input, uses an initial learning rate of 0.1 and a starting neighbourhood size of 3. The `spreadFactor` is set to 0.1.

The default for the second phase is a smoothing phase making 50 passes over the input vectors with a learning rate of 0.05 and an initial neighbourhood size of 2. Since there is no node growth the `spreadFactor` is ignored and thus set to 0.

The default for the third and last phase is a smoothing phase making 50 passes over the input vectors with a learning rate of 0.01 and an initial neighbourhood size of 1. Since there is no node growth the `spreadFactor` is ignored and thus set to 0.

This is the list of the three default phases of the GSOM algorithm.

### Running a phase/phases

`phase parameters inputs` will update a given lattice by executing one phase of the GSOM algorithm with the given `inputs` and `parameters`.

Since a complete run of the GSOM algorithm means running a number of `Phases`, this is usually the main function used. `run phases lattice inputs` runs the GSOM algorithm by running the `phases` in the order specified, each time making passes over `inputs` and using the produced `Lattice` as an argument to the next phase. The initial `Lattice`, `lattice`, may be constructed with the `newRandom` and the `newCentered` functions.

# Constructing a clustering from the map

## The Clustering type

data Cluster

The clusters generated by GSOM basically consist of three things:

Constructors

 Cluster
   Fields:
     center :: Input
       The vector which best represents all the vectors belonging to this cluster.
     contents :: [Int]
       The indices of the input vectors belonging to this cluster. That means a cluster is always relative to a set of `Inputs`.
     coordinates :: Coordinates
       The coordinates of this cluster.


The final clustering which is the result of the GSOM algorithm is a `Data.Map` mapping `Coordinates` to `Cluster`s.
`cluster inputs clustering` clusters the given `inputs` according to the centers of the clusters in `clustering`. That means for each input `i` from `inputs`, the index of `i` is added to the contents of the cluster whose center has minimal distance to `i`. TODO: Implement tiebreaker.

`clustering lattice` uses the `weights` of the `nodes` stored in `lattice` to generate `cluster`s and returns the `Clustering` storing these `cluster`s. Each non-leaf node `n` in `lattice` corresponds to one `cluster` `c` with `coordinates c = location n` and with `center c` equal to the weight vector of `n`. Each generated `cluster`'s `contents` are empty. Use the `cluster` function with a set of inputs to obtain a clustering where each `Cluster`'s `contents` is a list of the indices of the input points belonging to this `cluster`.

`nearestCluster input clustering` returns the cluster whose center has the smallest distance to `input`.
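
The assignment step that `cluster` is described to perform can be sketched over plain lists of centers. The name `assign` and the list representation are illustrative assumptions; like the library, ties are currently broken by order (here: first center wins):

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Pair each input index with the index of its nearest center.
assign :: [[Double]] -> [[Double]] -> [(Int, Int)]
assign centers inputs =
  [ (i, nearest v) | (i, v) <- zip [0 ..] inputs ]
  where
    nearest v = fst $ minimumBy (comparing (dist v . snd)) (zip [0 ..] centers)
    -- squared Euclidean distance; the sqrt is monotone and can be skipped
    dist a b  = sum $ zipWith (\x y -> (x - y) ^ 2) a b
```

Grouping the resulting pairs by center index yields exactly the `contents` lists described above.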