hsgsom-0.2.0: An implementation of the GSOM clustering algorithm.

Portability non-portable (requires STM) experimental gnn.github@gmail.com

Data.Datamining.Clustering.Gsom

Description

This module should contain everything you need to run the GSOM clustering algorithm. It collects and re-exports all important and needed functions from moduls lower in the hirarchy.

Ideally you should never need to look at those modules. If you do need to do this, it is a design failure and I would appreciate it if you would drop me a note.

Synopsis

# Algorithm Input

The GSOM algorithm expects a list of input vectors as its input. These input vectors are just lists of Doubles.

type Input = [Double]Source

Input vectors are represented as lists of Doubles.

## Input processing

Calculating the dimension of a given set of inputs just means finding the length of the longest input vector.

type Bounds = [(Double, Double)]Source

The bounds of a list of inputs. Having the tuple (a,b) at index i in bounds means that the value at index i of each of the input vectors from the inputs which where used to calculate bounds is from the intervall [a,b].

Calculates the bounds of the input vector components.

Normalizes input vectors. normalize inputs takes the given list of input vectors inputs and returns a list of input vectors where each component is in [0,1]. If you want to unnormalize the input vectors use bounds and unnormalize.

Unnormalizes the given input vectors inputs assuming that it's bounds previously where bounds.

## Operations between input vectors

distance i1 i2 calculates the euclidean distance between i1 and i2. If i1 and i2 have different lengths, excess values are ignored.

Multiplication of an input vector with a scalar.

Subtraction and addition of vectors between each other.

# The map created by GSOM

The growing self organizing map builds, as its name suggests, a map representing the input vectors. This map is of type Lattice and is a network of Nodes.

## The Nodes of the map

data Node Source

The type of nodes of a gsom.

Constructors

 Leaf They're either Leafs, signalling neighbours of boundary nodes Node or they are actual nodes with a few associated values and a list of neighbouring nodes. Fieldslocation :: CoordinatesUsed to uniquely identify nodes. This is also the actual location of the node if the lattice it belongs to is beeing laid out in the two dimensional plane and it is used to store the node in the map comprising the lattice. quantizationError :: TVar DoubleThe quantization error the node has accumulated so far. weights :: TVar InputThe node's weight vector. This is the center of the voronoi cell the node represents. neighbours :: NeighboursThe node's neighbours.

Instances

 Eq Node

type Nodes = [Node]Source

type Neighbours = [TVar Node]Source

A node's neighbours are stored in fields of type Neighbours.

type Neighbourhood = [(Int, Node)]Source

The type of neighbourhoods. Wherever a neighbourhood of a node is neede, this type should be used. A Neighbourhood consits of a list of pairs of nodes and their discrete grid distance from the source of the neighbourhood. The source node is the only one with distance 0 while immediate neighbours get distance one and so on.

### Node Creation

node id weights neighbours creates a node with the specified parameters and initial quantization error of 0.

### Modification

propagate node propagates the accumulated error of the given node to it's neighbours.

update :: Input -> Double -> (Int -> Double) -> (Int, Node) -> STM ()Source

update input learning_rate kernel neighbour updates the weights of the neighbour according to the formula:

• weight -> weight + learning_rate * (kernel d) (input - weight)

updateError node input updates the quantizationError of node. The new error is just the old error plus the distance of the node's weight vector from input.

### Querying

isLeaf node returns True if the given node is a Leaf and False otherwise.

isNode node returns False if the given node is a Leaf and True otherwise.

Calculates the neighbourhood of the given size of the given node. A neighbourhood size of 0 means that only the given node will be an element of the returned set while a size of one will return the given node and it's immediate neighbours and so on. It's not very efficient so you shouldn't try big neihbourhood sizes. The returned neighbourhood always includes node.

unwrappedNeighbours node returns the list of neighbours of the given node. Note that neighbours is unwrapped, i.e. the returned list hast type Nodes not TVar Nodes.

## The map type

The lattice type. Since global access to nodes is needed they're stored in a Data.Map indexed by their coordinates.

### Creating initial lattices

newNormalized dimension creates a new minimal lattice where weights are initialized with all components having the value 0.5 the and with the weight vectors having length dimension.

newRandom :: RandomGen g => g -> Int -> IO LatticeSource

newRandom g dimension creates a new minimal lattice where weights are randomly initialized with values between 0 and 1 using the random number generator g and with the weight vectors having the specified dimension.

### Querying a lattice

bmu input lattice returns the best matching unit i.e. the node with minimal distance to the given input vector.

Returns the nodes stored in lattice.

### Debugging and Output

Dumps the given input vectors to a string which can be fed to gnuplot. Just write the string to a file and write plot "file" in gnuplot.

renderScript c path expects to be given a Clustering c having 2 dimensional centers and will call error if that's not the case. On success it will save a python script to path.py. If this python script is run it will in turn save a PDF image to path.pdf. The image will contain the graph induced by c with each node (cluster center) placed positioned according to the c's center (weight vector). The python script will depend on the networkx and mathplotlib python packages being installed. I know that this is relatively clunky, but since I haven't found a better way of creating an image of a graph with known node positions, this is the way I chose to go.

# Running GSOM

The GSOM algorithm builds the map by sequentially running a given number of Phases over the input.

## The Phase type

data Phase Source

This datatype encapsulates all the parameters needed to be known to run one phase of the GSOM algorithm.

Constructors

 Phase Fieldspasses :: IntThe number of passes which are to be made over the input vectors. Since each step of the phase consumes one input vector, the overall number of steps the phase will have will be: steps = passes * length inputsneighbourhoodSize :: IntThe initial size of the neighbourhood affected by weight adaption. This will decrease linearly while the phase is executed. learningRate :: LearningRateThe function used to calculate the learning rate in each of the phase's steps. During each step learninRate currentStep maxSteps is fed the number (starting from zero) of the current step as the first argument and the total number of steps the phase will have as the second argument to calculate the learning rate which will be in effect for this phase. kernel :: KernelThe kernel function. It is used in conjunction with the learning rate to adjust weight adaption. kernel currentNeighbourhoodsize gridDistance should take the neighbourhood size which is in effect during the current step and a nodes grid distance from the winning node. The neighbourhood size will be a real number due to the linear decrease. grow :: BoolThe grow flag determines whether this is a growing phase or not. If this is False then no new nodes will be grown. spreadFactor :: DoubleThe spread factor is used to calculate the growth threshold according to the formula: GT = - sqrt(d)*ln(spreadFactor)where d is the input dimension.

Instances

data Kernel Source

Constructors

 Bubble The bubble kernel is essentially the identity, i.e. it has no effect. Gaussian Let s be the neighbourhood size currently in effect and d be the grid-distance of the current node to the winning node then this kernel calculates a factor to control weight adaption with the following formula: gaussian s d = exp(d^2/(2*s^2))

Instances

 Enum Kernel Eq Kernel Ord Kernel Read Kernel Show Kernel

Constructors

 Linear Double The linear learning rate reduction function. If you supply it with the initial learning rate lr it uses the following formula where step is the current step the phase is in and steps is the overall number of steps the phase will take: linear lr step steps = lr * (1-step/steps) InverseAge Double The inverse time learning rate reduction function. Given an initial learning rate of lr, a maximum number of steps of steps and the current step number beeing step, the formula is: inverseAge lr step steps = lr * steps / (steps + 100 * step)

Instances

### Predefined phases

The default first phase is the only growing phase. It makes 5 passes over the input, uses an initial learning rate of 0.1 and a starting neighbourhood size of 3. The spreadFactor is set to 0.1.

The default for the second phase is a smoothing phase making 50 passes over the input vectors with a learning rate of 0.05 and an initial neighbourhood size of 2. Since there is no node growth the spreadFactor is ignored and thus set to 0.

The default for the third and last phase is a smoothing phase making 50 passes over the input vectors with a learning rate of 0.01 and an initial neighbourhood size of 1. Since there is no node growth the spreadFactor is ignored and thus set to 0.

This is the list of the three default phases of the GSOM algorithm.

### Running a phase/phases

phase parameters inputs will update the given lattice by executing one phase of the GSOM algorithm with the given inputs and parameters.

Since a complete run of the GSOM algorithm means running a number of Phases this is usually the main function used. run phases lattice inputs runs the GSOM algorithm by running the phases in the order specified, each time making passes over inputs and using the produced Lattice to as an argument to the next phase. The initial Lattice, lattice may be constructed with the newRandom and the newCentered functions.

# Constructing a clustering from the map

## The Clustering type

data Cluster Source

The clusters generated by GSOM basically consist of three things:

Constructors

 Cluster Fieldscenter :: Inputthe vector which best represents all the vectors belonging to this cluster. contents :: [Int]The indices of the input vectors belonging to this cluster. That means a cluster is always relative to a set of Inputs coordinates :: Coordinatesthe coordinates of this cluster

Instances