hsgsom-0.2.0: An implementation of the GSOM clustering algorithm.

Portability: non-portable (requires STM)
Stability: experimental
Maintainer: gnn.github@gmail.com

Data.Datamining.Clustering.Gsom

Contents

Description

This module should contain everything you need to run the GSOM clustering algorithm. It collects and re-exports all important and needed functions from modules lower in the hierarchy.

Ideally you should never need to look at those modules. If you do need to do this, it is a design failure and I would appreciate it if you would drop me a note.

Synopsis

Algorithm Input

The GSOM algorithm expects a list of input vectors as its input. These input vectors are just lists of Doubles.

type Input = [Double]

Input vectors are represented as lists of Doubles.

Input processing

dimension :: Inputs -> Int

Calculating the dimension of a given set of inputs just means finding the length of the longest input vector.

type Bounds = [(Double, Double)]

The bounds of a list of inputs. Having the tuple (a,b) at index i in bounds means that the value at index i of each of the input vectors which were used to calculate bounds lies in the interval [a,b].

bounds :: Inputs -> Bounds

Calculates the bounds of the input vector components.

normalize :: Bounds -> Inputs -> Inputs

Normalizes input vectors. normalize bounds inputs takes the given list of input vectors inputs and returns a list of input vectors where each component is in [0,1]. If you want to unnormalize the input vectors use bounds and unnormalize.

unnormalize :: Bounds -> Inputs -> Inputs

unnormalize bounds inputs unnormalizes the given input vectors inputs assuming that their bounds previously were bounds.
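
As an illustration (not part of the original documentation), the following sketch assumes that Inputs is simply a list of Input vectors and shows how dimension, bounds, normalize and unnormalize fit together:

  -- A small set of two dimensional input vectors.
  inputs :: Inputs
  inputs = [[1, 10], [2, 20], [3, 30]]

  -- dimension inputs == 2
  -- bounds inputs    == [(1, 3), (10, 30)]

  normalized :: Inputs
  normalized = normalize (bounds inputs) inputs
  -- every component of normalized now lies in [0, 1]

  restored :: Inputs
  restored = unnormalize (bounds inputs) normalized
  -- restored recovers the original inputs (up to rounding)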

Operations between input vectors

distance :: Input -> Input -> Double

distance i1 i2 calculates the euclidean distance between i1 and i2. If i1 and i2 have different lengths, excess values are ignored.

(.*) :: Double -> Input -> Input

Multiplication of an input vector with a scalar.

(<+>) :: Input -> Input -> Input

Element-wise addition of input vectors.
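
A brief, hypothetical usage sketch of these operations (values chosen purely for illustration):

  v1, v2 :: Input
  v1 = [0, 0, 0]
  v2 = [3, 4, 0]

  -- Euclidean distance: sqrt (3^2 + 4^2) == 5.0
  d :: Double
  d = distance v1 v2

  -- scaling and addition
  -- 0.5 .* v2  == [1.5, 2.0, 0.0]
  -- v1 <+> v2  == [3.0, 4.0, 0.0]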

Auxiliary Types

The map created by GSOM

The growing self organizing map builds, as its name suggests, a map representing the input vectors. This map is of type Lattice and is a network of Nodes.

The Nodes of the map

data Node

The type of nodes of a GSOM.

Constructors

Leaf

They're either Leafs, signalling neighbours of boundary nodes

Node

or they are actual nodes with a few associated values and a list of neighbouring nodes.

Fields

location :: Coordinates

Used to uniquely identify nodes. This is also the actual location of the node if the lattice it belongs to is being laid out in the two dimensional plane, and it is used to store the node in the map comprising the lattice.

quantizationError :: TVar Double

The quantization error the node has accumulated so far.

weights :: TVar Input

The node's weight vector. This is the center of the voronoi cell the node represents.

neighbours :: Neighbours

The node's neighbours.

Instances

type Nodes = [Node]

type Neighbours = [TVar Node]

A node's neighbours are stored in fields of type Neighbours.

type Neighbourhood = [(Int, Node)]

The type of neighbourhoods. Wherever a neighbourhood of a node is needed, this type should be used. A Neighbourhood consists of a list of pairs of nodes and their discrete grid distance from the source of the neighbourhood. The source node is the only one with distance 0, while immediate neighbours get distance one, and so on.

Node Creation

node :: Coordinates -> Input -> Neighbours -> STM Node

node id weights neighbours creates a node with the specified parameters and initial quantization error of 0.

Modification

propagate :: Node -> Nodes -> STM ()

propagate node propagates the accumulated error of the given node to its neighbours.

update :: Input -> Double -> (Int -> Double) -> (Int, Node) -> STM ()

update input learning_rate kernel neighbour updates the weights of the neighbour according to the formula:

  • weight -> weight + learning_rate * (kernel d) * (input - weight)
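
A plain Haskell sketch of this update rule applied to a single weight vector (this is only an illustration of the formula, not the package's STM implementation; d stands for the neighbour's grid distance):

  -- Pure version of the weight update for one step.
  updateWeights :: Input -> Double -> (Int -> Double) -> Int -> Input -> Input
  updateWeights input learningRate kernel d weight =
    zipWith (\w x -> w + learningRate * kernel d * (x - w)) weight input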

updateError :: Node -> Input -> STM ()

updateError node input updates the quantizationError of node. The new error is just the old error plus the distance of the node's weight vector from input.

Querying

isLeaf :: Node -> Bool

isLeaf node returns True if the given node is a Leaf and False otherwise.

isNode :: Node -> Bool

isNode node returns False if the given node is a Leaf and True otherwise.

neighbourhood :: Node -> Int -> STM Neighbourhood

Calculates the neighbourhood of the given size of the given node. A neighbourhood size of 0 means that only the given node will be an element of the returned set, while a size of one will return the given node and its immediate neighbours, and so on. It's not very efficient, so you shouldn't try big neighbourhood sizes. The returned neighbourhood always includes node.

unwrappedNeighbours :: Node -> STM Nodes

unwrappedNeighbours node returns the list of neighbours of the given node. Note that neighbours is unwrapped, i.e. the returned list has type Nodes, not TVar Nodes.


The map type

type Lattice = Map Coordinates (TVar Node)

The lattice type. Since global access to nodes is needed they're stored in a Data.Map indexed by their coordinates.

Creating initial lattices

newCentered :: Int -> IO Lattice

newCentered dimension creates a new minimal lattice where weights are initialized with all components having the value 0.5 and with the weight vectors having length dimension.

newRandom :: RandomGen g => g -> Int -> IO Lattice

newRandom g dimension creates a new minimal lattice where weights are randomly initialized with values between 0 and 1 using the random number generator g and with the weight vectors having the specified dimension.
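
For example (a hypothetical snippet; the seed 42 is arbitrary and mkStdGen comes from System.Random):

  import System.Random (mkStdGen)

  -- Create a minimal lattice with 3 dimensional, randomly initialised weights.
  makeLattice :: IO Lattice
  makeLattice = newRandom (mkStdGen 42) 3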

Querying a lattice

bmu :: Input -> Lattice -> STM Node

bmu input lattice returns the best matching unit i.e. the node with minimal distance to the given input vector.

nodes :: Lattice -> STM Nodes

Returns the nodes stored in lattice.

Debugging and Output

dumpInputs :: Inputs -> String

Dumps the given input vectors to a string which can be fed to gnuplot. Just write the string to a file and write plot "file" in gnuplot.
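
For example, one might write the dump to a file and plot it like this (the file name is arbitrary):

  plotInputs :: Inputs -> IO ()
  plotInputs is = writeFile "inputs.dat" (dumpInputs is)
  -- then, inside gnuplot:  plot "inputs.dat"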

renderScript :: Clustering -> String -> IO ()

renderScript c path expects to be given a Clustering c having 2 dimensional centers and will call error if that's not the case. On success it will save a python script to path.py. If this python script is run it will in turn save a PDF image to path.pdf. The image will contain the graph induced by c with each node (cluster center) positioned according to c's center (weight vector). The python script will depend on the networkx and matplotlib python packages being installed. I know that this is relatively clunky, but since I haven't found a better way of creating an image of a graph with known node positions, this is the way I chose to go.

Running GSOM

The GSOM algorithm builds the map by sequentially running a given number of Phases over the input.

The Phase type

data Phase

This datatype encapsulates all the parameters needed to run one phase of the GSOM algorithm.

Constructors

Phase 

Fields

passes :: Int

The number of passes which are to be made over the input vectors. Since each step of the phase consumes one input vector, the overall number of steps the phase will have will be:

  • steps = passes * length inputs

neighbourhoodSize :: Int

The initial size of the neighbourhood affected by weight adaption. This will decrease linearly while the phase is executed.

learningRate :: LearningRate

The function used to calculate the learning rate in each of the phase's steps. During each step, learningRate currentStep maxSteps is fed the number of the current step (starting from zero) as its first argument and the total number of steps the phase will have as its second argument, and calculates the learning rate which will be in effect for this step.

kernel :: Kernel

The kernel function. It is used in conjunction with the learning rate to adjust weight adaption. kernel currentNeighbourhoodSize gridDistance should take the neighbourhood size which is in effect during the current step and a node's grid distance from the winning node. The neighbourhood size will be a real number due to the linear decrease.

grow :: Bool

The grow flag determines whether this is a growing phase or not. If this is False then no new nodes will be grown.

spreadFactor :: Double

The spread factor is used to calculate the growth threshold according to the formula:

  • GT = - d * ln spreadFactor

where d is the input dimension.

Instances

data Kernel

Constructors

Bubble

The bubble kernel is essentially the identity, i.e. it has no effect.

Gaussian

Let s be the neighbourhood size currently in effect and d be the grid-distance of the current node to the winning node then this kernel calculates a factor to control weight adaption with the following formula:

  • gaussian s d = exp(-d^2/(2*s^2))
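
Restating this kernel as a standalone function (illustrative only; s is the current neighbourhood size, d the grid distance):

  gaussianKernel :: Double -> Int -> Double
  gaussianKernel s d = exp (negate (fromIntegral d ^ 2) / (2 * s ^ 2))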

data LearningRate

Constructors

Linear Double

The linear learning rate reduction function. If you supply it with the initial learning rate lr it uses the following formula where step is the current step the phase is in and steps is the overall number of steps the phase will take:

  • linear lr step steps = lr * (1-step/steps)

InverseAge Double

The inverse time learning rate reduction function. Given an initial learning rate of lr, a maximum number of steps of steps and the current step number being step, the formula is:

  • inverseAge lr step steps = lr * steps / (steps + 100 * step)
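
Restating both reduction schemes as plain functions (illustrative sketches only; step and steps are counted as described above):

  linear :: Double -> Int -> Int -> Double
  linear lr step steps =
    lr * (1 - fromIntegral step / fromIntegral steps)

  inverseAge :: Double -> Int -> Int -> Double
  inverseAge lr step steps =
    lr * fromIntegral steps / (fromIntegral steps + 100 * fromIntegral step)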

Predefined phases

defaultFirst :: Phase

The default first phase is the only growing phase. It makes 5 passes over the input, uses an initial learning rate of 0.1 and a starting neighbourhood size of 3. The spreadFactor is set to 0.1.

defaultSecond :: Phase

The default for the second phase is a smoothing phase making 50 passes over the input vectors with a learning rate of 0.05 and an initial neighbourhood size of 2. Since there is no node growth the spreadFactor is ignored and thus set to 0.

defaultThird :: Phase

The default for the third and last phase is a smoothing phase making 50 passes over the input vectors with a learning rate of 0.01 and an initial neighbourhood size of 1. Since there is no node growth the spreadFactor is ignored and thus set to 0.

defaults :: Phases

This is the list of the three default phases of the GSOM algorithm.
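
Since Phase is a record, individual parameters of the default phases can be overridden with record update syntax, for example (the chosen values are arbitrary, and Phases is assumed to be a list of Phase values):

  myFirst :: Phase
  myFirst = defaultFirst { passes = 10, spreadFactor = 0.5 }

  myPhases :: Phases
  myPhases = [myFirst, defaultSecond, defaultThird]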

Running a phase/phases

phase :: Phase -> Lattice -> Inputs -> IO Lattice

phase parameters lattice inputs will update the given lattice by executing one phase of the GSOM algorithm with the given inputs and parameters.

run :: Phases -> Lattice -> Inputs -> IO Lattice

Since a complete run of the GSOM algorithm means running a number of Phases, this is usually the main function used. run phases lattice inputs runs the GSOM algorithm by running the phases in the order specified, each time making passes over inputs and using the produced Lattice as an argument to the next phase. The initial Lattice, lattice, may be constructed with the newRandom and the newCentered functions.
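
Putting the pieces together, a typical end to end run might look like the following sketch (the input data is only a placeholder):

  import Data.Datamining.Clustering.Gsom

  main :: IO ()
  main = do
    let raw        = [[1, 10], [2, 20], [3, 30], [4, 40]] :: Inputs
        normalized = normalize (bounds raw) raw
    -- start from a minimal lattice with centered weights
    lattice <- newCentered (dimension normalized)
    -- run the three default phases over the normalized inputs
    trained <- run defaults lattice normalized
    -- turn the trained lattice into a clustering and assign the inputs to clusters
    cs      <- clustering trained
    let assigned = cluster normalized cs
    mapM_ (print . contents) assigned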

Constructing a clustering from the map

The Clustering type

data Cluster

The clusters generated by GSOM basically consist of three things:

Constructors

Cluster 

Fields

center :: Input

The vector which best represents all the vectors belonging to this cluster.

contents :: [Int]

The indices of the input vectors belonging to this cluster. That means a cluster is always relative to a set of Inputs.

coordinates :: Coordinates

The coordinates of this cluster.

Instances

type Clustering = Map Coordinates Cluster

The final clustering which is the result of the GSOM algorithm is a Data.Map mapping Coordinates to Clusters.

Construction and Query

cluster :: Inputs -> Clustering -> Clustering

cluster inputs clustering clusters the given inputs according to the centers of the clusters in clustering. That means for each input i from inputs the index of i is added to the contents of the cluster center to which i has minimal distance. TODO: Implement tiebreaker.

clustering :: Lattice -> IO Clustering

Computes a clustering induced by the given lattice.

clustering lattice uses the weights of the nodes stored in lattice to generate clusters and returns the Clustering storing these clusters. Each non-leaf node n in lattice corresponds to one cluster c with (coordinates c = location n) and with center c equal to the weight vector of n. Each generated cluster's contents are empty. Use the cluster function with a set of inputs to obtain a clustering where each Cluster's contents is a list of the indices of the input points belonging to this cluster.

nearestCluster :: Input -> Clustering -> Cluster

nearestCluster input clustering returns the cluster which has the center with the smallest distance to input.
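
For instance (a hypothetical helper built only from the functions above), a previously unseen point can be assigned to its closest cluster like this:

  classify :: Clustering -> Input -> Coordinates
  classify cs point = coordinates (nearestCluster point cs)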