| Portability | non-portable (requires STM) |
|---|---|
| Stability | experimental |
| Maintainer | gnn.github@gmail.com |
Data.Datamining.Clustering.Gsom
Description
This module should contain everything you need to run the GSOM clustering algorithm. It collects and re-exports all important and needed functions from moduls lower in the hirarchy.
Ideally you should never need to look at those modules. If you do need to do this, it is a design failure and I would appreciate it if you would drop me a note.
- type Input = [Double]
- type Inputs = [Input]
- dimension :: Inputs -> Int
- type Bounds = [(Double, Double)]
- bounds :: Inputs -> Bounds
- normalize :: Bounds -> Inputs -> Inputs
- unnormalize :: Bounds -> Inputs -> Inputs
- distance :: Input -> Input -> Double
- (*.) :: Input -> Double -> Input
- (.*) :: Double -> Input -> Input
- (<+>) :: Input -> Input -> Input
- (<->) :: Input -> Input -> Input
- type Coordinates = (Int, Int)
- data Node
- type Nodes = [Node]
- type Neighbours = [TVar Node]
- type Neighbourhood = [(Int, Node)]
- node :: Coordinates -> Input -> Neighbours -> STM Node
- propagate :: Node -> Nodes -> STM ()
- update :: Input -> Double -> (Int -> Double) -> (Int, Node) -> STM ()
- updateError :: Node -> Input -> STM ()
- isLeaf :: Node -> Bool
- isNode :: Node -> Bool
- neighbourhood :: Node -> Int -> STM Neighbourhood
- unwrappedNeighbours :: Node -> STM Nodes
- putNode :: Node -> IO [String]
- type Lattice = Map Coordinates (TVar Node)
- newCentered :: Int -> IO Lattice
- newRandom :: RandomGen g => g -> Int -> IO Lattice
- bmu :: Input -> Lattice -> STM Node
- nodes :: Lattice -> STM Nodes
- putLattice :: Lattice -> IO String
- putWeights :: Lattice -> IO String
- dumpInputs :: Inputs -> String
- renderScript :: Clustering -> String -> IO ()
- data Phase = Phase {
- passes :: Int
- neighbourhoodSize :: Int
- learningRate :: LearningRate
- kernel :: Kernel
- grow :: Bool
- spreadFactor :: Double
- type Phases = [Phase]
- data Kernel
- data LearningRate
- = Linear Double
- | InverseAge Double
- defaultFirst :: Phase
- defaultSecond :: Phase
- defaultThird :: Phase
- defaults :: Phases
- phase :: Phase -> Lattice -> Inputs -> IO Lattice
- run :: Phases -> Lattice -> Inputs -> IO Lattice
- data Cluster = Cluster {
- center :: Input
- contents :: [Int]
- coordinates :: Coordinates
- type Clustering = Map Coordinates Cluster
- cluster :: Inputs -> Clustering -> Clustering
- clustering :: Lattice -> IO Clustering
- nearestCluster :: Input -> Clustering -> Cluster
Algorithm Input
The GSOM algorithm expects a list of input vectors as its input.
These input vectors are just lists of s.
Double
Input processing
dimension :: Inputs -> IntSource
Calculating the dimension of a given set of inputs just means finding the length of the longest input vector.
type Bounds = [(Double, Double)]Source
The bounds of a list of inputs. Having the tuple (a,b) at index i
in bounds means that the value at index i of each of the input vectors
from the inputs which where used to calculate bounds is from the
intervall [a,b].
normalize :: Bounds -> Inputs -> InputsSource
Normalizes input vectors.
takes the given list of input vectors normalize inputsinputs and
returns a list of input vectors where each component is in [0,1].
If you want to unnormalize the input vectors use and
bounds.
unnormalize
unnormalize :: Bounds -> Inputs -> InputsSource
Unnormalizes the given input vectors inputs assuming that it's bounds
previously where bounds.
Operations between input vectors
distance :: Input -> Input -> DoubleSource
calculates the euclidean distance between
distance i1 i2i1 and i2. If i1 and i2 have different lengths, excess
values are ignored.
Auxiliary Types
type Coordinates = (Int, Int)Source
The map created by GSOM
The growing self organizing map builds, as its name suggests, a map
representing the input vectors. This map is of type and is
a network of Lattices.
Node
The Nodes of the map
NodeThe type of nodes of a gsom.
Constructors
| Leaf | They're either Leafs, signalling neighbours of boundary nodes |
| Node | or they are actual nodes with a few associated values and a list of neighbouring nodes. |
Fields
| |
type Neighbours = [TVar Node]Source
A node's neighbours are stored in fields of type Neighbours.
type Neighbourhood = [(Int, Node)]Source
The type of neighbourhoods. Wherever a neighbourhood of a node
is neede, this type should be used.
A Neighbourhood consits of a list of pairs of nodes and their
discrete grid distance from the source of the neighbourhood.
The source node is the only one with distance 0 while immediate
neighbours get distance one and so on.
Node Creation
node :: Coordinates -> Input -> Neighbours -> STM NodeSource
creates a node with the specified
parameters and initial quantization error of node id weights neighbours0.
Modification
propagate :: Node -> Nodes -> STM ()Source
propagates the accumulated error of the given propagate nodenode
to it's neighbours.
update :: Input -> Double -> (Int -> Double) -> (Int, Node) -> STM ()Source
updates
the weights of the update input learning_rate kernel neighbourneighbour according to the formula:
weight -> weight + learning_rate * (kernel d) (input - weight)
updateError :: Node -> Input -> STM ()Source
updateError node input updates the of quantizationErrornode.
The new error is just the old error plus the distance of the node's
weight vector from input.
Querying
neighbourhood :: Node -> Int -> STM NeighbourhoodSource
returns isNode node if the given node is a False and
Leaf otherwise.
True
Calculates the neighbourhood of the given size of the given node.
A neighbourhood size of 0 means that only the given node will be
an element of the returned set while a size of one will return the
given node and it's immediate neighbours and so on.
It's not very efficient so you shouldn't try big neihbourhood sizes.
The returned neighbourhood always includes node.
unwrappedNeighbours :: Node -> STM NodesSource
returns the list of neighbours of the
given unwrappedNeighbours nodenode.
Note that neighbours is unwrapped, i.e. the returned list hast type
not NodesTVar .
Nodes
Debugging
The map type
type Lattice = Map Coordinates (TVar Node)Source
The lattice type. Since global access to nodes is needed they're
stored in a Data.Map indexed by their coordinates.
Creating initial lattices
newCentered :: Int -> IO LatticeSource
creates a new minimal lattice where weights
are initialized with all components having the value newNormalized dimension0.5 the and with
the weight vectors having length dimension.
newRandom :: RandomGen g => g -> Int -> IO LatticeSource
creates a new minimal lattice where weights are
randomly initialized with values between 0 and 1 using the random number
generator newRandom g dimensiong and with the weight vectors having the specified dimension.
Querying a lattice
bmu :: Input -> Lattice -> STM NodeSource
returns the best matching unit i.e. the node with
minimal distance to the given input vector.
bmu input lattice
Debugging and Output
putLattice :: Lattice -> IO StringSource
putWeights :: Lattice -> IO StringSource
dumpInputs :: Inputs -> StringSource
Dumps the given input vectors to a string which can be fed to
gnuplot. Just write the string to a file and write plot "file" in
gnuplot.
renderScript :: Clustering -> String -> IO ()Source
expects to be given a renderScript c pathClustering c
having 2 dimensional centers and will call error if that's not
the case. On success it will save a python script to path.py. If
this python script is run it will in turn save a PDF image to
path.pdf. The image will contain the graph induced by c with each
node (cluster center) placed positioned according to the c's
center (weight vector). The python script will depend on the
networkx and mathplotlib python packages being installed.
I know that this is relatively clunky, but since I haven't found a
better way of creating an image of a graph with known node positions,
this is the way I chose to go.
Running GSOM
The Phase type
This datatype encapsulates all the parameters needed to be known to run one phase of the GSOM algorithm.
Constructors
| Phase | |
Fields
| |
Constructors
| Bubble | The bubble kernel is essentially the identity, i.e. it has no effect. |
| Gaussian | Let
|
data LearningRate Source
Constructors
| Linear Double | The linear learning rate reduction function. If you supply it with
the initial learning rate
|
| InverseAge Double | The inverse time learning rate reduction function. Given an initial
learning rate of
|
Instances
Predefined phases
The default first phase is the only growing phase. It makes 5
passes over the input, uses an initial learning rate of 0.1 and
a starting neighbourhood size of 3. The is set
to 0.1.
spreadFactor
The default for the second phase is a smoothing phase making 50
passes over the input vectors with a learning rate of 0.05 and an
initial neighbourhood size of 2. Since there is no node growth the
is ignored and thus set to 0.
spreadFactor
The default for the third and last phase is a smoothing phase making
50 passes over the input vectors with a learning rate of 0.01 and
an initial neighbourhood size of 1. Since there is no node growth the
is ignored and thus set to 0.
spreadFactor
Running a phase/phases
phase :: Phase -> Lattice -> Inputs -> IO LatticeSource
will update the given phase parameters inputslattice by
executing one phase of the GSOM algorithm with the given inputs
and parameters.
run :: Phases -> Lattice -> Inputs -> IO LatticeSource
Since a complete run of the GSOM algorithm means running a number of
this is usually the main function used.
Phasesrun phases lattice inputs runs the GSOM algorithm by running the
phases in the order specified, each time making passes over inputs
and using the produced to as an argument to the next phase.
The initial Lattice, Latticelattice may be constructed with the
and the newRandom functions.
newCentered
Constructing a clustering from the map
The Clustering type
The clusters generated by GSOM basically consist of three things:
Constructors
| Cluster | |
Fields
| |
type Clustering = Map Coordinates ClusterSource
The final clustering which is the result of the GSOM algorithm
is a mapping Data.Map to Coordinatess.
Cluster
Construction and Query
cluster :: Inputs -> Clustering -> ClusteringSource
clusters the given cluster inputs clusteringinputs according to
the centers of the clusters in clustering. That means for each input i
from inputs the index of i is added to the contents of the cluster
center to which i has minimal distance.
TODO: Implement tiebreaker.
clustering :: Lattice -> IO ClusteringSource
Computes a clustering induced by the given lattice.
uses the clustering lattice of the weights stored
in nodeslattice to generate clusters and returns the
storing these Clusteringclusters. Each non leaf node n in lattice
corresponds to one cluster c with ( and with coordinates c = location
n) equal to the weight vector of center cn. Each
generated cluster's contents are empty. Use the cluster
function with a set of inputs to obtain a clustering where each
Cluster's contents is a list of the indices of the input
points belonging to this cluster.
nearestCluster :: Input -> Clustering -> ClusterSource
returns the cluster which has
the center with the smallest distance to nearestCluster input clusteringinput.