Portability | non-portable (requires STM) |
---|---|
Stability | experimental |
Maintainer | gnn.github@gmail.com |
This module should contain everything you need to run the GSOM clustering algorithm. It collects and re-exports all important and needed functions from moduls lower in the hirarchy.
Ideally you should never need to look at those modules. If you do need to do this, it is a design failure and I would appreciate it if you would drop me a note.
- type Input = [Double]
- type Inputs = [Input]
- dimension :: Inputs -> Int
- type Bounds = [(Double, Double)]
- bounds :: Inputs -> Bounds
- normalize :: Bounds -> Inputs -> Inputs
- unnormalize :: Bounds -> Inputs -> Inputs
- distance :: Input -> Input -> Double
- (*.) :: Input -> Double -> Input
- (.*) :: Double -> Input -> Input
- (<+>) :: Input -> Input -> Input
- (<->) :: Input -> Input -> Input
- type Coordinates = (Int, Int)
- data Node
- type Nodes = [Node]
- type Neighbours = [TVar Node]
- type Neighbourhood = [(Int, Node)]
- node :: Coordinates -> Input -> Neighbours -> STM Node
- propagate :: Node -> Nodes -> STM ()
- update :: Input -> Double -> (Int -> Double) -> (Int, Node) -> STM ()
- updateError :: Node -> Input -> STM ()
- isLeaf :: Node -> Bool
- isNode :: Node -> Bool
- neighbourhood :: Node -> Int -> STM Neighbourhood
- unwrappedNeighbours :: Node -> STM Nodes
- putNode :: Node -> IO [String]
- type Lattice = Map Coordinates (TVar Node)
- newCentered :: Int -> IO Lattice
- newRandom :: RandomGen g => g -> Int -> IO Lattice
- bmu :: Input -> Lattice -> STM Node
- nodes :: Lattice -> STM Nodes
- putLattice :: Lattice -> IO String
- putWeights :: Lattice -> IO String
- dumpInputs :: Inputs -> String
- renderScript :: Clustering -> String -> IO ()
- data Phase = Phase {
- passes :: Int
- neighbourhoodSize :: Int
- learningRate :: LearningRate
- kernel :: Kernel
- grow :: Bool
- spreadFactor :: Double
- type Phases = [Phase]
- data Kernel
- data LearningRate
- = Linear Double
- | InverseAge Double
- defaultFirst :: Phase
- defaultSecond :: Phase
- defaultThird :: Phase
- defaults :: Phases
- phase :: Phase -> Lattice -> Inputs -> IO Lattice
- run :: Phases -> Lattice -> Inputs -> IO Lattice
- data Cluster = Cluster {
- center :: Input
- contents :: [Int]
- coordinates :: Coordinates
- type Clustering = Map Coordinates Cluster
- cluster :: Inputs -> Clustering -> Clustering
- clustering :: Lattice -> IO Clustering
- nearestCluster :: Input -> Clustering -> Cluster
Algorithm Input
The GSOM algorithm expects a list of input vectors as its input.
These input vectors are just lists of
s.
Double
Input processing
dimension :: Inputs -> IntSource
Calculating the dimension of a given set of inputs just means finding the length of the longest input vector.
type Bounds = [(Double, Double)]Source
The bounds of a list of inputs. Having the tuple (a,b)
at index i
in bounds
means that the value at index i
of each of the input vectors
from the inputs which where used to calculate bounds
is from the
intervall [a,b]
.
normalize :: Bounds -> Inputs -> InputsSource
Normalizes input vectors.
takes the given list of input vectors normalize
inputsinputs
and
returns a list of input vectors where each component is in [0,1]
.
If you want to unnormalize the input vectors use
and
bounds
.
unnormalize
unnormalize :: Bounds -> Inputs -> InputsSource
Unnormalizes the given input vectors inputs
assuming that it's bounds
previously where bounds
.
Operations between input vectors
distance :: Input -> Input -> DoubleSource
calculates the euclidean distance between
distance
i1 i2i1
and i2
. If i1
and i2
have different lengths, excess
values are ignored.
Auxiliary Types
type Coordinates = (Int, Int)Source
The map created by GSOM
The growing self organizing map builds, as its name suggests, a map
representing the input vectors. This map is of type
and is
a network of Lattice
s.
Node
The Node
s of the map
Node
The type of nodes of a gsom.
Leaf | They're either Leafs, signalling neighbours of boundary nodes |
Node | or they are actual nodes with a few associated values and a list of neighbouring nodes. |
|
type Neighbours = [TVar Node]Source
A node's neighbours are stored in fields of type Neighbours
.
type Neighbourhood = [(Int, Node)]Source
The type of neighbourhoods. Wherever a neighbourhood of a node
is neede, this type should be used.
A Neighbourhood
consits of a list of pairs of nodes and their
discrete grid distance from the source of the neighbourhood.
The source node is the only one with distance 0
while immediate
neighbours get distance one and so on.
Node Creation
node :: Coordinates -> Input -> Neighbours -> STM NodeSource
creates a node with the specified
parameters and initial quantization error of node
id weights neighbours0
.
Modification
propagate :: Node -> Nodes -> STM ()Source
propagates the accumulated error of the given propagate
nodenode
to it's neighbours.
update :: Input -> Double -> (Int -> Double) -> (Int, Node) -> STM ()Source
updates
the weights of the update
input learning_rate kernel neighbourneighbour
according to the formula:
weight -> weight + learning_rate * (kernel d) (input - weight)
updateError :: Node -> Input -> STM ()Source
updateError node input
updates the
of quantizationError
node
.
The new error is just the old error plus the distance of the node
's
weight vector from input
.
Querying
neighbourhood :: Node -> Int -> STM NeighbourhoodSource
returns isNode
node
if the given node is a False
and
Leaf
otherwise.
True
Calculates the neighbourhood of the given size of the given node.
A neighbourhood size of 0
means that only the given node will be
an element of the returned set while a size of one will return the
given node and it's immediate neighbours and so on.
It's not very efficient so you shouldn't try big neihbourhood sizes.
The returned neighbourhood always includes node
.
unwrappedNeighbours :: Node -> STM NodesSource
returns the list of neighbours of the
given unwrappedNeighbours
nodenode
.
Note that neighbours is unwrapped, i.e. the returned list hast type
not Nodes
TVar
.
Nodes
Debugging
The map type
type Lattice = Map Coordinates (TVar Node)Source
The lattice type. Since global access to nodes is needed they're
stored in a Data.Map
indexed by their coordinates.
Creating initial lattices
newCentered :: Int -> IO LatticeSource
creates a new minimal lattice where weights
are initialized with all components having the value newNormalized
dimension0.5
the and with
the weight vectors having length dimension
.
newRandom :: RandomGen g => g -> Int -> IO LatticeSource
creates a new minimal lattice where weights are
randomly initialized with values between 0 and 1 using the random number
generator newRandom
g dimensiong
and with the weight vectors having the specified dimension
.
Querying a lattice
bmu :: Input -> Lattice -> STM NodeSource
returns the best matching unit i.e. the node with
minimal distance to the given input vector.
bmu
input lattice
Debugging and Output
putLattice :: Lattice -> IO StringSource
putWeights :: Lattice -> IO StringSource
dumpInputs :: Inputs -> StringSource
Dumps the given input vectors to a string which can be fed to
gnuplot. Just write the string to a file and write plot "file"
in
gnuplot.
renderScript :: Clustering -> String -> IO ()Source
expects to be given a renderScript
c pathClustering
c
having 2 dimensional center
s and will call error
if that's not
the case. On success it will save a python script to path
.py. If
this python script is run it will in turn save a PDF image to
path
.pdf. The image will contain the graph induced by c
with each
node (cluster center) placed positioned according to the c
's
center (weight vector). The python script will depend on the
networkx
and mathplotlib
python packages being installed.
I know that this is relatively clunky, but since I haven't found a
better way of creating an image of a graph with known node positions,
this is the way I chose to go.
Running GSOM
The Phase type
This datatype encapsulates all the parameters needed to be known to run one phase of the GSOM algorithm.
Phase | |
|
Bubble | The bubble kernel is essentially the identity, i.e. it has no effect. |
Gaussian | Let
|
data LearningRate Source
Linear Double | The linear learning rate reduction function. If you supply it with
the initial learning rate
|
InverseAge Double | The inverse time learning rate reduction function. Given an initial
learning rate of
|
Predefined phases
The default first phase is the only growing phase. It makes 5
passes over the input, uses an initial learning rate of 0.1 and
a starting neighbourhood size of 3. The
is set
to 0.1.
spreadFactor
The default for the second phase is a smoothing phase making 50
passes over the input vectors with a learning rate of 0.05 and an
initial neighbourhood size of 2. Since there is no node growth the
is ignored and thus set to 0.
spreadFactor
The default for the third and last phase is a smoothing phase making
50 passes over the input vectors with a learning rate of 0.01 and
an initial neighbourhood size of 1. Since there is no node growth the
is ignored and thus set to 0.
spreadFactor
Running a phase/phases
phase :: Phase -> Lattice -> Inputs -> IO LatticeSource
will update the given phase
parameters inputslattice
by
executing one phase of the GSOM algorithm with the given inputs
and parameters
.
run :: Phases -> Lattice -> Inputs -> IO LatticeSource
Since a complete run of the GSOM algorithm means running a number of
this is usually the main function used.
Phases
run phases lattice inputs
runs the GSOM algorithm by running the
phases
in the order specified, each time making passes over inputs
and using the produced
to as an argument to the next phase.
The initial Lattice
, Lattice
lattice
may be constructed with the
and the newRandom
functions.
newCentered
Constructing a clustering from the map
The Clustering type
The clusters generated by GSOM basically consist of three things:
Cluster | |
|
type Clustering = Map Coordinates ClusterSource
The final clustering which is the result of the GSOM algorithm
is a
mapping Data.Map
to Coordinates
s.
Cluster
Construction and Query
cluster :: Inputs -> Clustering -> ClusteringSource
clusters the given cluster
inputs clusteringinputs
according to
the centers of the clusters in clustering
. That means for each input i
from inputs
the index of i
is added to the contents of the cluster
center to which i
has minimal distance.
TODO: Implement tiebreaker.
clustering :: Lattice -> IO ClusteringSource
Computes a clustering induced by the given lattice.
uses the clustering
lattice
of the weights
stored
in nodes
lattice
to generate cluster
s and returns the
storing these Clustering
cluster
s. Each non leaf node n
in lattice
corresponds to one cluster
c
with (
and with coordinates
c = location
n)
equal to the weight vector of center
cn
. Each
generated cluster
's contents
are empty. Use the cluster
function with a set of inputs to obtain a clustering where each
Cluster
's contents
is a list of the indices of the input
points belonging to this cluster
.
nearestCluster :: Input -> Clustering -> ClusterSource
returns the cluster which has
the center with the smallest distance to nearestCluster
input clusteringinput
.