DD2437 Lab assignment 2 Radial basis functions, competitive learning and self-organisation solution

$25.00

Original Work ?
Category: You will Instantly receive a download link for .ZIP solution file upon Payment

Description

5/5 - (1 vote)

1 Introduction

In this lab the focus will be on unsupervised neural network approaches that
involve competitive learning and the related concept of self-organisation. In this
context, you will also experiment with Radial-Basis Function (RBF) networks,
which incorporate both unsupervised and supervised learning to address classi-
cation and regression tasks.

The second part of the lab is devoted to the most famous representative of self-organising NNs – Kohonen maps, commonly referred
to self-organising maps (SOMs).

These networks map points in the input space
to points in the output space with the preservation of the topology, which means
that points which are close in the input space (usually high-dimensional) should
also be close in the output space (usually low-dimensional, i.e. 1D, 2D or 3D).

These networks can be used to help visualise high-dimensional data by nding
suitable low-dimensional manifolds or perform clustering in high-dimensional
spaces. In both cases the objective is to organise and present complex data in
an intuitive visual form understandable for humans.

1.1 Aims and objectives
When you successfully complete the assignment you should:
• know how to build the structure and perform training of an RBF network
for either classication or regression purposes

• be able to comparatively analyse dierent methods for initialising the
structure and learning the weights in an RBF network
• know the concept of vector quantisation and learn how to use it in NN
context

• be able to recognise and implement dierent components in the SOM
algorithm
• be able to discuss the role of the neighbourhood and analyse its eect on
the self-organisation in SOMs
• know how SOM-networks can be used to fold high-dimensional spaces and
cluster data

1.2 Scope

In this exercise you will implement the core algorithm of SOM and use it for
three dierent tasks. The rst is to order objects (animals) in a sequential order
according to their attributes. The second is to nd a circular tour which passes
ten prescribed points in the plane. The third is to make a two-dimensional map
over voting behaviour of members of the swedish parliament. In all three cases
the algorithm is supposed to nd a low-dimensional representation of higherdimensional data.

2 Background

2.1 Radial-basis function networks
In this experiment you will use a set of RBFs to approximate some simple
functions of one variable. The network is shown in g 1.
ϕ ( ) x
i
wi
Input Hidden RBFs Output
x f(x)
Σw ϕ ( ) x
ii

Figure 1: A simple RBF network which is used for one dimensional function approximation. The weights between the RBFs and the output layer may be trained in several
ways. Two methods which are used in this exercise is the method of least squares and
the delta rule.

We will use Gaussian RBFs, i.e. the units in the hidden layer implement the
following transfer function:
φi(x) = e

−(x−µi
)
2
2σ2
i

(1)
2
where µi
is the position for unit i and σ
2
i
is its variance. The output layer
calculates the weighted sum of the n hidden layer units:
ˆf(x) = Xn
i
wiφi(x) (2)
where ˆf(x) is an approximation of the desired function.

The units in the hidden layer are called radial-basis functions since they
work as a basis in which the function ˆf(x) can be expressed.1 The units are
often radially symmetric as is the case in (1).

Look again at gure 1. Note how the set of RBFs maps each pattern in the
input space to an n-dimensional vector. n is usually higher than the dimension
of the input space. Patterns belonging to dierent classes in a classication
task are usually easier to separate in the higher-dimensional space of the hidden
layer than in the input space. In fact, two sets of patterns which are not linearly
separable in the input space can be made linearly separable in the space of the
hidden layer.

2.1.1 Computing the weight matrix
The learning algorithm should nd a set of weights wi so that ˆf(x) is a good
approximation of f(x), i.e. we want to nd weights which minimize the total
approximation error summed over all N patterns used as training examples:
total error =
X
N
k
(
ˆf(xk) − f(xk))2
(3)

We begin by dening fk = f(xk), where f(·) is the target function and xk is the
kth pattern, and write a linear equation system with one row per pattern and
where each row states equation (2) for a particular pattern:
φ1(x1)w1 + φ2(x1)w2 + · · · + φn(x1)wn = f1
φ1(x2)w1 + φ2(x2)w2 + · · · + φn(x2)wn = f2
.
.
.
φ1(xk)w1 + φ2(xk)w2 + · · · + φn(xk)wn = fk
.
.
.
φ1(xN )w1 + φ2(xN )w2 + · · · + φn(xN )wn = fN
(4)

If N > n, the system is overdetermined so we cannot use Gaussian elimination
directly to solve for w. In fact, in practice there is no exact solution to (4).

Please, reect in this context over the following questions:
• What is the lower bound for the number of training examples, N?
• What happens with the error if N = n? Why?
• Under what conditions, if any, does (4) have a solution in this case?
1Note that fˆ(x) in (2) is a linear combination of radial-basis functions, just as a vector in
a two-dimensional Cartesian space is a linear combination of the basis vectors e¯x and e¯y.

• During training we use an error measure dened over the training examples. Is it good to use this measure when evaluating the performance of
the network? Explain!

Below we will look at two methods for determining the weights wi
, batch
learning using least squares and sequential (incremental, on-line) learning using
the delta rule. In both cases, the objective is to minimise (3).
Least squares We can write (4) as
Φw = f (5)
where
Φ =


φ1(x1) φ2(x1) . . . φn(x1)
φ1(x2) φ2(x2) . . . φn(x2)
.
.
.
.
.
.
.
.
.
.
.
.
φ1(xN ) φ2(xN ) . . . φn(xN )


and w =


w1
w2
.
.
.
wn


(6)
Our error function (3) becomes
total error = kΦw − fk
2
(7)

According to standard textbooks in linear algebra and numerical analysis, we
obtain the w which minimizes (7) by solving the system
Φ>Φw = Φ>f (8)
for w. (8) are the normal equations. The result is called the least squares
solution of (5).

The delta rule It is not always that all sample patterns from the input space
are accessible simultaneously. It is, on the contrary, a rather common situation
that a neural network is operating on a continuous stream of data, to which
it needs to adapt. Using linear algebra, it is possible to extend the technique
described above to the case where the network needs to respond and adapt to
newly incoming data at each time-step.2

We will instead derive the delta rule for RBF network, which is an application
of the gradient descent method to neural networks, as you should know by now.

The derivation follows the same line of reasoning as for the perceptron networks
in Lab 1. Here we emphasise the inceremental nature of learning (unlike focus on
batch in the discussion of delta and generalised delta rules in Lab 1). Therefore,
in this new context we assume that we may not have a xed set of input patterns,
so we approximate the error criterion (3) using the instantaneous error,
ˆξ , as
an estimate for the expected error, ξ, given the most recent pattern sample xk:
ξ ≈ ˆξ =
1
2
(f(xk) − ˆf(xk))2 =
1
2
e
2
(9)
2C.f. recursive least squares and the Kalman lter.

We want ˆξ −→ 0 as t −→ ∞, and we want it to go as fast as possible.
Therefore, for each time step, we take a step ∆w in the direction where the
error surface ξˆ(w) is steepest, i.e. in the negative gradient of the error surface:
∆w = −η∇w
ˆξ
= −η
1
2
∇w(f(xk) − Φ(xk)
>w)
2
= η(f(xk) − Φ(xk)
>w)Φ(xk)
= ηeΦ(xk)
(10)
where
Φ(xk) =


φ1(xk)
φ2(xk)
.
.
.
φn(xk)



As you know, equation (10) is the delta rule and η is the learning rate constant.
Ideally, η is large enough so that we get to the optimum (the least squares
solution) in one step. However, since ˆξ only is an approximation and since
the data contains noise, a too large η could cause us to miss the optimum and
instead add error to w. This can give rise to oscillations.

2.2 The SOM algorithm
The basic algorithm is fairly simple. For each training example:
1. Calculate the similarity between the input pattern and the weights arriving
at each output node.

2. Find the most similar node; often referred to as the winner.

3. Select a set of output nodes which are located close to the winner in the
output grid. This is called the neighbourhood.

4. Update the weights of all nodes in the neighbourhood such that their
weights are moved closer to the input pattern.

We will now go through these steps in somewhat more detail.

Measuring similarity Similarity is normally measured by calculating the
Euclidean distance between the input pattern and the weight vector. Note that
this implies that the weights describe the centers of nodes in the input space,
i.e. they are not multiplied with the input as in perceptron. If we have the
input pattern x¯ and the i’th node has a weight vector w¯i
, then the distance for
that node is
di = ||x¯ − w¯i
|| =
q
(¯x − w¯i)
T · (¯x − w¯i)

We only need to nd out which output node has the minimal distance to the
input pattern. The actual distance values are not interesting, therefore we can
skip the square root operation to save some compute time. The same node will
still be the winner.

Neighbourhood The neighbourhood denes the set of nodes close enough to
the winner to get the privilige of having its weights updated. It is important not
to confuse the distances in the previous step with the distances in the neighbourhood calculation.

Earlier we talked about distances or rather measuring
similarity in the input space, while the neighbourhood is dened in terms of
the output space. The nodes are arranged in a grid. When a winning node has
been found, the neighbourhood constitutes the surrounding nodes in this grid.

In a one-dimensional grid, the neighbors are simply the nodes where the index
diers less than a prescribed integer number. In the two-dimensional case, is is
normally sucient to use a so called Manhattan distance, i.e. to add the absolute values of the index dierences in row and column directions.

Sometimes
it may be easier to describe it in terms of a Gaussian function (in the discrete
space of indices).

One important consideration is how large the neighbourhood should be. The
best strategy is normally to start o with a rather large neighbourhood and gradually making it smaller.

The large neighbourhood at the beginning is necessary
to get an overall organization while the small neighbourhood at the end makes
the detailed positioning correct. Often some trial-and-error experimenting is
needed to nd a good strategy for handling the neighbourhood sizes.

In one of the tasks in this exercise you will need a circular one-dimensional
neighbourhood. This means that nodes in one end of the vector of nodes are close
to those in the other end. This can be achieved by modifying how dierences
between indices are calculated.

Weight modication Only the nodes suciently close to the winner, i.e. the
neighbours, will have their weights updated. The update rule is very simple:
the weight vector is simply moved a bit closer to the input pattern:
w¯i ← w¯i + η(¯x − w¯i)
where η is the step size. A reasonable step size in these tasks is η = 0.2.

The algorithm is so imple that you are asked here to write all the implementation by yourselves from scratch without relying on any dedicated SOM
libraries.

Presentation of the result After learning, the weight vectors represent the
positions in the input space where the output nodes give maximal response.

However, this is normally not what we want to show. It is often much more
interesting to see where dierent input patterns end up in the output grid. In
these examples we will use the training patterns also for probing the resulting
network.

Thus, we loop through the input patterns once more, but this time
we only calculate the winning node. Depending on the type of data, dierent
techniques can then be used to present the mapping from pattern to grid index.

In the rst example you will sort the patterns in order of winner indices; in the
last example you will use color to visualize where dierent input patterns end
up.

3 Assignment – Part I

Important notes:
• Please execute all the tasks below and report them in your short reports
where you focus on key points. For presentation of your results in lab
sessions please choose the key ndings so that you could tell a short but
coherent and insightful story.

• Most tasks are formulated with a great level of detail whereas the more
advanced ones allow certain room for your own assumptions, decisions.
Please be clear about these assumptions even if they are arbitrary (you
have full freedom to make these simplifying decisions unless it is clearly
stated that you should not.)

• Most developments in this part of the lab assignment should be implemented from scratch wthout using any dedicated libraries for neural networks (except for a comparative analysis with a multi-layer perceptron).

The computations are relatively easy to code and light to run. If you get
stuck I see a possibility of relying on the existing functions for competitive
learning (CL).

3.1 Batch mode training using least squares – supervised
learning of network weights
In this simple assignment, you should focus on supervised learning of weights
of the RBF network built to address a simple regression problem.

Please implement both batch and incremental learning algorithms (you will need delta
rule for incremental learning in the next task when noise is introduced) from
scratch without using dedicated NN toolboxes. The two function to approximare are sin(2x) and square(2x) (square is a rectangular curve serving as a
“box” envelope for the sine wave).

Note that the input space is R, so each
pattern x1, x2, . . . , xN in (6) is in fact a scalar. Begin by creating a column vector containing the points (patterns) where you want to evaluate your function.

Let’s limit the regression to the interval [0, 2π]. Sample this interval starting
from 0 with the step size of 0.1 and calculate the values of the two functions
at the these points to obtain the corresponding training sets. The testing sets
could be generated analogously with sampling starting from 0.05 and the same
step size.

For a varying number of RBF nodes, please place them by hand in
the input space according to your judgement, and set the same variance to each
node. Next, apply your batch learning algorihtm on your training set to adjust
the output weights and test accordingly on the hold-out set.

For both functions
(studied indpenedently), please consider and discuss the following issues (which
involve running the suggested experiments):
• Try to vary the number of units to get the absolute residual error below 0.1,
0.01 and 0.001 in the residual value (absolute residual error is understood
as the average absolute dierence between the network outputs and the
desirable target values). Please discuss the results, how many units are
needed for the aforementioned error thresholds?

• How can you simply transform the output of your RBF network to reduce the residual error to 0 for the square(2x) problem? Still, how many
units do you need? In what type of applications could this transform be
particularly useful?

3.2 Regression with noise
Now please add zero-mean Gaussian noise (with the variance of 0.1) to both
training and testing datasets for the function sin(2x). Apply batch learning, as
in the previous part of the assignment,and compare with on-line learning with
delta rule (given that the data points are randomly shued in each epoch). As
before, please place the RBF units by hand wherever you consider suitable and
x the same width for them.

Analogously, iterate over a reasonable number of
RBF nodes and nd the overall network congurations (independently for the
two training modes) that result in the absolute residual error below 0.1, 0.01
and 0.001 on the testing datasets.

Please consider the following points and make
the requsted analyses:
• Compare the two learning approaches in terms of the number of epochs
and the number of nodes needed to obtain the requested performance
levels. How does it compare, particularly with respect to the number of
RBF nodes, to the batch mode training of sin(2x) without noise, as in the
previous assignment task? Compare the rate of convergence for dierent
learning rates,

• How important is the positioning of the RBF nodes in the input space?
What strategy did you choose? Is it better than random positioning of the
RBF nodes? Please support your conclusions with quantitative evidence
(e.g., error comparison).

• What are the main eets of changing the width of RBFs?
• How does the rate of convergence change with dierent values of eta?

• Please compare your optimal RBF network trained in batch mode with a
two-layer perceptron trained with backprop (also in batch mode). To nd
a suitable candidate two-layer perceptron architecture try to use the same
overall number of hidden units as in the RBF network and distribute them
in dierent ways over the two hidden layers (use the existing libraries of
your choice for these simulations).

Please remember that generalisation
performance and training time are of greatest interest.

3.3 Competitive learning for the initialisation of RBF units
Now we will take a look at the problem of placing the RBFs in input space.
We will use a version of Competitive Learning (CL) for Vector Quantization.

The simple Competitive Learning algorithm we use here can only adjust the
positions of the RBF units without adjusting the width of the units. Therefore
you will have to make these adjustment yourselves based on the distribution
of data around the cluster centers found with the simple CL algorithm. At
each iteration of CL a training vector is randomly selected from the data. The
closest RBF unit (usually called the winning unit) is computed, and this unit is

updated, in such a way that it gets closer to the training vector. The other units
may or may not (depending on the version of CL used) be moved towards it too,
depending on distance. This way the units will tend to aggregate in the clusters
in the data space.

Please, couple the CL-based approach to RBF network
initilisation with the aforementioned delta learning for the output weights.

• Compare the CL-based approach with your earlier RBF network where you
manually positioned RBF nodes in the input space. Use the same number
of units (it could be the number of unicts that allowed you to lower the
absolute residual error below 0.01).

Make this comparison for both noisefree and noisy approximation of sin(2x). Pay attention to convergence,
generalisation performance and the resulting position of nodes.

• Introduce a strategy to avoid dead units, e.g. by having more than a single
winner. Choose an example to demonstrate this eect in comparison with
the vanilla version of our simple CL algorithm.

• Congure an RBF network with the use of CL for positioning the RBF
units to approximate a two-dimensional function, i.e. from R
2
to R
2
. As
training examples please use noisy data from ballistical experiments where
inputs are pairs: <angle, velocity> and the outputs are pairs: <distance, height>. There are two datasets available: ballist for training
and balltest for testing.

First thing to do is to load the data and then
train the RBF network to nd a mapping between the input and output values. Please be careful with the selection of a suitable number of
nodes and their initialisation to avoid dead-unit and overtting problems.

Report your results and observations, ideally with the support of illustrations, and document your analyses (e.g., inspect the position of units in
the input space).

4 Assignment – Part 2

4.1 Topological Ordering of Animal Species
The SOM algorithm can be used to assign a natural order to objects characterized only by a large number of attributes. This is done by letting the SOM
algorithm create a topological mapping from the high-dimensional attribute
space to a one-dimensional output space.

As sample data, please use a simple database of 32 animal species where
each animal is characterized by 84 binary attributes. The data is in the le
animals.dat. This le denes the 32×84 matrix props where each row contains
the attributes of one animal. The data are organised in row-by-row manner.

There is also a le animalnames.dat with the names of the animals in the same
order. This vector should only be used to print out the nal ordering in a more
readable format. These 84 values serve as input and 100 nodes arranged in a
one-dimensional topology, i.e. in a linear sequence, constitute the output.

Train the SOM network by showing the attribute vector of one animal at a
time. The SOM algorithm should now be able to create a mapping onto the 100
output nodes such that similar animals tend to be close while dierent animals
tend to be further away along the sequence of nodes. In order to get this onedimensional topology, the network has to be trained using a one-dimensional
neighbourhood.

In particular, your task is to write the core algorithm. Use a weight matrix
of size 100 × 84 initialized with random numbers between zero and one. Use an
outer loop to train the network for about 20 epochs, and an inner loop which
loops through the 32 animals, one at a time.

For each animal you will have to
pick out the corresponding row from the props matrix. Then nd the row of the
weight matrix with the shortest distance to this attribute vector (p). Note that
you cannot use a scalar product since the attribute vectors are not normalized.

Therefore you have to take the dierence between the two vectors and calculate
the length of this dierence vector. Once you have the index to the winning
node, it is time to update the weights. Update the weights so that they come a
bit closer to the input pattern. A suitable step size is 0.2.

Note that only weights
to the winning node and its neighbours should be updated. The neighbours are
in this case the nodes with an index close to that of the winning one. You
should start with a large neighbourhood and gradually make it smaller.

Make
the size of the neighbourhood depend on the epoch loop variable so that you
start with a neighbourhood of about 50 and end up close to one or zero.

Finally,
you have to print out the result, i.e. the animals in a natural order. Do this
by looping through all animals once more, again calculating the index of the
winning output node. Save these indices in a 32 element vector pos. By sorting
this vector we will get the animals in the desired order. Check the resulting
order.

Does it make sense? If everything works, animals next to each other
in the listing should always have some similarity between them. Insects should
typically be grouped together, separate from the dierent cats, for example.

4.2 Cyclic Tour
In the previous example, the SOM algorithm in eect positioned a one-dimensional
curve in the 84-dimensional input space so that it passed close to the places
where the training examples were located. Now the same technique can be used
to layout a curve in a two-dimensional plane so that it passes a set of points.

In fact, this can be interpreted as a variant of the travelling salesman problem.
The training points correspond to the cities and the curve corresponds to the
tour. SOM algorithm should be able to nd a fairly short route which passes
all cities.

The actual algorithm is very similar to what you implemented in the previons
task. In fact, you might be able to reuse much of the code. The main dierences
are:
• The input space has two dimensions instead of 84. The output grid should
have 10 nodes, corresponding to the ten cities used in this example.

• The neighbourhood should be circular since we are looking for a circular
tour. When calculating the neighbours you have to make sure that the
rst and the last output node are treated as next neighbours.

• The size of the neighbourhood must be smaller, corresponding to the
smaller number of output nodes. It is reasonable to start with a neighbourhood size of 2 and then change it to 1 and nally zero.

• When presenting the result, it is better to plot the suggested tour graphically than to sort the cities.
The location of the ten cities is dened in the le cities.dat which denes
the 10 × 2 matrix city. Each row contains the coordinates of one city (value
between zero and one).
Please plot both the tour and the training points. Give your interpretation.

4.3 Data Clustering: Votes of MPs
The le votes.dat contains data about how all 349 members of the Swedish
parliament did vote in the 31 rst votes during 20042005. There are also three
additional les mpparty.dat, mpsex.dat and mpdistrict.dat with information
about the party, gender and district of each member of parliament (MP). Finally,
there is a le mpnames.txt with the names of the MPs. Your task is to use the
SOM algorithm to position all MPs on a 10 × 10 grid according to their votes.

By looking at where the dierent parties end up in the map you should be
able to see if votes of the MPs actually reect the traditional leftright scale,
and if there is a second dimension as well. You should be able to see which
parties are far apart and which are close.

By looking at the distribution of female and male MPs you could get some
insight into whether MPs tend to vote dierently depending on their gender.
You can also see if there is a tendency for MPs from dierent districts to vote
systematically dierent.

The le votes.dat denes a 349 × 31 matrix votes. Data are organised in
row-by-row manner. Each one of 349 rows corresponds to a specic MP and each
one of 31 columns to a specic vote. The elements are zero for a no-vote and
one for a yes-vote. Missing votes (abstrained or non-present) are represented
as 0.5.

You should use the SOM algorithm to nd a topological mapping from the
31-dimensional input space to a 10 × 10 output grid. The network should be
trained with each MPs votes as training data. If all works well, voting patterns
that are similar will end up close to each other in the 10 × 10 map.

Please display the results with respect to dierent attributes (i.e. party,
gender, district) and describe the results, provide your interpretation.
Good luck!