DD2437 Lab assignment 1 Learning and generalisation in feed-forward networks — from perceptron learning to backprop solution


1 Introduction

This exercise is concerned with supervised (error-based) learning approaches for
feed-forward neural networks, both single-layer and multi-layer perceptron.

1.1 Aim and objectives

After completion of the lab assignment, you should be able to
• design and apply networks in classification, function approximation and
generalisation tasks
• identify key limitations of single-layer networks
• configure and monitor the behaviour of learning algorithms for single- and
multi-layer perceptron networks
• recognise risks associated with backpropagation and minimise them for
robust learning of multi-layer perceptrons.

1.2 Scope
In this lab you will implement single- and multi-layer perceptrons with focus
on the associated learning algorithms. You will then study their properties by
means of simulations. Since the calculations are naturally formulated in terms
of vectors and matrices, the exercise was originally conceived with Matlab¹ in
mind. However, you are free to choose your own programming/scripting language, environment etc. In the first part you will be asked to develop all the code
from scratch, whereas in the second part you can use one of the recommended
libraries (i.e. NN toolbox in Matlab, scikit-learn in Python or TensorFlow) to
examine more advanced aspects of training multi-layer perceptrons with backpropagation. If you prefer to use other libraries or software for the second part
of the assignment, please let us know in advance.

¹ It is also possible to use Octave, a free alternative to Matlab.

In the first part, the focus will be on two learning algorithms: the Delta rule (for a
single-layer perceptron) and the generalised Delta rule (for a two-layer perceptron).
The generalised Delta rule is also known as the error backpropagation algorithm
or simply “backprop”. This is one of the most common and generic supervised
learning algorithms for neural networks and it stems from the concept of gradient
descent. In this exercise you will have the opportunity to use it for classification,
data compression, and function approximation.

In the second part of the lab assignment, you will work with multi-layer perceptrons to solve the problem of chaotic time series prediction. In this task you
will design, train, validate (including model selection) and evaluate your neural
network with the ambition to deliver a robust solution with good generalisation
capabilities. Since you will have to rely on more advanced features of neural
network training and evaluation, you will be asked to rely on the existing libraries (as mentioned above, NN toolbox in Matlab, scikit-learn in Python and
TensorFlow are recommended; there is a lot of solid documentation available to
familiarise yourselves with these tools).

2 Background

2.1 Data Representation
The data can be effectively represented in matrices (collections of vectors). Since
this is a supervised learning approach, our training data should consist of input
patterns (vectors) and the associated output patterns, often called labels (e.g.,
scalar values for classification and regression).

There are two options to perform
training – sequential on a sample-by-sample basis and batch. In this lab we
first focus on batch learning. This means that all patterns in the training set
will be used as a whole at the same time instead of stepping through them
one by one and updating weights successively for each sample (input pattern
with its associated label/output).

Batch learning is better suited for a matrix
representation and is significantly more effective given built-in functions for
quick matrix operations in most programming/scripting languages. In batch
learning, each use of the whole set of available training patterns is commonly
referred to as an epoch and the entire training process involves many iterative
epochs. By a suitable choice of representation, an epoch can be performed with
just a few matrix operations.

Further, in problems where binary representation (0/1) is inherent, it is sometimes convenient and practical to rely instead on a symmetric (-1/+1) representation of the patterns. This representation however is not intuitive for visualisation. Therefore, for visual inspection, you may choose to write a function that
transforms a symmetric -1/+1 pattern into a binary 0/1 pattern.
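
The conversion mentioned above can be sketched in a couple of lines. The helper names to_binary and to_symmetric are illustrative choices, not names prescribed by the assignment.

```matlab
% Convert between symmetric (-1/+1) and binary (0/1) pattern coding.
to_binary    = @(S) (S + 1) / 2;    % -1 -> 0, +1 -> 1
to_symmetric = @(B) 2 * B - 1;      %  0 -> -1, 1 -> +1

S = [-1 1 1 -1];                    % example pattern in symmetric coding
B = to_binary(S);                   % same pattern in binary coding
```

Both helpers work element-wise, so they can be applied to whole pattern matrices as well as to single vectors.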

The input patterns (vectors) as well as their corresponding targets/labels (predominantly scalar values) can be represented as columns in two matrices, X and
T, respectively. With this representation, the XOR problem would for instance
be described by
$$X = \begin{bmatrix} -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 \end{bmatrix}, \qquad T = \begin{bmatrix} -1 & 1 & 1 & -1 \end{bmatrix}$$
If we read the matrices column-wise, we get the pattern (-1, -1) to be associated
with the output -1, and the pattern (1, -1) with the output 1 etc.

A single-layer perceptron sums the weighted inputs, adds the bias term and
produces the thresholded output. If you have more than one output, you have to
have one set of weights for each output. These computations become very simple
in matrix form.

Make sure however that you account for the bias term by adding
an extra input signal whose value always is one (and a weight corresponding to
the bias value, as shown in the lecture). In the XOR example we thus get an
extra row:
$$X_{input} = \begin{bmatrix} -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}, \qquad T = \begin{bmatrix} -1 & 1 & 1 & -1 \end{bmatrix}$$

The weights are stored in matrix W with as many columns as the dimensionality
of the input patterns and with the number of rows matching the number of the
outputs (dimensionality of the output).

The network outputs corresponding to
all input patterns can then be calculated by a simple matrix multiplication
followed by thresholding at zero (since the bias has been already taken into
account in the extra column of the weight matrix, provided that an extra entry
with the constant value 1 was also included in the formation of the inputs, as
explained earlier and discussed in the lecture).
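
As a small illustration of the matrix multiplication followed by thresholding, the sketch below evaluates a single-layer perceptron on the four symmetric input patterns. The weight values are hand-picked for this example (they happen to implement a logical AND) and are not part of the assignment.

```matlab
patterns = [-1 1 -1 1; -1 -1 1 1];       % one pattern per column
ndata = size(patterns, 2);
Xinput = [patterns; ones(1, ndata)];     % extra row of ones for the bias

W = [1 1 -1];                            % hand-picked weights; last entry is the bias weight
out = 2 * (W * Xinput > 0) - 1;          % threshold at zero, giving -1/+1 outputs
```

W has one row per output and one column per input row (including the bias row), so the product W * Xinput yields one output column per pattern.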

Learning with the Delta rule
aims, with the representation selected, to find the weights W that give the best
approximation:
$$W \cdot \begin{bmatrix} -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} \approx \begin{bmatrix} -1 & 1 & 1 & -1 \end{bmatrix}$$
Unfortunately the XOR problem is one of the classical problems that a single-layer perceptron cannot solve.
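
A short worked calculation makes this concrete. It assumes that "best approximation" is meant in the least-squares sense and uses the symmetric XOR coding with input patterns (-1, -1), (1, -1), (-1, 1), (1, 1), targets -1, 1, 1, -1 and a bias row of ones. The rows of the input matrix are then mutually orthogonal, and the target row is orthogonal to each of them:

```latex
X_{input} X_{input}^T = 4I, \qquad
T X_{input}^T = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}, \qquad
W^{*} = T X_{input}^T \left( X_{input} X_{input}^T \right)^{-1}
      = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}
```

Under these assumptions the best linear fit is the all-zero weight vector, which produces the same output for every pattern and therefore cannot separate the two classes.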

2.2 Implementation of the Delta rule
Store the training data in variables patterns and targets. As discussed above,
add an extra row to the input patterns with ones corresponding to the extra
bias terms in the weight matrix.

The Delta rule can be written as:
$$\Delta w_{j,i} = -\eta\, x_i \left( \sum_k w_{j,k} x_k - t_j \right)$$
where $\bar{x}$ is the input pattern, $\bar{t}$ is the wanted output pattern and $w_{j,i}$ is the
connection from $x_i$ to $t_j$. This can be more compactly written in matrix form:
$$\Delta W = -\eta (W\bar{x} - \bar{t})\bar{x}^T$$

The formula above describes how the weights should be changed based on one
training pattern (and its matching target label). To get the total weight change
for the entire epoch, i.e. accounting for all training patterns, the weight update
contributions from all patterns should be summed. Since we store the patterns as
columns in X and T, we get this sum “for free” when the matrices are multiplied.

The total weight change from a whole epoch can therefore be written in this
compact way:
$$\Delta W = -\eta (W X - T) X^T$$
Write your code so that the learning according to the formula above can be flexibly repeated epochs times (where 20 is a suitable number for a low-dimensional
perceptron). Try to avoid loops as much as possible in favour of powerful matrix
operations (especially multiplications). Make sure that your code works for arbitrary sizes of input and output patterns and the number of training patterns.

The step length $\eta$ should be set to some suitable small value like 0.001. Note: a
common mistake when implementing this is to accidentally orient the matrices
wrongly so that columns and rows are interchanged. Make a sketch on paper
where you write down the sizes of all components, starting with the input, and
how the dimensionality propagates through the weights to the output. This will be
particularly important in the next part of the lab with a two-layer perceptron.

Before the learning phase can be executed, the weights must be initialised (have
initial values assigned). The normal procedure is to start with small random
numbers drawn from the normal distribution with zero mean. Construct a function to create an initial weight matrix by using random number generators
built into programming/scripting languages. Note that the matrix must have
matching dimensions.
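
The steps of this section can be put together as in the following minimal sketch of batch Delta rule learning. It is only one possible arrangement. The training set is a hypothetical four-pattern AND problem chosen to keep the example small, and the step length 0.1 and 50 epochs are assumptions suited to that tiny set rather than the values suggested above for larger data sets.

```matlab
% Batch Delta rule on the AND problem (symmetric -1/+1 coding).
patterns = [-1  1 -1  1;
            -1 -1  1  1];        % one input pattern per column
targets  = [-1 -1 -1  1];        % AND of the two inputs

eta    = 0.1;    % step length (assumed; larger than 0.001 because there are only 4 patterns)
epochs = 50;     % number of passes through the whole training set (assumed)

[ninputs, ndata] = size(patterns);
noutputs = size(targets, 1);

X = [patterns; ones(1, ndata)];            % extra row of ones for the bias
W = 0.01 * randn(noutputs, ninputs + 1);   % small zero-mean random initial weights

mse = zeros(1, epochs);
for epoch = 1:epochs
    err = W * X - targets;                 % noutputs x ndata
    mse(epoch) = mean(err(:) .^ 2);        % error before this epoch's update
    W = W - eta * err * X';                % Delta W = -eta (W X - T) X'
end

out = 2 * (W * X > 0) - 1;                 % threshold at zero
nerrors = sum(out(:) ~= targets(:));       % misclassified training patterns
```

The only loop runs over epochs; the sum over patterns is carried out by the matrix product err * X'. Because the sizes are read from patterns and targets, the same code applies to other input and output dimensions.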

2.3 Implementation of a two-layer perceptron
The focus is here on the implementation of the generalised Delta rule, more
commonly known as backprop. You are going to use it in several different experiments, so it is worth making this general. Specifically, please make sure that
the number of nodes in the hidden layer easily can be varied, for instance by
changing the value of a global parameter. Also let the number of iterations and
the step length be controlled in this way.

For multi-layer feed-forward networks, non-linear transfer functions should be
used, often denoted $\varphi$. Commonly in classical multi-layer perceptrons (especially
in shallow architectures) one chooses a function with the derivative that is simple
to compute, e.g.
$$\varphi(x) = \frac{2}{1 + e^{-x}} - 1$$
which has the derivative
$$\varphi'(x) = \frac{[1 + \varphi(x)][1 - \varphi(x)]}{2}$$

Note that it is advantageous to express the derivative in terms of $\varphi(x)$ itself
since this value, used in the backward pass of the backpropagation learning
algorithm, has to be computed anyway in the forward pass. This way we can
save on the extra computations that would otherwise have to be performed to
calculate derivatives (as discussed in the lecture, we get these derivatives in the
scenario described above almost “for free”).
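
The sketch below shows the transfer function and its derivative written in terms of the function value, together with a finite-difference comparison. The names phi and dphi_from_output are illustrative, and the test points and step size h are arbitrary choices.

```matlab
phi = @(x) 2 ./ (1 + exp(-x)) - 1;                 % transfer function
dphi_from_output = @(y) (1 + y) .* (1 - y) / 2;    % derivative, given y = phi(x)

x  = linspace(-3, 3, 13);
y  = phi(x);                      % computed once in the forward pass
dy = dphi_from_output(y);         % reused in the backward pass, no extra exp needed

% Finite-difference estimate of the derivative for comparison
h = 1e-6;
dy_numeric = (phi(x + h) - phi(x - h)) / (2 * h);
max_diff = max(abs(dy - dy_numeric));
```

If the derivative formula is right, max_diff stays at the level of numerical round-off.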

In backprop, each epoch consists of three parts. First, the so called forward
pass is performed. In this the activities of the nodes are computed layer for
layer. Secondly, there is the backward pass when an error signal $\delta$ is computed for each node. Since the $\delta$-values depend on the $\delta$-values in the following
layers, this computation must start in the output layer and successively work
its way (propagate) backwards layer by layer (thereby giving rise to the name
backpropagation). Finally, the weights are updated.

2.3.1 The forward pass
Let $x_i$ denote the activity level in node i in the input layer and let $h_j$ be the
activity in node j in the hidden layer. The output signal $h_j$ now becomes
$$h_j = \varphi(h_j^*)$$
where $h_j^*$ denotes the summed input signal to node j, i.e.
$$h_j^* = \sum_i w_{j,i} x_i$$

Thereafter the same happens in the next layer, which eventually gives the final
output pattern in the form of the vector $\bar{o}$.
$$o_k = \varphi(o_k^*)$$
$$o_k^* = \sum_j v_{k,j} h_j$$
Just as for the one layer perceptron, these computations can efficiently be written in matrix form. This also means that the computations are performed simultaneously over all the training patterns.
$$H = \varphi(W X)$$
$$O = \varphi(V H)$$

The transfer function $\varphi$ should here be applied to all elements in the matrix,
independently of each other.
We have so far omitted a small but important point, the so called bias term.
For the algorithm to work, we must add an input signal in each layer which has
the value one². In our case the matrices X and H must be extended with a row
of ones at the end.

In Matlab, the forward pass can be expressed like this:
% w: Nhidden x (Ninputs+1), v: Noutputs x (Nhidden+1); the last column holds the bias weights
hin = w * [patterns ; ones(1,ndata)];             % summed input to the hidden layer
hout = [2 ./ (1+exp(-hin)) - 1 ; ones(1,ndata)];  % hidden activations plus bias row
oin = v * hout;                                   % summed input to the output layer
out = 2 ./ (1+exp(-oin)) - 1;                     % network output
Here we use the variables hin for H*, hout for H, oin for O* and out for O.
Observe the use of the Matlab operator ./ which denotes element-wise division
(in contrast to matrix division). The corresponding operator .* denotes
element-wise multiplication.

2.3.2 The backward pass
The backward pass aims at calculating the generalised error signals $\delta$ that are
used in the weight update. For the output layer nodes, $\delta$ is calculated as the
error in output multiplied with the derivative of the transfer function ($\varphi'$), thus:
$$\delta_k^{(o)} = (o_k - t_k) \cdot \varphi'(o_k^*)$$
To compute $\delta$ in the next layer, one uses the previously calculated $\delta^{(o)}$:
$$\delta_j^{(h)} = \left( \sum_k v_{k,j}\, \delta_k^{(o)} \right) \cdot \varphi'(h_j^*)$$

² Some authors choose to let the bias term go outside the sum in the formulas, but this
leads to the effect that it must be given special treatment all the way through the algorithm.
Both the formulas and the implementation become simpler if you make sure that the extra
input signal with the value 1 is added for each layer.

This can be expressed in matrix form:
$$\delta^{(o)} = (O - T) \odot \varphi'(O^*)$$
$$\delta^{(h)} = (V^T \delta^{(o)}) \odot \varphi'(H^*)$$
(where $\odot$ denotes element-wise multiplication).
As an example, the corresponding Matlab implementation could be coded in
the following way:
delta_o = (out - targets) .* ((1 + out) .* (1 - out)) * 0.5;
delta_h = (v' * delta_o) .* ((1 + hout) .* (1 - hout)) * 0.5;
delta_h = delta_h(1:Nhidden, :);

The last line only has the purpose of removing the extra row that we previously
added to the forward pass to take care of the bias term. We have here assumed
that the variable Nhidden contains the number of nodes in the hidden layer.
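
A useful sanity check for a backward pass is to compare the resulting gradient with a finite-difference estimate. The sketch below does this for the first-layer weights on the XOR patterns. It assumes the error measure E = 0.5 * sum((out - targets).^2), and the deterministic weight values, the number of hidden nodes and the step h are arbitrary illustrative choices.

```matlab
phi = @(x) 2 ./ (1 + exp(-x)) - 1;

% XOR patterns in symmetric coding, one pattern per column
patterns = [-1 1 -1 1; -1 -1 1 1];
targets  = [-1 1 1 -1];
ndata    = size(patterns, 2);
Nhidden  = 3;

% Deterministic illustrative weights (any small values would do)
w = 0.5 * reshape(sin(1:Nhidden * 3), Nhidden, 3);        % Nhidden x (2 inputs + bias)
v = 0.5 * reshape(cos(1:(Nhidden + 1)), 1, Nhidden + 1);  % 1 x (Nhidden + bias)

% Forward pass
X    = [patterns; ones(1, ndata)];
hout = [phi(w * X); ones(1, ndata)];
out  = phi(v * hout);

% Backward pass
delta_o = (out - targets) .* ((1 + out) .* (1 - out)) * 0.5;
delta_h = (v' * delta_o) .* ((1 + hout) .* (1 - hout)) * 0.5;
delta_h = delta_h(1:Nhidden, :);      % drop the row belonging to the bias node

grad_v = delta_o * hout';             % dE/dv for E = 0.5 * sum((out - targets).^2)
grad_w = delta_h * X';                % dE/dw

% Finite-difference estimate of dE/dw for comparison
E = @(w_, v_) 0.5 * sum(sum((phi(v_ * [phi(w_ * X); ones(1, ndata)]) - targets) .^ 2));
h = 1e-6;
num_w = zeros(size(w));
for idx = 1:numel(w)
    step = zeros(size(w));
    step(idx) = h;
    num_w(idx) = (E(w + step, v) - E(w - step, v)) / (2 * h);
end
max_diff_w = max(abs(grad_w(:) - num_w(:)));
```

A large value of max_diff_w usually points to a transposed matrix or a missing factor in the derivative; the same comparison can be repeated for v.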

2.3.3 Weight update
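After the backward pass, it is now time to perform the actual weight update.
The formula for the update is:
$$\Delta w_{j,i} = -\eta\, x_i\, \delta_j^{(h)}$$
$$\Delta v_{k,j} = -\eta\, h_j\, \delta_k^{(o)}$$
which we as usual convert to matrix form
$$\Delta W = -\eta\, \delta^{(h)} X^T$$
$$\Delta V = -\eta\, \delta^{(o)} H^T$$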

As discussed in the lecture, to facilitate the convergence of our backprop learning algorithm a so-called momentum term can be added. This implies that the
weights are not modified exclusively with the update values from above,
but with a moving average taking into account previous update(s) as well. This
approach suppresses fast variations and allows the use of a larger learning rate.

All in all, it balances out the contribution of the larger learning rate promoting
faster convergence with the momentum slowing down the process (exploration
vs exploitation in the search through the weight space). A scalar factor $\alpha$ controls how much the old weight update vector contributes to the new update. A
suitable value of $\alpha$ is often 0.9. The new update rule then becomes (in matrix
form):
$$\Theta = \alpha \Theta - (1 - \alpha)\, \delta^{(h)} X^T$$
$$\Psi = \alpha \Psi - (1 - \alpha)\, \delta^{(o)} H^T$$
$$\Delta W = \eta \Theta$$
$$\Delta V = \eta \Psi$$

In Matlab, it could be implemented as follows:
% pat = [patterns ; ones(1,ndata)] is the input with the bias row; dw and dv start as zeros
dw = (dw .* alpha) - (delta_h * pat') .* (1-alpha);
dv = (dv .* alpha) - (delta_o * hout') .* (1-alpha);
w = w + dw .* eta;
v = v + dv .* eta;
We have now gone through all the central parts of the algorithm. What remains
is to put all parts together. Do not forget that the forward pass, the backward
pass and the weight update should be performed for each epoch. For this, a for-loop can preferably be used to successively get better and better weights in
the iteration over epochs.
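
Putting the three parts together could look like the following sketch, which trains a two-layer perceptron on the XOR patterns. The number of hidden nodes, the number of epochs, the step length and the scale of the initial weights are assumptions chosen only for illustration, and convergence to a perfect solution is not guaranteed for every random initialisation.

```matlab
phi = @(x) 2 ./ (1 + exp(-x)) - 1;

patterns = [-1 1 -1 1; -1 -1 1 1];   % XOR inputs, symmetric coding
targets  = [-1 1 1 -1];
[Ninputs, ndata] = size(patterns);
Noutputs = size(targets, 1);

Nhidden = 4;      % assumed values, chosen only for illustration
epochs  = 2000;
eta     = 0.1;
alpha   = 0.9;

w  = 0.5 * randn(Nhidden, Ninputs + 1);    % first-layer weights (with bias column)
v  = 0.5 * randn(Noutputs, Nhidden + 1);   % second-layer weights (with bias column)
dw = zeros(size(w));
dv = zeros(size(v));
pat = [patterns; ones(1, ndata)];          % inputs with bias row
mse = zeros(1, epochs);

for epoch = 1:epochs
    % forward pass
    hout = [phi(w * pat); ones(1, ndata)];
    out  = phi(v * hout);
    mse(epoch) = mean((out(:) - targets(:)) .^ 2);

    % backward pass
    delta_o = (out - targets) .* ((1 + out) .* (1 - out)) * 0.5;
    delta_h = (v' * delta_o) .* ((1 + hout) .* (1 - hout)) * 0.5;
    delta_h = delta_h(1:Nhidden, :);

    % weight update with momentum
    dw = (dw .* alpha) - (delta_h * pat') .* (1 - alpha);
    dv = (dv .* alpha) - (delta_o * hout') .* (1 - alpha);
    w = w + dw .* eta;
    v = v + dv .* eta;
end
nerrors = sum(sign(out(:)) ~= targets(:));   % misclassified patterns after training
```

The vector mse holds one value per epoch and can be plotted directly as a learning curve; nerrors shows whether this particular run separated the four patterns.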

2.4 Monitoring the learning process and evaluation
Monitoring the process of learning for multi-layer perceptrons is not as simple
as for a single-layer perceptron, which could be done by drawing the line of
separation – decision boundary. For multi-layer networks we commonly rely on
the output error as a probe for the advancement of the learning process.

It is
a common practice therefore to plot learning curves with the error estimated
either by the mean square error or, in classification tasks, as the total number
or proportion of misclassifications. Such learning curves illustrate the progress
made over consecutive epochs (the error is usually estimated for the entire epoch,
i.e. across all the training patterns).
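
The two error measures mentioned above can be computed as in this small sketch. The output values are made up purely to show the calculation.

```matlab
out     = [0.8 -0.3 0.1 -0.9];    % made-up network outputs for four patterns
targets = [1    1  -1   -1];      % corresponding desired outputs

mse = mean((out(:) - targets(:)) .^ 2);          % mean squared error
predicted = 2 * (out > 0) - 1;                   % threshold at zero
nmiss = sum(predicted ~= targets);               % number of misclassifications
miss_ratio = nmiss / numel(targets);             % proportion of misclassifications
```

Storing mse and miss_ratio in vectors indexed by epoch and plotting them against the epoch number gives the learning curves.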

2.5 Generalisation, regularisation, validation for robust
learning
In the second part of the lab assignment, more advanced concepts will be introduced to make the development of a multi-layer perceptron and particularly the
learning process more robust. In essence, the objective is to improve generalisation capabilities of your neural network.

As discussed in the lecture, there are a
number of approaches that practitioners adopt (here in the context of shallow
networks). In this more advanced part of the lab assignment, we will focus on
the problem of model selection (how the architecture is decided), validation,
estimation of the generalisation error and regularisation.

These concepts are
covered in detail in the lecture. Here, I would just like to draw your attention
to the problem of validation and estimation of the true generalisation error, as
it often involves sampling your data.

In short, unlike in the first part of the
assignment with the primary focus on weight updates, in the second part you
will have to split your available data in the development of your multi-layer
perceptron to conduct

• training: updating weights in the learning process,
• validation: monitoring the process of learning and providing basis for a
range of developer’s decisions including model selection, and
• testing: the final/ultimate evaluation of the accuracy and generalisation
power of your network on a separate (unseen) data subset.

3 Assignment – Part I

3.1 Classification with a single-layer perceptron
3.1.1 Generate linearly-separable data
In the first place, please generate some data that can be used for binary classification (two classes). To simplify visual inspection, let’s work with two-dimensional
data. To start with, please draw two sets of points in 2D from a multivariate normal distribution. Choose parameters yourselves to make sure that the two sets
are linearly separable (so the means of the two distributions should be sufficiently different).

You can generate 100 points per class and shuffle samples so
that in your dataset you would not have just two concatenated blocks of samples
from the same class. Although this reordering (shuffling) does not matter for
batch learning, it has implications for the speed of convergence for sequential
(on-line) learning, where updates are made on a sample-by-sample basis. Please
plot your points with different colours per class.
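
One way to generate and shuffle such data is sketched below. It assumes isotropic Gaussian clouds (a special case of the multivariate normal distribution) built from randn, and the means and spreads are example values that should be tuned to obtain the desired separation.

```matlab
n = 100;                             % points per class
mA = [ 1.5;  1.5];  sigmaA = 0.5;    % class A mean and spread (assumed values)
mB = [-1.5; -1.5];  sigmaB = 0.5;    % class B mean and spread (assumed values)

classA = repmat(mA, 1, n) + sigmaA * randn(2, n);
classB = repmat(mB, 1, n) + sigmaB * randn(2, n);

patterns = [classA, classB];              % 2 x 2n, one sample per column
targets  = [ones(1, n), -ones(1, n)];     % +1 for class A, -1 for class B

order    = randperm(2 * n);               % shuffle samples and labels together
patterns = patterns(:, order);
targets  = targets(order);
```

The classes can then be plotted in different colours by selecting columns with targets == 1 and targets == -1, for instance with two calls to plot and hold on in between.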

3.1.2 Perform classification with a single-layer perceptron and analyse the results
Apply and compare both perceptron learning and Delta learning rules on the
generated dataset. Please try also to compare sequential with a batch learning
approach. Comparisons can be made using some evaluation metrics that could be
the number or ratio of misclassified examples at each epoch (iteration through
the entire dataset).

How quickly do the algorithms converge? Please plot the
learning curves for each variant of learning. You could also visualise the learning
process by plotting a separating line (decision boundary) after each epoch of
training (for that you could generate a sort of animation; you are not required
though to demonstrate this animation to the teaching assistant).
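
The perceptron learning rule is not spelled out in this text, so the sketch below assumes its usual error-correction form: the weights change only when the thresholded output is wrong. It is applied sequentially, one sample at a time, to a hypothetical four-pattern AND problem; the step length and the zero initial weights are illustrative choices.

```matlab
patterns = [-1 1 -1 1; -1 -1 1 1];    % AND problem, symmetric coding
targets  = [-1 -1 -1 1];
ndata    = size(patterns, 2);
X = [patterns; ones(1, ndata)];       % bias row

eta    = 0.1;
epochs = 20;
W = zeros(1, size(X, 1));             % zero start keeps this example deterministic
errors_per_epoch = zeros(1, epochs);

for epoch = 1:epochs
    for n = 1:ndata                                  % sequential: one sample at a time
        y = 2 * (W * X(:, n) > 0) - 1;               % thresholded output
        W = W + eta * (targets(n) - y) * X(:, n)';   % changes W only if y is wrong
    end
    out = 2 * (W * X > 0) - 1;
    errors_per_epoch(epoch) = sum(out ~= targets);   % misclassifications after this epoch
end
```

Replacing the thresholded output y by the linear output W * X(:, n) in the update line turns this into the sequential Delta rule, which makes the two rules easy to compare on the same data.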

3.1.3 Classification of samples that are not linearly separable
Perform the same task as described above, i.e. including data generation, perceptron learning (both perceptron and Delta rules) and evaluation, for data that
are not linearly separable. The easiest way to synthesise such data is to make
the means of the two multivariate normal distributions mentioned above more
similar and/or increase the spread (variance).

As a result, you should see that
the two clouds of points (corresponding to the two classes) overlap when you
plot the samples. You can control the amount of overlap by the parameters of
the distributions. Next you can train the perceptron to classify the data and
monitor its performance. Please make a similar analysis as in the case of linearly separable data. Pay special attention to the manifestations of non-convergent
learning with the use of the classical perceptron learning rule.

3.2 Classification and regression with a two-layer perceptron
3.2.1 Classification of linearly non-separable data
Now we are ready to return to the previous problem of linearly non-separable
patterns. Test a two-layer perceptron trained with backprop and verify that it
can solve the problem (separate the two classes).

Modify the number of hidden nodes and demonstrate the effect the size of the hidden layer has on the
performance (both the mean squared error and the number/ratio of misclassifications). How many hidden nodes do you need to perfectly separate the available
data (if manageable at all given your data randomisation)? In parallel with the
evaluation on the training data (data that you use to calculate weight updates
using backprop), please also evaluate on a new, previously unseen, test
dataset.

To this end, please generate from your random distributions 50 new
samples for each class and use them only to calculate the error (mean squared
error or the ratio of misclassifications) at different stages/epochs of learning.

Make sure that you do not use them in the learning process. You can then
study the following questions:
• How do the training and test learning curves compare?
• How do the training and test classification results depend on the size of
the hidden layer?
• How many epoch iterations do you need for convergence?
• Is there any difference between batch and sequential learning approaches?

3.2.2 The encoder problem
The encoder problem is a classical problem for how one can force a network to
find a compact coding of sparse data. This is done by using a network with an
hourglass-shaped topology, i.e. the number of hidden nodes is considerably
smaller than the dimensionality of the data. The network is trained with identical
input and output patterns, which forces the network to find a compact representation in the hidden layer.

For this reason we often refer to such networks
as autoencoders (finding a new representation basis or encoding through autoassociation, i.e. input=output).
We will study a simple autoencoder with 8–3–8 feed-forward architecture (two-layer perceptron). The data are originally represented using “one out of n” coding,
i.e. only one input variable is active (=1) and the rest of input variables are in
an inactive state, here: -1. For example:
$$\begin{bmatrix} -1 & -1 & 1 & -1 & -1 & -1 & -1 & -1 \end{bmatrix}^T$$

There are eight such patterns in total. By letting the hidden layer have only
three nodes, we can force the network to produce a representation where the
eight-dimensional input patterns are represented in the three-dimensional space
(spanned by the activations of the hidden nodes). Your task is to study what
type of representation is created in the hidden layer.
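
The eight training patterns can be generated in one line, as in this small sketch (the variable names follow those used earlier in the text).

```matlab
n = 8;
patterns = 2 * eye(n) - 1;   % column k has +1 in row k and -1 elsewhere
targets  = patterns;         % auto-association: desired output equals the input
Nhidden  = 3;                % width of the bottleneck in the 8-3-8 network
```

After training, the hidden-layer activations for all eight patterns form a 3 x 8 matrix, and looking at the signs of its entries is one way to inspect the internal code.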

Please, use your implementation of the generalised Delta rule to train the
network until the learning converges (it does not necessarily have to imply that
the mean squared error is 0 but that the rounded outputs match the corresponding inputs). Does the network always succeed in doing this? How does
the internal code look, what does it represent? For that, you can inspect the
activations of the hidden layer corresponding to input patterns. You could also
examine the weight matrix for the first layer. Can you deduce anything from
the sign of the weights?

3.3 Function approximation
So far we have used the perceptron’s ability to classify data or find low-dimensional
representations. Multi-layer perceptrons are known however for their ability to
approximate an arbitrary continuous function. We will here study how one can
train a two-layer perceptron network to approximate a function based on available input-output data examples. To enable visual inspection, the task is formulated for a function of two variables with a real value as output, $f: \mathbb{R}^2 \to \mathbb{R}$.

3.3.1 Generate function data
As the function to approximate we choose the well known bell shaped Gauss
function³.
$$f(x, y) = e^{-(x^2 + y^2)/10} - 0.5$$
³ Use the interval -0.5 to +0.5 to make sure that the output node will produce the values
needed.

For example, the following lines of Matlab code create the input vectors x and
y as well as the corresponding outputs z, and make a 3D plot.
x=[-5:0.5:5]';
y=[-5:0.5:5]';
z=exp(-x.*x*0.1) * exp(-y.*y*0.1)' - 0.5;
mesh(x, y, z);

The form of storage we now have for input and output is perfect for visualizing
graphically in Matlab, but to be able to use the patterns as training data, i.e.
to have all pair combinations of x and y dimensions, they must be changed to
pattern matrices. In Matlab the functions reshape and meshgrid can be used
for that purpose. The following commands will put together the two matrices
patterns and targets that are needed for training.

ndata = numel(z);
targets = reshape (z, 1, ndata);
[xx, yy] = meshgrid (x, y);
patterns = [reshape(xx, 1, ndata); reshape(yy, 1, ndata)];
(ndata is here the number of patterns, i.e. the product of the number of elements
in x and in y.)

3.3.2 Train the network and visualise the approximated function
Now put together all parts to get a program that generates function data and trains the network on them. During
learning, the program should show after each epoch what the function approximation
looks like. When all works, you should see an animated function that successively becomes more and more similar to a Gaussian. Experiment with different
numbers of nodes in the hidden layer to get a feeling for how this parameter
affects the final representation.

The network’s approximation can be visualised by performing the transformations corresponding to those in the previous section in the reverse direction.
Matlab commands to do this are:
zz = reshape(out, gridsize, gridsize);
mesh(x,y,zz);
axis([-5 5 -5 5 -0.7 0.7]);
drawnow;

Here we assume that gridsize is the number of elements in x or y (i.e. ndata =
gridsize · gridsize). The variable out is the output produced in the forward
pass when all the training data patterns (samples) are presented as input data.

(To create an animation of the learning process in Matlab, function drawnow
can be used.)

3.3.3 Evaluate generalisation performance
An important property of neural networks is their ability to generalise. This
means producing sensible output data even for input data samples that have
not been part of the training. We will now modify the experiment above and
train the network with only a limited number of available data points. We will
still look at the approximation ability at all points as before, but this time we will
focus on the approximation error (mean squared error).

To subsample data for training, one can make a random permutation of the
vectors patterns and targets and choose only the first n patterns (and examine
different values of n) for training. The program will then need to do two different
forward passes; one for the training points and one for all points. Only the first
pass is associated with an update of the weights. The result of the second pass is
used to see how well the network generalises. In our case, a good approximation
means that the network can recreate the whole function even though it has
learnt based on only a few example samples (training data points).
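
The subsampling step can be sketched as follows. The grid and function values are rebuilt here so that the block runs on its own; the names order, train_patterns and train_targets are illustrative.

```matlab
x = (-5:0.5:5)';
y = (-5:0.5:5)';
z = exp(-x .* x * 0.1) * exp(-y .* y * 0.1)' - 0.5;
ndata = numel(z);
targets = reshape(z, 1, ndata);
[xx, yy] = meshgrid(x, y);
patterns = [reshape(xx, 1, ndata); reshape(yy, 1, ndata)];

n = 25;                                   % number of training points
order = randperm(ndata);                  % random permutation of the sample indices
train_patterns = patterns(:, order(1:n)); % used for weight updates
train_targets  = targets(order(1:n));
% the full set (patterns, targets) is kept for the second, evaluation-only forward pass
```

Inside the training loop, the weight update uses train_patterns and train_targets, while the mean squared error reported for generalisation is computed from a forward pass over the full patterns matrix.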

Test with n = 1 up to n = 25. Vary the number of nodes in the hidden layer and
try to observe any trends. What happens when you have very few (less than 5)
or very many (more than 20) hidden nodes? Can you explain your observations?
A non-mandatory study, but interesting and didactical, could be to examine the
behaviour of your two-layer perceptron with a varying number of hidden nodes
in the presence of Gaussian noise added to z variables (function outputs) in the
test data subset.

4 Assignment – Part II

In the second part of the assignment you will develop a multi-layer perceptron
network for chaotic time-series prediction. In particular, you will run a benchmark test with Mackey-Glass time series, which is a solution to the following
differential equation (with real parameters $\beta$ and $\gamma$, and integer n and delay $\tau$)
$$\frac{dx}{dt} = \frac{\beta x(t - \tau)}{1 + x^n(t - \tau)} - \gamma x, \qquad \beta, \gamma, n > 0.$$

Let’s set $\beta$ = 0.2, $\gamma$ = 0.1, n = 10 and $\tau$ > 17, e.g. $\tau$ = 25. The equation can be
discretised and solved with Euler’s method:
$$x(t + 1) = x(t) + \frac{0.2\, x(t - 25)}{1 + x^{10}(t - 25)} - 0.1\, x(t).$$
You can use this iterative formulation with x(0)=1.5 (x(t)=0 for t<0).
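
A minimal sketch of this iteration is given below. Because Matlab indices start at 1, it stores the value at time t in x(t + 1); this offset, the variable names and the final time 1505 are choices made for this example.

```matlab
beta_mg  = 0.2;     % coefficient of the delayed term
gamma_mg = 0.1;     % decay coefficient
n   = 10;           % exponent
tau = 25;           % delay
t_end = 1505;       % last time point needed (assumed: 1500 plus the prediction horizon 5)

x = zeros(1, t_end + 1);    % x(k) holds the value at time k - 1
x(1) = 1.5;                 % x(0) = 1.5
for t = 0:t_end - 1
    if t - tau >= 0
        x_delayed = x(t - tau + 1);
    else
        x_delayed = 0;      % x(t) = 0 for t < 0
    end
    x(t + 2) = x(t + 1) + beta_mg * x_delayed / (1 + x_delayed ^ n) - gamma_mg * x(t + 1);
end
```

With this storage convention, the value at time t is read as x(t + 1), which has to be taken into account when the time-lagged inputs are built.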

If you are interested in chaotic time series as such, I can recommend that you
explore a large body of resources freely available on the web.

4.1 Data
Please configure the network to predict x(t+5) from five values of the time
series (four past values and the current one), that is: x(t-20), x(t-15), x(t-10), x(t-5) and x(t). This is an embedded
time-lagged representation, which is often used with feed-forward networks operating on time series. Let’s pick 1200 points from t=301 to 1500, and use them
for training, validation and testing. In Matlab, assuming that the time series is
stored in a long row vector x, the inputs and outputs could be defined as follows:
t = 301:1500;
input = [x(t-20); x(t-15); x(t-10); x(t-5); x(t)];
output = x(t+5);

Important: Please check how functions in the neural network library of your
choice handle data (in terms of matrix dimensionality) and how they account
for the bias term.

As mentioned above, the selected 1200 samples should now be divided into
three consecutive non-overlapping blocks for training, validation and testing. It
is expected that the last 200 samples are used for the final evaluation (test),
and how exactly you split the rest of available data into training and validation
is your own decision (though the subsets should constitute consecutive blocks
since it is a time series). As a measure of performance, the mean squared error
is recommended.
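
The split into three consecutive blocks could be organised as in the sketch below. A sine wave stands in for the time series so that the block runs on its own, and the validation block size of 200 is an assumption, since only the size of the test block is prescribed.

```matlab
% Stand-in series so that the sketch runs on its own; replace with the Mackey-Glass series
x = sin(0.05 * (1:1505));

t = 301:1500;
input  = [x(t-20); x(t-15); x(t-10); x(t-5); x(t)];   % 5 x 1200
output = x(t+5);                                       % 1 x 1200

ntest  = 200;          % last 200 samples, as required
nval   = 200;          % assumed size of the validation block
ntrain = numel(t) - nval - ntest;

train_idx = 1:ntrain;                       % consecutive, non-overlapping blocks
val_idx   = ntrain + (1:nval);
test_idx  = ntrain + nval + (1:ntest);

train_in = input(:, train_idx);  train_out = output(train_idx);
val_in   = input(:, val_idx);    val_out   = output(val_idx);
test_in  = input(:, test_idx);   test_out  = output(test_idx);
```

Many libraries expect one sample per row rather than per column, in which case the matrices need to be transposed before they are passed on.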

4.2 Network configuration
In the library of your choice, please construct a multi-layer perceptron network
with five inputs and one output to handle the time series prediction problem.
Make sure that you configure your data in an appropriate manner, consistently
with the instructions in the previous section.

Set up the training process with
a batch backprop algorithm and early stopping to control the duration of learning (number of epochs), thereby preventing overfitting, based on the
error estimate on the hold-out validation data subset. If you find that early
stopping leads to premature ending of the learning process, try to fix it or, in the
worst case scenario, remove it and define an alternative criterion for stopping.
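
One common way to make early stopping less prone to premature ending is to allow a number of epochs without improvement (often called patience) before stopping. The sketch below applies this logic to a made-up validation error curve; in a real run val_err(epoch) would be computed after each training epoch, and libraries usually provide an equivalent setting.

```matlab
val_err  = [1.0 0.8 0.6 0.5 0.55 0.52 0.51 0.6 0.7];   % made-up validation errors per epoch
patience = 3;          % epochs without improvement tolerated before stopping (assumed)

best_err   = inf;
best_epoch = 0;
wait       = 0;
stop_epoch = numel(val_err);
for epoch = 1:numel(val_err)
    if val_err(epoch) < best_err
        best_err   = val_err(epoch);   % new best model: its weights would be saved here
        best_epoch = epoch;
        wait       = 0;
    else
        wait = wait + 1;
        if wait >= patience
            stop_epoch = epoch;        % stop; the weights from best_epoch are kept
            break;
        end
    end
end
```

A larger patience value tolerates longer plateaus in the validation error at the cost of more training epochs.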

In addition, you will be asked to make use of some regularisation techniques,
e.g. weight decay, to further boost the generalisation performance. Please identify suitable library functions and decide which regularisation approach you are
going to adopt; motivate your choice at the end.
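
As a point of reference, one common form of weight decay (stated here as an assumption, since the choice of technique is left open) adds a penalty on the squared weights to the error E, with a regularisation strength $\lambda$:

```latex
E_{reg} = E + \frac{\lambda}{2} \sum_{j,i} w_{j,i}^2,
\qquad
\Delta W = -\eta \left( \frac{\partial E}{\partial W} + \lambda W \right)
```

Each update then shrinks the weights slightly towards zero in addition to following the error gradient. Libraries differ in how they scale and name this parameter, so the documentation of the chosen library should be checked.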

When you design your network, please parameterise the number of hidden layers
and the number of nodes. You will be asked to evaluate both two- and three-layer perceptrons with varying number of nodes in the hidden layers, max 8 per
layer (please note that there is one hidden layer for a two-layer network and two
hidden layers for a three-layer perceptron).

Also, parameterise the regularisation method and control/monitor the speed of convergence. Keep in mind that
an inappropriate choice of the learning rate can lead to premature convergence or
to excessively long simulations (even without the effect of convergence).

4.3 Simulations and evaluation
In this part of the lab assignment you are asked to complete the tasks listed
below in the subsections and address the corresponding questions.
4.3.1 Two-layer perceptron for time series prediction – model selection, regularisation and validation
Use the available data to design and evaluate a two-layer perceptron network
for Mackey-Glass time series prediction, which involves

1. Data generation and plotting the resulting time series.

2. Training a neural network for different configurations (e.g., the number of
hidden nodes, strength of regularisation etc.) with the use of early stopping
and a regularisation technique of your own choice.

3. Validation of different network configurations and estimation of the generalisation error on a hold-out validation set, comparing different models
and selecting one for further evaluation. What is the effect of regularisation
strength and the number of hidden nodes on the validation performance?

4. Final evaluation of the selected model on a test set – the conclusive estimate of the generalisation error on the unseen data subset. Plotting these
test predictions along with the known target values (and/or plotting the
difference between the predictions made by your multi-layer perceptron
and the corresponding true time series samples).

4.3.2 Three-layer perceptron for noisy time series prediction – generalisation
Propose a few configurations of a three-layer perceptron taking the best two-layer architecture you have identified in the previous task as a starting point
(basically, use the number of hidden nodes in the first hidden layer that led
to the best generalisation performance in the first task).

Run similar tests and
analyses to those in the first task with the exception that this time you are
requested to add zero-mean Gaussian noise to all of your data. The following
subtasks will help you understand the implications of noise for generalisation.
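
Adding the noise can be sketched as follows. A sine wave stands in for the noise-free series so that the block runs on its own, and the standard deviations are the three values used in the subtasks of this section.

```matlab
x = sin(0.05 * (1:1505));          % stand-in for the noise-free series
sigmas = [0.03 0.09 0.18];         % standard deviations of the additive noise

x_noisy = zeros(numel(sigmas), numel(x));
for k = 1:numel(sigmas)
    x_noisy(k, :) = x + sigmas(k) * randn(size(x));   % zero-mean Gaussian noise
end
```

Each row of x_noisy is one noisy version of the series, from which the time-lagged inputs and the targets are then built in the same way as for the noise-free data.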

1. Examine how the validation prediction performance (estimated on a hold-out set) depends on the number of nodes in the second hidden layer for
different amounts of noise (experiment with three values of the std dev of
the additive Gaussian noise, $\sigma$ = 0.03, 0.09 and 0.18). Are there any strong
trends or particular observations to report?

2. What is the effect of regularisation? How does the regularisation parameter
interact with the amount of noise (as the amount of noise increases, are
there any incentives to change the regularisation parameter)?

3. For each configuration of noise choose the best three-layer model. Then
compare the selected models with the two-layer network in the first task
(trained here from scratch on the corresponding noisy data) in terms of
generalisation error estimated on the evaluation test set. Is either of the two
networks superior irrespective of the amount of noise? Discuss the effect
of noise on the generalisation performance.

4. What is the computational cost (time) of backprop learning involved in
scaling the network size (from two- to three-layer perceptron and for three-layer perceptrons with varying number of hidden nodes)?

Please present your key findings in tables and figures. Importantly, please
bear in mind that your neural network simulations are stochastic in nature with
different sources of variability, including random initialisation of weights and
random noise added in Part II. Make sure that you account for these factors in
your evaluation and analysis.