# DD2424 – Assignment 2 solution

\$25.00

Original Work ?

## Description

5/5 - (1 vote)

In this assignment you will train and test a two layer network with multiple outputs to classify images from the CIFAR-10 dataset. You will train
the network using mini-batch gradient descent applied to a cost function
that computes the cross-entropy loss of the classifier applied to the labelled
training data and an L2 regularization term on the weight matrix.

The overall structure of your code for this assignment should mimic that
from Assignment 1. You will have more parameters than before and you
will have to change the functions that 1) evaluate the network (the forward
pass) and 2) compute the gradients (the backward pass). We will also be
paying more attention to how to search for good parameter settings for the
network’s regularization term and the learning rate. Welcome to the nitty
gritty of training neural networks!

## Background 1: Mathematical background

The mathematical details of the network are as follows. Given an input
vector, x, of size d × 1 our classifier outputs a vector of probabilities, p
(K × 1), for each possible output label:
s1 = W1x + b1 (1)
h = max(0, s1) (2)
s = W2h + b2 (3)
p = SOFTMAX(s) (4)
where the matrix W1 and W2 have size m × d and K × m respectively and
the vectors b1 and b2 have sizes m × 1 and K × 1. SOFTMAX is defined as
SOFTMAX(s) = exp(s)
1
T exp(s)
(5)

The predicted class corresponds to the label with the highest probability:
k
∗ = arg max
1≤k≤K
{p1, . . . , pK} (6)

We have to learn the parameters W1, W2, b1 and b2 from our labelled
training data. Let D = {(xi
, yi)}
n
i=1, with each yi ∈ {1, . . . , K} and xi ∈ R
d
,
represent our labelled training data.

In the lectures we have described how to
set the parameters by minimizing the cross-entropy loss plus a regularization
term on W1 and W2. To simplify the notation we group the parameters of
1
x s1 h s p
W1 b1 W2 b2
W1x + b1 max(0, s1) W2h + b2
softmax(s) x s1 h s p l J
W1 b1 W2 b2 y λ
r
W1x + b1 max(0, s1) W2h + b2
softmax(s) − log(y
T p) l + λr
kW1k
2 + kW2k

a) Classification function b) Cost function
Figure 1: For this assignment the computational graph of the classification function
applied to an input x and the computational graph of the cost function applied to
a mini-batch of size 1.

the model as Θ = {W1, W2, b1, b2}. The cost function is
J(D, λ, Θ) = 1
|D|
X
(x,y)∈D
lcross(xi
, yi
, Θ) + λ
X
2
l=1
X
i,j
W2
l,ij (7)
where
lcross(x, y, Θ) = − log(py) (8)
and p has been calculated using equations (1-4). (Note if the label is encoded by a one-hot representation then the cross-entropy loss is defined as
− log(y
Tp).) The optimization problem we have to solve is
Θ∗ = arg min
Θ
J(D, λ, Θ) (9)

In this assignment (as described in the lectures) we will solve this optimization problem via mini-batch gradient descent (with momentum).

For mini-batch gradient descent we begin with a sensible random initialization of the parameters W, b and we then update our estimate for the
parameters with for k = 1, 2
W
(t+1)
k = W
(t)
k − η
∂J(B
(t+1), λ, Θ)
∂Wk

Θ=Θ(t)
(10)
b
(t+1)
k = b
(t)
k − η
∂J(B
(t+1), λ, Θ)
∂bk

Θ=Θ(t)
(11)
where η is the learning rate and B
(t+1) is called a mini-batch and is a random
subset of the training data D and for k = 1, 2:
∂J(B
(t+1), λ, Θ)
∂Wk
=
1
|B(t+1)|
X
(x,y)∈B(t+1)
∂lcross(x, y, Θ)
∂Wk
+ 2λWk (12)
∂J(B
(t+1), λ, Θ)
∂bk
=
1
|B(t+1)|
X
(x,y)∈B(t+1)
∂lcross(x, y, Θ)
∂bk
(13)
2

To compute the relevant gradients for the mini-batch, we then have to compute the gradient of the loss w.r.t. each training example in the mini-batch.

You should refer to the lecture notes for the explicit description of how to
compute these gradients. Once again I would advise you to implement the
Matlab efficient version as it results in significant speed ups.

## Background 2: What learning rate to use with SGD?

There is not really one optimal learning-rate when training a neural network with vanilla mini-batch gradient descent. Choose a too small learning
rate and training will take too long and too large a learning rate may result in training diverging. Ideally, one should have an adaptive learning
rate which changes to match the local shape of the cost surface at the current estimate of the network’s parameters.

Many variants of mini-batch
training try to achieve this – ADAM, mini-batch with a momentum term
etc. – and these variants are covered in the lectures. For this assignment
though we will explore the rather recent idea of exploiting cyclical learning
rates [Smith, 2015] as this approach eliminates much of the trial-and-error
associated with finding a good learning rate and some of the costly hyperparameter optimization over multiple parameters associated with training
with momentum.

The main idea of cyclical learning rates is that during
training the learning rate is periodically changed in a systematic fashion
from a small value to a large one and then from this large value back to
the small value. And this process is then repeated again and again until
training is stopped.

See figure 2 for an illustration of typical example of
how the learning rate is scheduled to change periodically during training.

This is the schedule you will implement.
Assume that you have defined a minimum ηmin and a maximum ηmax learning
rate. ηmin and ηmax define the range of learning rates where learning occurs
without divergence.

(Note, in general, these values will be affected by λ,
network architecture and parameter initialization.) Let ηt represent the
learning rate at the tth update step. One complete cycle will take 2ns
update steps, where ns is known as the stepsize. Note when t = 2lns then
ηt = ηmin and when t = (2l + 1)ns then ηt = ηmax for l = 0, 1, . . .. A rule of
thumb is to set ns = k bn/nbatchc with k being an integer between 2 and 8
and n is the total number of training examples and nbatch in the number of
examples in a batch.

The for a triangular learning rate schedule have
1. At iteration t of training if 2lns ≤ t ≤ (2l + 1)ns for some l ∈
{0, 1, 2, . . .} set
ηt = ηmin +
t − 2lns
ns
(ηmax − ηmin) (14)
3
ns 2ns 3ns 4ns 5ns 6ns
ηmin
ηmax
update step
ηt

Figure 2: Schedule for the cyclic learning rate. The graph above plots the
learning rate ηt at each update step. Initially η1 = ηmin and its value increases
linearly until it has a maximum value of ηmax when t = ns. Then the ηt decreases
linearly until it has a value of η1 = ηmin again when t = 2ns. The cycle can be
repeated as many times as one likes and in this example it is repeated three times.

For most practical applications the number of cycles is ≥ 2 and ≤ 10. The positive
integer ns is known as the stepsize and is usually chosen so that one cycle of training
corresponds to a multiple of epochs of training.
while if (2l + 1)ns ≤ t ≤ 2(l + 1)ns for some l ∈ {0, 1, 2, . . .} set
ηt = ηmax −
t − (2 l + 1)ns
ns
(ηmax − ηmin) (15)

2. Update the current estimate of the parameters with
θt = θt−1 − ηt
∂J
∂θ

θt−1

Normally training is run for a set number of complete cycles and is stopped
when the learning rate is at its smallest, that is t = 2lns for some l ≥ 2. For
this assignment I will give you values for ηmin, ηmax and ns that work well
for the default network used in the assignment. You can read [Smith, 2015],
the paper which forms the basis for this assignment, to get guidelines and
tests about how to set ηmin and ηmax.

## Exercise 1: Read in the data & initialize the parameters of the network

For this assignment (to begin) you will just use the data in the file data batch 1.mat
for training, the file data batch 2.mat for validation and the file test batch.mat
for testing.

You have already written a function for Assignment 1 to read in
the data and pre-process it slightly. For this assignment it helps if we apply
a little more pre-processing to the raw input data. You should transform it
to have zero mean. If trainX is the d × n image data matrix (each column
corresponds to an image) for the training data then
mean X = mean(trainX, 2);
std X = std(trainX, 0, 2);

Then you should normalize the training, validation and test data with respect to this mean and standard deviations. If X is an d × n image data
matrix then you can normalize X as
X = X – repmat(mean X, [1, size(X, 2)]);
X = X ./ repmat(std X, [1, size(X, 2)]);

Note: You could also just divide each X by 255 when you load it, so its
values are between 0 and 1 and then just compute the mean vector of the
training data and centre your data accordingly.

Next you have to set up the data structure for the parameters of the network
and to initialize their values. In the assignment we will just focus on a
network that has m=50 nodes in the hidden layer. As W1 and W2 will have
different sizes, as will b1 and b2, I recommend you use cell arrays to store
these matrices and vectors.

To set their initial values I typically set the bias
vectors to zero and the entries in the weight matrices are random draws from
a Gaussian distribution with mean 0 and standard deviation 1/sqrt(d) for
layer 1 and 1/sqrt(m) for layer 2. (You can use the Matlab function randn
to generate these random draws).

You should probably write a separate
function to initialize the parameters as you will be initializing your network
frequently when you perform a grid search for a good value of lambda.

## Exercise 2: Compute the gradients for the network parameters

Next you will write functions to compute the gradient of your two-layer
network w.r.t. its weight and bias parameters. As before I suggest you reuse much of your code from Assignment1.m You will need to write (update)
the following functions (that you wrote previously):

• Compute the network function, to apply the function defined in figure
1(a), on a mini-batch of data and that returns the final p values and the
intermediary activation values. The latter will be needed to compute

• Compute the gradients of the cost function for a mini-batch of data
given the values computed from the forward pass.

Once you have written the code to compute the gradients the next step is
debugging. Download code from the Canvas page that computes the gradient vectors numerically. Note there are two versions 1) a slower but more
accurate version based on the centered difference formula and 2) a faster
but less accurate based on the finite difference method.

As in Assignment
1 you will probably have to make small changes to the function to make
it compatible with your own code, especially if you have not stored the W’s
and b’s as cell arrays. You should use the same procedures as described in

As in that assignment, when checking
the gradients, you should reduce the dimensionality of the input vector (both
for speed and numerical precision issues) that is call your gradient computing functions (analytical and numerical) with something like trainX(1:20,
1:2), trainY(:, 1:2) and W(:, 1:20). I also found that h≈1e-5 gave the
best precision for the numerical gradients.

are correct then you can move forward with the following sanity check. Try
and train your network on a small amount of the training data (say 100
examples) with regularization turned off (lambda=0) and check if you can
overfit to the training data and get a very low loss on the training data after
training for a sufficient number of epochs (∼200) and with a reasonable η.

mini-batch gradient descent algorithm are okay.

## Exercise 3: Train your network with cyclical learning rates

Up until now you have trained your networks with vanilla mini-batch gradient descent. To help speed up training times and avoid time-consuming
searches for good values of η you will now implement mini-batch gradient
descent training where the learning rate at each update step is defined by
equations (14) and (15) and where you have set eta min = 1e-5, eta max
= 1e-1 and n s=500 and the batch size to 100.

3 shows the training and validation loss/cost I achieved when lambda=.01
and I ran training for one cycle that is from t=1 until t=2n s. Once you
have convinced yourself that you have a bug free implementation of the
cyclic scheduled learning rate then you are ready to somewhat optimize the

## Exercise 4: Train your network for real

Now you should run your training algorithm for more cycles (say 3) and
for a larger n s=800. For reference the performance curves I obtained with
these parameter settings are shown in figure 4. I measured my performance
on the whole training and validation set 9 times per cycle. At the moment
you have not optimized the value of regularization term lambda at all.

200 400 600 800 1,000
1
2
3
4
update step
cost
training
validation
200 400 600 800 1,000
0.5
1
1.5
2
update step
loss
training
validation
200 400 600 800 1,000
0.2
0.4
0.6
update step
accuracy
training
validation
Cost plot Loss plot Accuracy plot

Figure 3: Training curves (cost, loss, accuracy) for one cycle of training.
In this example one batch of the training data is used. The hyper-parameter settings
of the training algorithm are eta min = 1e-5, eta max = 1e-1, lambda=.01 and
n s=500.

The last parameter setting implies, as the batch size is 100, one full cycle
corresponds to 10 epochs of training. In this simple example only one cycle of
training is performed, but already at the end of training a test accuracy of 45.85%
is achieved.

1,000 2,000 3,000 4,000 5,000
1
2
3
4
update step
cost
training
validation
1,000 2,000 3,000 4,000 5,000
0.5
1
1.5
2
update step
loss
training
validation
1,000 2,000 3,000 4,000 5,000
0.2
0.4
0.6
update step
accuracy
training
validation
Cost plot Loss plot Accuracy plot
Figure 4: Training curves (cost, loss, accuracy) for three cycles of
training. In this example one batch of the training data is used.

The hyperparameter settings of the training algorithm are eta min = 1e-5, eta max = 1e-1,
lambda=.01 and n s=800. In the loss and accuracy plots you can clearly see how
the loss and accuracy vary as the ηt varies. After the three cycles of training a test
accuracy of 47.47% is achieved.

Coarse-to-fine random search to set lambda. At this point you may
need to restructure/re-organize your code so that you can cleanly and easily
call function(s) to initialize your network, perform training and then check
the learnt network’s best performance on the validation set.

To perform
your random search you’ll need to train your network from a random initialization and measure its performance (via the accuracy on the validation
set) multiple times as the hyper-parameters lambda varies. You should first
perform a coarse search over a very broad range of values for lambda. To
perform this search you should use most of the training data available and
the rest for the validation.

This is because more data and increasing the
value of lambda are both forms of regularization. When you have less training data you will need a higher lambda and when you have more training
data you will need a lower lambda.

Thus you should perform your search
using the same ballpark amount of data that you will use when you train

Thus for this part of the assignment you should load
all 5 training batches and use all for training except for 5000 images that
should be used as your validation set.

When you train each network you
should only run 2 cycles (1 cycle could also potentially work) of training and
you should set n s = 2 floor(n / n batch) to get a good idea of performance for a given lambda. Search for lambda on a log scale, for example to
generate one random sample for the learning rate uniformly in the range of
10^l min to 10^l max
l = l min + (l max – l min)*rand(1, 1);
lambda = 10^l;

In my (very non-exhaustive) experiments for the coarse search I set l min=-5
and l max=-1. You should perhaps try somewhere between 10-20 different
values for lambda generated by uniform sampling. Save all the parameter
settings tried and their resulting best scores on the validation set to a file.

Inspect this file after finishing the coarse search and see what parameter
ranges gave the best results. Repeat the random search but with your search
adjusted to a narrower range and possibly run training for a few more cycles
than before. Once again save the results and look for the best parameter
settings.

You could do another round of random search or just use the best
lambda found, and then train the network using most of the training data,
for more cycles and for a larger n s and see what final performance you get
on the test set.

You should be getting performances of >50% for your good
settings given ≥ 2 cycles of training (or even just one cycle of training). For
reference I was able to train networks that achieved test accuracies ∼ 52%
without too much exhaustive searching.

To complete the assignment:

To pass the assignment you need to upload to Canvas:
1. The code for your assignment assembled into one file.

2. A brief pdf report with the following content:
Give evidence for these conclusions.

ii) The curves for the training and validation loss/cost when using
the cyclical learning rates with the default values, that is replicate
figures 3 and 4. Also comment on the curves.

iii) State the range of the values you searched for lambda, the number of cycles used for training during the coarse search and the
hyper-parameter settings for the 3 best performing networks you
trained.

iv) State the range of the values you searched for lambda, the number of cycles used for training during the fine search, and the
hyper-parameter settings for the 3 best performing networks you
trained.

v) For your best found lambda setting (according to performance on
the validation set), train the network on all the training data (all
the batch data), except for 1000 examples in a validation set, for
∼3 cycles. Plot the training and validation loss plots and then
report the learnt network’s performance on the test data.

## Exercise 5: Optional for bonus points

1. Optimize the performance of the network. It would be interesting to discover what is the best possible performance achievable by
Assignment 2’s network (a 2-layer fully connected network) on CIFAR10.

Here are some tricks/avenues you can explore to help bump up
performance:
(a) Do a more exhaustive random search to find good values for the amount of
regularization, the length of the cycles, number of cycles etc…

(b) You could also explore whether having more hidden nodes improves the final
classification rate. One would expect that with more hidden nodes then the
amount of regularization would have to increase.

(c) At the end of each cycle of training you should be in the vicinity of a local
minimum. Research in this area seems to indicate that with the cyclical
learning rate method the local minima found at the end of each cycle are
different and thus you can build an ensemble of classifiers by saving the
network parameter values at the end of each cycle. Using this ensemble to
classify may result in better classification than one model.

(d) Apply dropout to your training if you have a high number of hidden nodes
and you feel you need more regularization.

(e) Augment your training data by applying a small random geometric and photometric jitter to a training image each time you encounter it during training.
You can do this on the fly by applying a random jitter to each image in the
mini-batch before doing the forward and backward pass.

Bonus Points Available: 2 points (if you complete at least 3 improvements) – you can follow my suggestions, think of your own or some
combination of the two.

2. Read reference [Smith, 2015] and set eta min and eta max to
the guidelines in the paper. In the assignment I gave you values
I had found for eta min and eta max. You should write code to find
good values for these quantities for yourself and for networks with a
different number of hidden nodes and regularization.

Bonus Points Available: 2 points
To get the bonus point(s) you must upload the following to the Canvas
assignment page Assignment 2 Bonus Points: