## Description

In this assignment you will train and test a two layer network with multiple outputs to classify images from the CIFAR-10 dataset. You will train

the network using mini-batch gradient descent applied to a cost function

that computes the cross-entropy loss of the classifier applied to the labelled

training data and an L2 regularization term on the weight matrix.

The overall structure of your code for this assignment should mimic that

from Assignment 1. You will have more parameters than before and you

will have to change the functions that 1) evaluate the network (the forward

pass) and 2) compute the gradients (the backward pass). We will also be

paying more attention to how to search for good parameter settings for the

network’s regularization term and the learning rate. Welcome to the nitty

gritty of training neural networks!

## Background 1: Mathematical background

The mathematical details of the network are as follows. Given an input

vector, x, of size d × 1 our classifier outputs a vector of probabilities, p

(K × 1), for each possible output label:

s1 = W1x + b1 (1)

h = max(0, s1) (2)

s = W2h + b2 (3)

p = SOFTMAX(s) (4)

where the matrix W1 and W2 have size m × d and K × m respectively and

the vectors b1 and b2 have sizes m × 1 and K × 1. SOFTMAX is defined as

SOFTMAX(s) = exp(s)

1

T exp(s)

(5)

The predicted class corresponds to the label with the highest probability:

k

∗ = arg max

1≤k≤K

{p1, . . . , pK} (6)

We have to learn the parameters W1, W2, b1 and b2 from our labelled

training data. Let D = {(xi

, yi)}

n

i=1, with each yi ∈ {1, . . . , K} and xi ∈ R

d

,

represent our labelled training data.

In the lectures we have described how to

set the parameters by minimizing the cross-entropy loss plus a regularization

term on W1 and W2. To simplify the notation we group the parameters of

1

x s1 h s p

W1 b1 W2 b2

W1x + b1 max(0, s1) W2h + b2

softmax(s) x s1 h s p l J

W1 b1 W2 b2 y λ

r

W1x + b1 max(0, s1) W2h + b2

softmax(s) − log(y

T p) l + λr

kW1k

2 + kW2k

a) Classification function b) Cost function

Figure 1: For this assignment the computational graph of the classification function

applied to an input x and the computational graph of the cost function applied to

a mini-batch of size 1.

the model as Θ = {W1, W2, b1, b2}. The cost function is

J(D, λ, Θ) = 1

|D|

X

(x,y)∈D

lcross(xi

, yi

, Θ) + λ

X

2

l=1

X

i,j

W2

l,ij (7)

where

lcross(x, y, Θ) = − log(py) (8)

and p has been calculated using equations (1-4). (Note if the label is encoded by a one-hot representation then the cross-entropy loss is defined as

− log(y

Tp).) The optimization problem we have to solve is

Θ∗ = arg min

Θ

J(D, λ, Θ) (9)

In this assignment (as described in the lectures) we will solve this optimization problem via mini-batch gradient descent (with momentum).

For mini-batch gradient descent we begin with a sensible random initialization of the parameters W, b and we then update our estimate for the

parameters with for k = 1, 2

W

(t+1)

k = W

(t)

k − η

∂J(B

(t+1), λ, Θ)

∂Wk

Θ=Θ(t)

(10)

b

(t+1)

k = b

(t)

k − η

∂J(B

(t+1), λ, Θ)

∂bk

Θ=Θ(t)

(11)

where η is the learning rate and B

(t+1) is called a mini-batch and is a random

subset of the training data D and for k = 1, 2:

∂J(B

(t+1), λ, Θ)

∂Wk

=

1

|B(t+1)|

X

(x,y)∈B(t+1)

∂lcross(x, y, Θ)

∂Wk

+ 2λWk (12)

∂J(B

(t+1), λ, Θ)

∂bk

=

1

|B(t+1)|

X

(x,y)∈B(t+1)

∂lcross(x, y, Θ)

∂bk

(13)

2

To compute the relevant gradients for the mini-batch, we then have to compute the gradient of the loss w.r.t. each training example in the mini-batch.

You should refer to the lecture notes for the explicit description of how to

compute these gradients. Once again I would advise you to implement the

Matlab efficient version as it results in significant speed ups.

## Background 2: What learning rate to use with SGD?

There is not really one optimal learning-rate when training a neural network with vanilla mini-batch gradient descent. Choose a too small learning

rate and training will take too long and too large a learning rate may result in training diverging. Ideally, one should have an adaptive learning

rate which changes to match the local shape of the cost surface at the current estimate of the network’s parameters.

Many variants of mini-batch

training try to achieve this – ADAM, mini-batch with a momentum term

etc. – and these variants are covered in the lectures. For this assignment

though we will explore the rather recent idea of exploiting cyclical learning

rates [Smith, 2015] as this approach eliminates much of the trial-and-error

associated with finding a good learning rate and some of the costly hyperparameter optimization over multiple parameters associated with training

with momentum.

The main idea of cyclical learning rates is that during

training the learning rate is periodically changed in a systematic fashion

from a small value to a large one and then from this large value back to

the small value. And this process is then repeated again and again until

training is stopped.

See figure 2 for an illustration of typical example of

how the learning rate is scheduled to change periodically during training.

This is the schedule you will implement.

Assume that you have defined a minimum ηmin and a maximum ηmax learning

rate. ηmin and ηmax define the range of learning rates where learning occurs

without divergence.

(Note, in general, these values will be affected by λ,

network architecture and parameter initialization.) Let ηt represent the

learning rate at the tth update step. One complete cycle will take 2ns

update steps, where ns is known as the stepsize. Note when t = 2lns then

ηt = ηmin and when t = (2l + 1)ns then ηt = ηmax for l = 0, 1, . . .. A rule of

thumb is to set ns = k bn/nbatchc with k being an integer between 2 and 8

and n is the total number of training examples and nbatch in the number of

examples in a batch.

The for a triangular learning rate schedule have

1. At iteration t of training if 2lns ≤ t ≤ (2l + 1)ns for some l ∈

{0, 1, 2, . . .} set

ηt = ηmin +

t − 2lns

ns

(ηmax − ηmin) (14)

3

ns 2ns 3ns 4ns 5ns 6ns

ηmin

ηmax

update step

ηt

Figure 2: Schedule for the cyclic learning rate. The graph above plots the

learning rate ηt at each update step. Initially η1 = ηmin and its value increases

linearly until it has a maximum value of ηmax when t = ns. Then the ηt decreases

linearly until it has a value of η1 = ηmin again when t = 2ns. The cycle can be

repeated as many times as one likes and in this example it is repeated three times.

For most practical applications the number of cycles is ≥ 2 and ≤ 10. The positive

integer ns is known as the stepsize and is usually chosen so that one cycle of training

corresponds to a multiple of epochs of training.

while if (2l + 1)ns ≤ t ≤ 2(l + 1)ns for some l ∈ {0, 1, 2, . . .} set

ηt = ηmax −

t − (2 l + 1)ns

ns

(ηmax − ηmin) (15)

2. Update the current estimate of the parameters with

θt = θt−1 − ηt

∂J

∂θ

θt−1

Normally training is run for a set number of complete cycles and is stopped

when the learning rate is at its smallest, that is t = 2lns for some l ≥ 2. For

this assignment I will give you values for ηmin, ηmax and ns that work well

for the default network used in the assignment. You can read [Smith, 2015],

the paper which forms the basis for this assignment, to get guidelines and

tests about how to set ηmin and ηmax.

## Exercise 1: Read in the data & initialize the parameters of the network

For this assignment (to begin) you will just use the data in the file data batch 1.mat

for training, the file data batch 2.mat for validation and the file test batch.mat

for testing.

You have already written a function for Assignment 1 to read in

the data and pre-process it slightly. For this assignment it helps if we apply

a little more pre-processing to the raw input data. You should transform it

to have zero mean. If trainX is the d × n image data matrix (each column

corresponds to an image) for the training data then

mean X = mean(trainX, 2);

std X = std(trainX, 0, 2);

Then you should normalize the training, validation and test data with respect to this mean and standard deviations. If X is an d × n image data

matrix then you can normalize X as

X = X – repmat(mean X, [1, size(X, 2)]);

X = X ./ repmat(std X, [1, size(X, 2)]);

Note: You could also just divide each X by 255 when you load it, so its

values are between 0 and 1 and then just compute the mean vector of the

training data and centre your data accordingly.

Next you have to set up the data structure for the parameters of the network

and to initialize their values. In the assignment we will just focus on a

network that has m=50 nodes in the hidden layer. As W1 and W2 will have

different sizes, as will b1 and b2, I recommend you use cell arrays to store

these matrices and vectors.

To set their initial values I typically set the bias

vectors to zero and the entries in the weight matrices are random draws from

a Gaussian distribution with mean 0 and standard deviation 1/sqrt(d) for

layer 1 and 1/sqrt(m) for layer 2. (You can use the Matlab function randn

to generate these random draws).

You should probably write a separate

function to initialize the parameters as you will be initializing your network

frequently when you perform a grid search for a good value of lambda.

## Exercise 2: Compute the gradients for the network parameters

Next you will write functions to compute the gradient of your two-layer

network w.r.t. its weight and bias parameters. As before I suggest you reuse much of your code from Assignment1.m You will need to write (update)

the following functions (that you wrote previously):

• Compute the network function, to apply the function defined in figure

1(a), on a mini-batch of data and that returns the final p values and the

intermediary activation values. The latter will be needed to compute

the gradients.

• Compute the gradients of the cost function for a mini-batch of data

given the values computed from the forward pass.

Once you have written the code to compute the gradients the next step is

debugging. Download code from the Canvas page that computes the gradient vectors numerically. Note there are two versions 1) a slower but more

accurate version based on the centered difference formula and 2) a faster

but less accurate based on the finite difference method.

As in Assignment

1 you will probably have to make small changes to the function to make

it compatible with your own code, especially if you have not stored the W’s

and b’s as cell arrays. You should use the same procedures as described in

Assignment1 to check the gradients.

As in that assignment, when checking

the gradients, you should reduce the dimensionality of the input vector (both

for speed and numerical precision issues) that is call your gradient computing functions (analytical and numerical) with something like trainX(1:20,

1:2), trainY(:, 1:2) and W(:, 1:20). I also found that h≈1e-5 gave the

best precision for the numerical gradients.

Once you have convinced yourself that your analytic gradient computations

are correct then you can move forward with the following sanity check. Try

and train your network on a small amount of the training data (say 100

examples) with regularization turned off (lambda=0) and check if you can

overfit to the training data and get a very low loss on the training data after

training for a sufficient number of epochs (∼200) and with a reasonable η.

Being able to achieve this indicates that your gradient computations and

mini-batch gradient descent algorithm are okay.

## Exercise 3: Train your network with cyclical learning rates

Up until now you have trained your networks with vanilla mini-batch gradient descent. To help speed up training times and avoid time-consuming

searches for good values of η you will now implement mini-batch gradient

descent training where the learning rate at each update step is defined by

equations (14) and (15) and where you have set eta min = 1e-5, eta max

= 1e-1 and n s=500 and the batch size to 100.

To help you debug figure

3 shows the training and validation loss/cost I achieved when lambda=.01

and I ran training for one cycle that is from t=1 until t=2n s. Once you

have convinced yourself that you have a bug free implementation of the

cyclic scheduled learning rate then you are ready to somewhat optimize the

performance of your network.

## Exercise 4: Train your network for real

Now you should run your training algorithm for more cycles (say 3) and

for a larger n s=800. For reference the performance curves I obtained with

these parameter settings are shown in figure 4. I measured my performance

on the whole training and validation set 9 times per cycle. At the moment

you have not optimized the value of regularization term lambda at all.

200 400 600 800 1,000

1

2

3

4

update step

cost

training

validation

200 400 600 800 1,000

0.5

1

1.5

2

update step

loss

training

validation

200 400 600 800 1,000

0.2

0.4

0.6

update step

accuracy

training

validation

Cost plot Loss plot Accuracy plot

Figure 3: Training curves (cost, loss, accuracy) for one cycle of training.

In this example one batch of the training data is used. The hyper-parameter settings

of the training algorithm are eta min = 1e-5, eta max = 1e-1, lambda=.01 and

n s=500.

The last parameter setting implies, as the batch size is 100, one full cycle

corresponds to 10 epochs of training. In this simple example only one cycle of

training is performed, but already at the end of training a test accuracy of 45.85%

is achieved.

1,000 2,000 3,000 4,000 5,000

1

2

3

4

update step

cost

training

validation

1,000 2,000 3,000 4,000 5,000

0.5

1

1.5

2

update step

loss

training

validation

1,000 2,000 3,000 4,000 5,000

0.2

0.4

0.6

update step

accuracy

training

validation

Cost plot Loss plot Accuracy plot

Figure 4: Training curves (cost, loss, accuracy) for three cycles of

training. In this example one batch of the training data is used.

The hyperparameter settings of the training algorithm are eta min = 1e-5, eta max = 1e-1,

lambda=.01 and n s=800. In the loss and accuracy plots you can clearly see how

the loss and accuracy vary as the ηt varies. After the three cycles of training a test

accuracy of 47.47% is achieved.

Coarse-to-fine random search to set lambda. At this point you may

need to restructure/re-organize your code so that you can cleanly and easily

call function(s) to initialize your network, perform training and then check

the learnt network’s best performance on the validation set.

To perform

your random search you’ll need to train your network from a random initialization and measure its performance (via the accuracy on the validation

set) multiple times as the hyper-parameters lambda varies. You should first

perform a coarse search over a very broad range of values for lambda. To

perform this search you should use most of the training data available and

the rest for the validation.

This is because more data and increasing the

value of lambda are both forms of regularization. When you have less training data you will need a higher lambda and when you have more training

data you will need a lower lambda.

Thus you should perform your search

using the same ballpark amount of data that you will use when you train

your final network.

Thus for this part of the assignment you should load

all 5 training batches and use all for training except for 5000 images that

should be used as your validation set.

When you train each network you

should only run 2 cycles (1 cycle could also potentially work) of training and

you should set n s = 2 floor(n / n batch) to get a good idea of performance for a given lambda. Search for lambda on a log scale, for example to

generate one random sample for the learning rate uniformly in the range of

10^l min to 10^l max

l = l min + (l max – l min)*rand(1, 1);

lambda = 10^l;

In my (very non-exhaustive) experiments for the coarse search I set l min=-5

and l max=-1. You should perhaps try somewhere between 10-20 different

values for lambda generated by uniform sampling. Save all the parameter

settings tried and their resulting best scores on the validation set to a file.

Inspect this file after finishing the coarse search and see what parameter

ranges gave the best results. Repeat the random search but with your search

adjusted to a narrower range and possibly run training for a few more cycles

than before. Once again save the results and look for the best parameter

settings.

You could do another round of random search or just use the best

lambda found, and then train the network using most of the training data,

for more cycles and for a larger n s and see what final performance you get

on the test set.

You should be getting performances of >50% for your good

settings given ≥ 2 cycles of training (or even just one cycle of training). For

reference I was able to train networks that achieved test accuracies ∼ 52%

without too much exhaustive searching.

To complete the assignment:

To pass the assignment you need to upload to Canvas:

1. The code for your assignment assembled into one file.

2. A brief pdf report with the following content:

i) State how you checked your analytic gradient computations and

whether you think that your gradient computations were bug free.

Give evidence for these conclusions.

ii) The curves for the training and validation loss/cost when using

the cyclical learning rates with the default values, that is replicate

figures 3 and 4. Also comment on the curves.

iii) State the range of the values you searched for lambda, the number of cycles used for training during the coarse search and the

hyper-parameter settings for the 3 best performing networks you

trained.

iv) State the range of the values you searched for lambda, the number of cycles used for training during the fine search, and the

hyper-parameter settings for the 3 best performing networks you

trained.

v) For your best found lambda setting (according to performance on

the validation set), train the network on all the training data (all

the batch data), except for 1000 examples in a validation set, for

∼3 cycles. Plot the training and validation loss plots and then

report the learnt network’s performance on the test data.

## Exercise 5: Optional for bonus points

1. Optimize the performance of the network. It would be interesting to discover what is the best possible performance achievable by

Assignment 2’s network (a 2-layer fully connected network) on CIFAR10.

Here are some tricks/avenues you can explore to help bump up

performance:

(a) Do a more exhaustive random search to find good values for the amount of

regularization, the length of the cycles, number of cycles etc…

(b) You could also explore whether having more hidden nodes improves the final

classification rate. One would expect that with more hidden nodes then the

amount of regularization would have to increase.

(c) At the end of each cycle of training you should be in the vicinity of a local

minimum. Research in this area seems to indicate that with the cyclical

learning rate method the local minima found at the end of each cycle are

different and thus you can build an ensemble of classifiers by saving the

network parameter values at the end of each cycle. Using this ensemble to

classify may result in better classification than one model.

(d) Apply dropout to your training if you have a high number of hidden nodes

and you feel you need more regularization.

(e) Augment your training data by applying a small random geometric and photometric jitter to a training image each time you encounter it during training.

You can do this on the fly by applying a random jitter to each image in the

mini-batch before doing the forward and backward pass.

Bonus Points Available: 2 points (if you complete at least 3 improvements) – you can follow my suggestions, think of your own or some

combination of the two.

2. Read reference [Smith, 2015] and set eta min and eta max to

the guidelines in the paper. In the assignment I gave you values

I had found for eta min and eta max. You should write code to find

good values for these quantities for yourself and for networks with a

different number of hidden nodes and regularization.

Bonus Points Available: 2 points

To get the bonus point(s) you must upload the following to the Canvas

assignment page Assignment 2 Bonus Points:

1. Your code.

2. A Pdf document which

– reports on your trained network with the best test accuracy, what

improvements you made and which ones brought the largest gains

(if any!). (Exercise 5.1)

– summarizes briefly how you computed eta min and eta max, the

initial training loss/cost plot used to find these quantities and the

training loss/cost plot for the final network you train with your

newly found settings. (Exercise 5.2)

References

[Smith, 2015] Smith, L. N. (2015). Cyclical learning rates for training neural

networks. arXiv:1506.01186 [cs.CV].