## Description

In this assignment you will train and test a one layer network with multiple outputs to classify images from the CIFAR-10 dataset. You will train

the network using mini-batch gradient descent applied to a cost function

that computes the cross-entropy loss of the classifier applied to the labelled

training data and an L2 regularization term on the weight matrix.

## Background 1: Mathematical background

The mathematical details of the network are as follows. Given an input

vector, x, of size d × 1 our classifier outputs a vector of probabilities, p

(K × 1), for each possible output label:

s = Wx + b (1)

p = SOFTMAX(s) (2)

where the matrix W has size K × d, the vector b is K × 1 and SOFTMAX is

defined as

SOFTMAX(s) = exp(s)

1

T exp(s)

(3)

The predicted class corresponds to the label with the highest probability:

k

∗ = arg max

1≤k≤K

{p1, . . . , pK} (4)

x z s p

W b

Wx z + b softmax(s)

x z s p l

W b y

Wx z + b softmax(s) − log(y

T p)

a) Classification function b) Loss function

Figure 1: Computational graph of the classification and loss function that is applied to each input x in this assignment.

The parameters W and b of our classifier are what we have to learn by

exploiting labelled training data. Let D = {(xi

, yi)}

n

i=1, with each yi ∈

{1, . . . , K} and xi ∈ R

d

, represent our labelled training data. In the lectures

we have described how to set the parameters by minimizing the cross-entropy

loss plus a regularization term on W.

Mathematically this cost function is

J(D, λ, W, b) = 1

|D|

X

(x,y)∈D

lcross(x, y, W, b) + λ

X

i,j

W2

ij (5)

1

where

lcross(x, y, W, b) = − log(py) (6)

and p has been calculated using equations (1, 2). (Note if the label is

encoded by a one-hot representation then the cross-entropy loss is defined

as − log(y

Tp).) The optimization problem we have to solve is

W∗

, b

∗ = arg min

W,b

J(D, λ, W, b) (7)

In this assignment (as described in the lectures) we will solve this optimization problem via mini-batch gradient descent.

For mini-batch gradient descent we begin with a sensible random initialization of the parameters W, b and we then update our estimate for the

parameters with

W(t+1) = W(t) − η

∂J(B

(t+1), λ, W, b)

∂W

W=W(t)

,b=b(t)

(8)

b

(t+1) = b

(t) − η

∂J(B

(t+1), λ, W, b)

∂b

W=W(t)

,b=b(t)

(9)

where η is the learning rate and B

(t+1) is called a mini-batch and is a random

subset of the training data D and

∂J(B

(t+1), λ, W, b)

∂W =

1

|B(t+1)|

X

(x,y)∈B(t+1)

∂lcross(x, y, W, b)

∂W + 2λW (10)

∂J(B

(t+1), λ, W, b)

∂b

=

1

|B(t+1)|

X

(x,y)∈B(t+1)

∂lcross(x, y, W, b)

∂b

(11)

To compute the relevant gradients for the mini-batch, we then have to compute the gradient of the loss w.r.t. each training example in the mini-batch.

You should refer to the lecture notes for the explicit description of how to

compute these gradients.

Before Starting

I assume that you will complete the assignment in Matlab. You can complete

the assignment in another programming language. If you do though I will

not answer programming specific questions and you will also probably have

to find a way to display, plot and graph your results.

Besides invoking Matlab commands, you will be required to run a few operating system commands. For these commands I will assume your computer’s

2

x z s p l J

W b y λ

r

Wx z + b softmax(s) − log(y

T p) l + λr

kWk

2

Figure 2: Computational graph of the cost function applied to a mini-batch containing one training example x. If you have a mini-batch of size greater than one,

then the loss computations are repeated for each entry in the mini-batch (as in the

above graph), but the regularization term is only computed once.

operating system is either linux or unix. If otherwise, you’ll have to fend for

yourself. But all the non-Matlab commands needed are more-or-less trivial.

The notes for this assignment, and those to follow, will give you pointers

about which Matlab commands to use. However, I will not give detailed

explanations about their usage.

I assume you have some previous experience

with Matlab, are aware of many of the in-built functions, how to manipulate

vectors and matrices and how to write your own functions etc. Keep in

mind the function help can be called to obtain information about particular

functions.

So for example

>> help plot

will display information about the Matlab command plot.

## Background 2: Getting Started

Set up your environment

Create a new directory to hold all the Matlab files you will write for this

course:

$ mkdir DirName

$ cd DirName

$ mkdir Datasets

$ mkdir Result Pics

Download the CIFAR-10 dataset stored in its Matlab format from this link.

Move the cifar-10-matlab.tar.gz file to the Datasets directory you have

just created, untar the file and then move up to the parent directory. Also

download the file montage.m from the KTH Social website and move it to

DirName.

3

$ mv montage.m DirName/

$ mv cifar-10-matlab.tar.gz DirName/Datasets

$ cd DirName/Datasets

$ tar xvfz cifar-10-matlab.tar.gz

$ cd ..

## Background 3: Useful Display Function

You have copied a function called montage.m, which is a slightly modified

version of the function available at:

http://www.mathworks.com/matlabcentral/fileexchange/22387

This is a useful function as it allows you to efficiently view the images in a

directory or in a Matlab array or a cell array.

To look at some of the images

from the CIFAR-10 dataset you can use the following set of commands:

>> addpath DirName/Datasets/cifar-10-batches-mat/;

>> A = load(’data batch 1.mat’);

>> I = reshape(A.data’, 32, 32, 3, 10000);

>> I = permute(I, [2, 1, 3, 4]);

>> montage(I(:, :, :, 1:500), ’Size’, [5,5]);

This sequence of commands tells Matlab to add the directory of the CIFAR-10

dataset to its path. Then it loads one of the .mat containing image and

label data. You access the image data with the command A.data. It has

size 10000 × 3072.

Each row of A.data corresponds to an image of size

32 × 32 × 3 that has been flattened into a row vector. We can re-arrange

A.data into an array format expected by montage (32×32×3×10000) using

the command reshape. After reshaping the array, the rows and columns of

each image still need to be permuted and this is achieved with the permute

command.

Now you have an array format montage expects. In the above

code we just view the first 500 images. Use help to find out the different

ways montage can be called.

You have looked at some of the CIFAR-10 images. Now it is time to start

writing some code.

## Exercise 1: Training a multi-linear classifier

For this assignment you will just use data in the file data batch 1.mat for

training, the file data batch 2.mat for validation and the file test batch.mat

for testing. Create a file Assignment1.m. In this file you will write the code

for this assignment and the necessary (sub-)functions.

Here are my recommendations for which functions to write and the order in which to write

them:

1. Write a function that reads in the data from a CIFAR-10 batch file

and returns the image and label data in separate files. Make sure to

convert your image data to single or double format and divide it by

255 to convert the data to be between 0 and 1. I would suggest the

function has the following input and outputs

function [X, Y, y] = LoadBatch(filename)

where

• X contains the image pixel data, has size d×N, is of type double or

single and has entries between 0 and 1. N is the number of images

(10000) and d the dimensionality of each image (3072=32×32×3).

• Y is K×N (K= # of labels = 10) and contains the one-hot representation

of the label for each image.

• y is a vector of length N containing the label for each image. A note

of caution. CIFAR-10 encodes the labels as integers between 0-9 but

Matlab indexes matrices and vectors starting at 1. Therefore it may be

easier to encode the labels between 1-10.

This file will not be long. You just need to call A = load(fname);

and then rearrange and store the values in A.data and A.labels.

Top-level: Read in and store the training, validation and test data.

2. Top-Level: After reading in the data, you can initialize the parameters of the model W and b as you now know what size they should be.

W has size K×d and b is K×1. Initialize each entry to have Gaussian

random values with zero mean and standard deviation .01. You should

use the Matlab function randn to create this data.

3. Write a function that evaluates the network function, i.e. equations

(1, 2), on multiple images and returns the results. I would suggest the

function has the following form

function P = EvaluateClassifier(X, W, b)

where

• each column of X corresponds to an image and it has size d×n.

• W and b are the parameters of the network.

• each column of P contains the probability for each label for the image

in the corresponding column of X. P has size K×n.

Top-level: Check the function runs on a subset of the training data

given a random initialization of the network’s parameters:

P = EvaluateClassifier(trainX(:, 1:100), W, b).

4. Write the function that computes the cost function given by equation

(5) for a set of images. I suggest the function has the following inputs

and outputs

function J = ComputeCost(X, Y, W, b, lambda)

where

• each column of X corresponds to an image and X has size d×n.

• each column of Y (K×n) is the one-hot ground truth label for the corresponding column of X or Y is the (1×n) vector of ground truth labels.

• J is a scalar corresponding to the sum of the loss of the network’s

predictions for the images in X relative to the ground truth labels and

the regularization term on W.

5. Write a function that computes the accuracy of the network’s predictions given by equation (4) on a set of data. Remember the accuracy

of a classifier for a given set of examples is the percentage of examples

for which it gets the correct answer.

I suggest the function has the

following inputs and outputs

function acc = ComputeAccuracy(X, y, W, b)

where

• each column of X corresponds to an image and X has size d×n.

• y is the vector of ground truth labels of length n.

• acc is a scalar value containing the accuracy.

6. Write the function that evaluates, for a mini-batch, the gradients of

the cost function w.r.t. W and b, that is equations (10, 11). I suggest

the function has the form

function [grad W, grad b] = ComputeGradients(X, Y, P, W, lambda)

where

• each column of X corresponds to an image and it has size d×n.

• each column of Y (K×n) is the one-hot ground truth label for the corresponding column of X.

• each column of P contains the probability for each label for the image

in the corresponding column of X. P has size K×n.

• grad W is the gradient matrix of the cost J relative to W and has size

K×d.

• grad b is the gradient vector of the cost J relative to b and has size

K×1.

Be sure to check out how you can efficiently compute the gradient for

a batch from the last slide of Lecture 3. This can lead to a much faster

implementation (> 3 times faster) than looping through each training

example in the batch.

Everyone makes mistakes when computing gradients. Therefore you

must always check your analytic gradient computations against numerical estimations of the gradients!

Download code from the Canvas

webpage that computes the gradient vectors numerically. Note there

are two versions 1) a slower but more accurate version based on the

centered difference formula and 2) a faster but less accurate based on

the finite difference method.

You will probably have to make small

changes to the functions to make them compatible with your code. It

will take some time to run the numerical gradient calculations as your

network has 32 × 32 × 3 × 10 different parameters in W. Initially, you

should just perform your checks on mini-batches of size 1 and with no

regularization (lambda=0).

Afterwards you can increase the size of the

mini-batch and include regularization into the computations. Another

trick is that you should send in a reduced dimensionality version of

trainX and W, so instead of

[ngrad b, ngrad W] = ComputeGradsNumSlow(trainX(:, 1), trainY(:, 1),

W, b, lambda, 1e-6);

you can compute the gradients numerically for smaller inputs (the first

20 dimensions of the first training example)

[ngrad b, ngrad W] = ComputeGradsNumSlow(trainX(1:20, 1), trainY(:, 1),

W(:, 1:20), b, lambda, 1e-6);

You should then also compute your analytical gradients on this reduced

version of the input data with reduced dimensionality. This will speed

up computations and also reduce the risk of numerical precision issues

(very possible when the full W is initialized with very small numbers

and trainX also contains small numbers).

You could compare the numerically and analytically computed gradient vectors (matrices) by examining their absolute differences and

declaring, if all these absolutes difference are small (<1e-6), then they

have produced the same result. However, when the gradient vectors

have small values this approach may fail.

A more reliable method is to

compute the relative error between a numerically computed gradient

value gn and an analytically computed gradient value ga

|ga − gn|

max(eps, |ga| + |gn|)

where eps a very small positive number

and check this is small.

There are potentially more issues that can

plague numerical gradient checking (especially when you start to train

deep rectifier networks), so I suggest you read the relevant section of

the Additional material for lecture 3 from Standford’s course Convolutional Neural Networks for Visual Recognition for a more

thorough exposition especially for the subsequent assignments.

Do not continue with the rest of this assignment until you are sure

your analytic gradient code is correct. If you are having problems, set

the seed of the random number generator with the command rng to

ensure at each test W and b have the same values and double/triple

check that you have a correct implementation of the gradient equations

from the lecture notes.

7. Once you have the gradient computations debugged you are now ready

to write the code to perform the mini-batch gradient descent algorithm

to learn the network’s parameters where the updates are defined in

equations (8, 9).

You have a couple of parameters controlling the

learning algorithm (for this assignment you will just implement the

most vanilla version of the mini-batch gradient descent algorithm, with

no adaptive tuning of the learning rate or momentum terms):

• n batch the size of the mini-batches

• eta the learning rate

• n epochs the number of runs through the whole training set.

As the images in the CIFAR-10 dataset are in random order, the easiest to generate each mini-batch is to just run through the images

sequentially. Let n batch be the number of images in a mini-batch.

Then for one epoch (a complete run through all the training images),

you can generate the set of mini-batches with this snippet of code:

for j=1:N/n batch

j start = (j-1)*n batch + 1;

j end = j*n batch;

inds = j start:j end;

Xbatch = Xtrain(:, j start:j end);

Ybatch = Ytrain(:, j start:j end);

end

(A slight upgrade of this default implementation is to randomly shuffle

your training examples before each epoch. One efficient way to do this

is via the command randperm which when given the input N returns a

vector containing a random permutation of the integers 1:N.)

I suggest

the mini-batch learning function has these inputs and outputs

function [Wstar, bstar] = MiniBatchGD(X, Y, GDparams, W, b, lambda)

where X contains all the training images, Y the labels for the training

images, W, b are the initial values for the network’s parameters, lambda

is the regularization factor in the cost function and

• GDparams is an object containing the parameter values n batch, eta

and n epochs

For my initial experiments I set n batch=100, eta=.01, n epochs=20

and lambda=0. To help you debug I suggest that after each epoch you

compute the cost function and print it out (and save it) on all the

training data.

For these parameter settings you should see that the

training cost decreases for each epoch. After the first epoch my cost

score on all the training data was 2.0276 where I had set the random

number seed generator to rng(400) and I had initialized the weight

matrix before the bias vector. In figure 7 you can see the training cost

score when I run these parameter settings for 40 epochs. The cost

score on the validation set is plotted in red in the same figure.

(Note: in Tensorflow and other software packages they count in the

number of update steps as opposed to the number of training epochs.

Given N training images and mini-batches of size n batch then one

epoch will result in N/n batch update steps of the parameter values.

Obviously, the performance of the resulting network depends on the

number of update steps and how much training data is used. Therefore

to make fair comparisons one should make sure the number of update

steps is consistent across runs – for instance if you change the size of

the mini-batch, number of training images, etc.. you may need to run

more or fewer epochs of training.)

10 20 30 40

1.7

1.8

1.9

2

epoch

loss training loss

validation loss

Figure 3: The graph of the training and validation loss computed after every epoch.

The network was trained with the following parameter settings: n batch=100,

eta=.01, n epochs=40 and lambda=0.

When you have finished training you can compute the accuracy of

your learnt classifier on the test data. My network achieves (after

40 epochs) an accuracy of 36.69% (without a random shuffling of the

training example at the beginning of each epoch). Much better than

random but not great. Hopefully, we will achieve improved accuracies

in the assignments to come when we build and train more complex

notworks.

After training you can also visualization the weight matrix W as an image and see what class templates your network has learnt. Figure

4 shows the templates my network learnt. Here is a code snippet to

re-arrange each row of W (assuming W is a 10×d matrix) into a set of

images that can be displayed by Matlab.

for i=1:10

im = reshape(W(i, :), 32, 32, 3);

s im{i} = (im – min(im(:))) / (max(im(:)) – min(im(:)));

s im{i} = permute(s im{i}, [2, 1, 3]);

end

Then you can use either imshow, imagesc or montage to display the

images in the cell array s im.

Figure 4: The learnt W matrix visualized as class template images. The network was

trained with the following parameter settings: n batch=100, eta=.01, n epochs=40

and lambda=0.

To complete the assignment:

In the DD2424 Canvas Assignment 1 page upload:

• The final (and cleaned up) version of the code you have written for

the assignment. If you have written the code in multiple files please

place them in one file for the uploaded version. This will make the life

easier for me and the TAs! Note we will not run your code.

• A pdf document containing a short report. Please do not zip your

code and pdf document into on file and upload the zip file. Then the

graders have to download the zip file and uncompress it and this is time

consuming. Here you should state whether you successfully managed

to write the functions to correctly compute the gradient analytically,

what tests you ran to check against the numerically computed gradient and the results of these tests.

You should include the following

plots/figures

1. Graphs of the total loss and the cost function on the training data and

the validation data after each epoch of the mini-batch gradient descent

algorithm.

2. Images representing the learnt weight matrix after the completion of

training.

for the following parameter settings

– lambda=0, n epochs=40, n batch=100, eta=.1

– lambda=0, n epochs=40, n batch=100, eta=.01

– lambda=.1, n epochs=40, n batch=100, eta=.01

– lambda=1, n epochs=40, n batch=100, eta=.01

You should also report the final test accuracy your network achieves

after each of these training runs. You should also make a short comment on the effect of increasing the amount of regularization and the

importance of the correct learning rate.

## Exercise 2: Optional for Bonus Points

1. Optimize the performance of the network It would be interesting to discover what is the best possible performance achievable by

Assignment 1’s network on CIFAR-10. Here are some tricks/avenues

you can explore to help bump up performance (most of the same tricks

you can use for more complicated networks and they would probably

have more of an effect for those networks):

(a) Use all the available training data for training (all five batches minus a small

subset of the training images for a validation set). Decrease the size of the

validation set down to ∼ 1000.

(b) Train for a longer time and use your validation set to make sure you don’t

overfit or to keep a record of the best model before you begin to overfit.

(c) Do a grid search to find good values for the amount of regularization, the

learning rate and the batch size. There is some significant empirical evidence

that training with smaller batch sizes when using SGD training leads to better

generalization.

(d) Play around with decaying the learning rate by a factor ∼ .9 after each epoch.

Or you can decay the learning rate by a factor of 10 after every nth epoch.

(e) Implement Xavier initialization and comment if it stabilizes training.

(f) Augment your training data by applying small random geometric and photometric jitter to the original training data. You can do this on the fly by

applying a random jitter to each image in the mini-batch before doing the

forward and backward pass.

(g) Shuffle the order of your training examples at the beginning of every epoch.

(h) Train several networks using different initializations of the network’s parameters. Then apply the ensemble to a test image and use the majority vote as

the prediction.

Bonus Points Available: 1 point (if you complete at least 3 improvements – you can follow my suggestions and/or think of your own or

some combination.)

2. Train network by minimizing the SVM multi-class loss instead

of the cross-entropy loss. Of course, you will have to write a new

function to compute the gradients.

Bonus Points Available: 2 points

To get the bonus point(s) you must upload the following to the Canvas page

Assignment 1 Bonus Points:

1. Your code.

2. A Pdf document which

– reports on your trained network with the best test accuracy,

what improvements you made and which ones brought the largest

gains. (Exercise 2.1)

– compares the test accuracy of the network trained with the SVM

loss compared to the cross-entropy loss (for several sensible training parameter settings). (Exercise 2.2)