Description
In this assignment you will train and test a one layer network with multiple outputs to classify images from the CIFAR-10 dataset. You will train
the network using mini-batch gradient descent applied to a cost function
that computes the cross-entropy loss of the classifier applied to the labelled
training data and an L2 regularization term on the weight matrix.
Background 1: Mathematical background
The mathematical details of the network are as follows. Given an input
vector, x, of size d × 1 our classifier outputs a vector of probabilities, p
(K × 1), for each possible output label:
s = Wx + b (1)
p = SOFTMAX(s) (2)
where the matrix W has size K × d, the vector b is K × 1 and SOFTMAX is
defined as
SOFTMAX(s) = exp(s)
1
T exp(s)
(3)
The predicted class corresponds to the label with the highest probability:
k
∗ = arg max
1≤k≤K
{p1, . . . , pK} (4)
x z s p
W b
Wx z + b softmax(s)
x z s p l
W b y
Wx z + b softmax(s) − log(y
T p)
a) Classification function b) Loss function
Figure 1: Computational graph of the classification and loss function that is applied to each input x in this assignment.
The parameters W and b of our classifier are what we have to learn by
exploiting labelled training data. Let D = {(xi
, yi)}
n
i=1, with each yi ∈
{1, . . . , K} and xi ∈ R
d
, represent our labelled training data. In the lectures
we have described how to set the parameters by minimizing the cross-entropy
loss plus a regularization term on W.
Mathematically this cost function is
J(D, λ, W, b) = 1
|D|
X
(x,y)∈D
lcross(x, y, W, b) + λ
X
i,j
W2
ij (5)
1
where
lcross(x, y, W, b) = − log(py) (6)
and p has been calculated using equations (1, 2). (Note if the label is
encoded by a one-hot representation then the cross-entropy loss is defined
as − log(y
Tp).) The optimization problem we have to solve is
W∗
, b
∗ = arg min
W,b
J(D, λ, W, b) (7)
In this assignment (as described in the lectures) we will solve this optimization problem via mini-batch gradient descent.
For mini-batch gradient descent we begin with a sensible random initialization of the parameters W, b and we then update our estimate for the
parameters with
W(t+1) = W(t) − η
∂J(B
(t+1), λ, W, b)
∂W
W=W(t)
,b=b(t)
(8)
b
(t+1) = b
(t) − η
∂J(B
(t+1), λ, W, b)
∂b
W=W(t)
,b=b(t)
(9)
where η is the learning rate and B
(t+1) is called a mini-batch and is a random
subset of the training data D and
∂J(B
(t+1), λ, W, b)
∂W =
1
|B(t+1)|
X
(x,y)∈B(t+1)
∂lcross(x, y, W, b)
∂W + 2λW (10)
∂J(B
(t+1), λ, W, b)
∂b
=
1
|B(t+1)|
X
(x,y)∈B(t+1)
∂lcross(x, y, W, b)
∂b
(11)
To compute the relevant gradients for the mini-batch, we then have to compute the gradient of the loss w.r.t. each training example in the mini-batch.
You should refer to the lecture notes for the explicit description of how to
compute these gradients.
Before Starting
I assume that you will complete the assignment in Matlab. You can complete
the assignment in another programming language. If you do though I will
not answer programming specific questions and you will also probably have
to find a way to display, plot and graph your results.
Besides invoking Matlab commands, you will be required to run a few operating system commands. For these commands I will assume your computer’s
2
x z s p l J
W b y λ
r
Wx z + b softmax(s) − log(y
T p) l + λr
kWk
2
Figure 2: Computational graph of the cost function applied to a mini-batch containing one training example x. If you have a mini-batch of size greater than one,
then the loss computations are repeated for each entry in the mini-batch (as in the
above graph), but the regularization term is only computed once.
operating system is either linux or unix. If otherwise, you’ll have to fend for
yourself. But all the non-Matlab commands needed are more-or-less trivial.
The notes for this assignment, and those to follow, will give you pointers
about which Matlab commands to use. However, I will not give detailed
explanations about their usage.
I assume you have some previous experience
with Matlab, are aware of many of the in-built functions, how to manipulate
vectors and matrices and how to write your own functions etc. Keep in
mind the function help can be called to obtain information about particular
functions.
So for example
>> help plot
will display information about the Matlab command plot.
Background 2: Getting Started
Set up your environment
Create a new directory to hold all the Matlab files you will write for this
course:
$ mkdir DirName
$ cd DirName
$ mkdir Datasets
$ mkdir Result Pics
Download the CIFAR-10 dataset stored in its Matlab format from this link.
Move the cifar-10-matlab.tar.gz file to the Datasets directory you have
just created, untar the file and then move up to the parent directory. Also
download the file montage.m from the KTH Social website and move it to
DirName.
3
$ mv montage.m DirName/
$ mv cifar-10-matlab.tar.gz DirName/Datasets
$ cd DirName/Datasets
$ tar xvfz cifar-10-matlab.tar.gz
$ cd ..
Background 3: Useful Display Function
You have copied a function called montage.m, which is a slightly modified
version of the function available at:
http://www.mathworks.com/matlabcentral/fileexchange/22387
This is a useful function as it allows you to efficiently view the images in a
directory or in a Matlab array or a cell array.
To look at some of the images
from the CIFAR-10 dataset you can use the following set of commands:
>> addpath DirName/Datasets/cifar-10-batches-mat/;
>> A = load(’data batch 1.mat’);
>> I = reshape(A.data’, 32, 32, 3, 10000);
>> I = permute(I, [2, 1, 3, 4]);
>> montage(I(:, :, :, 1:500), ’Size’, [5,5]);
This sequence of commands tells Matlab to add the directory of the CIFAR-10
dataset to its path. Then it loads one of the .mat containing image and
label data. You access the image data with the command A.data. It has
size 10000 × 3072.
Each row of A.data corresponds to an image of size
32 × 32 × 3 that has been flattened into a row vector. We can re-arrange
A.data into an array format expected by montage (32×32×3×10000) using
the command reshape. After reshaping the array, the rows and columns of
each image still need to be permuted and this is achieved with the permute
command.
Now you have an array format montage expects. In the above
code we just view the first 500 images. Use help to find out the different
ways montage can be called.
You have looked at some of the CIFAR-10 images. Now it is time to start
writing some code.
Exercise 1: Training a multi-linear classifier
For this assignment you will just use data in the file data batch 1.mat for
training, the file data batch 2.mat for validation and the file test batch.mat
for testing. Create a file Assignment1.m. In this file you will write the code
for this assignment and the necessary (sub-)functions.
Here are my recommendations for which functions to write and the order in which to write
them:
1. Write a function that reads in the data from a CIFAR-10 batch file
and returns the image and label data in separate files. Make sure to
convert your image data to single or double format and divide it by
255 to convert the data to be between 0 and 1. I would suggest the
function has the following input and outputs
function [X, Y, y] = LoadBatch(filename)
where
• X contains the image pixel data, has size d×N, is of type double or
single and has entries between 0 and 1. N is the number of images
(10000) and d the dimensionality of each image (3072=32×32×3).
• Y is K×N (K= # of labels = 10) and contains the one-hot representation
of the label for each image.
• y is a vector of length N containing the label for each image. A note
of caution. CIFAR-10 encodes the labels as integers between 0-9 but
Matlab indexes matrices and vectors starting at 1. Therefore it may be
easier to encode the labels between 1-10.
This file will not be long. You just need to call A = load(fname);
and then rearrange and store the values in A.data and A.labels.
Top-level: Read in and store the training, validation and test data.
2. Top-Level: After reading in the data, you can initialize the parameters of the model W and b as you now know what size they should be.
W has size K×d and b is K×1. Initialize each entry to have Gaussian
random values with zero mean and standard deviation .01. You should
use the Matlab function randn to create this data.
3. Write a function that evaluates the network function, i.e. equations
(1, 2), on multiple images and returns the results. I would suggest the
function has the following form
function P = EvaluateClassifier(X, W, b)
where
• each column of X corresponds to an image and it has size d×n.
• W and b are the parameters of the network.
• each column of P contains the probability for each label for the image
in the corresponding column of X. P has size K×n.
Top-level: Check the function runs on a subset of the training data
given a random initialization of the network’s parameters:
P = EvaluateClassifier(trainX(:, 1:100), W, b).
4. Write the function that computes the cost function given by equation
(5) for a set of images. I suggest the function has the following inputs
and outputs
function J = ComputeCost(X, Y, W, b, lambda)
where
• each column of X corresponds to an image and X has size d×n.
• each column of Y (K×n) is the one-hot ground truth label for the corresponding column of X or Y is the (1×n) vector of ground truth labels.
• J is a scalar corresponding to the sum of the loss of the network’s
predictions for the images in X relative to the ground truth labels and
the regularization term on W.
5. Write a function that computes the accuracy of the network’s predictions given by equation (4) on a set of data. Remember the accuracy
of a classifier for a given set of examples is the percentage of examples
for which it gets the correct answer.
I suggest the function has the
following inputs and outputs
function acc = ComputeAccuracy(X, y, W, b)
where
• each column of X corresponds to an image and X has size d×n.
• y is the vector of ground truth labels of length n.
• acc is a scalar value containing the accuracy.
6. Write the function that evaluates, for a mini-batch, the gradients of
the cost function w.r.t. W and b, that is equations (10, 11). I suggest
the function has the form
function [grad W, grad b] = ComputeGradients(X, Y, P, W, lambda)
where
• each column of X corresponds to an image and it has size d×n.
• each column of Y (K×n) is the one-hot ground truth label for the corresponding column of X.
• each column of P contains the probability for each label for the image
in the corresponding column of X. P has size K×n.
• grad W is the gradient matrix of the cost J relative to W and has size
K×d.
• grad b is the gradient vector of the cost J relative to b and has size
K×1.
Be sure to check out how you can efficiently compute the gradient for
a batch from the last slide of Lecture 3. This can lead to a much faster
implementation (> 3 times faster) than looping through each training
example in the batch.
Everyone makes mistakes when computing gradients. Therefore you
must always check your analytic gradient computations against numerical estimations of the gradients!
Download code from the Canvas
webpage that computes the gradient vectors numerically. Note there
are two versions 1) a slower but more accurate version based on the
centered difference formula and 2) a faster but less accurate based on
the finite difference method.
You will probably have to make small
changes to the functions to make them compatible with your code. It
will take some time to run the numerical gradient calculations as your
network has 32 × 32 × 3 × 10 different parameters in W. Initially, you
should just perform your checks on mini-batches of size 1 and with no
regularization (lambda=0).
Afterwards you can increase the size of the
mini-batch and include regularization into the computations. Another
trick is that you should send in a reduced dimensionality version of
trainX and W, so instead of
[ngrad b, ngrad W] = ComputeGradsNumSlow(trainX(:, 1), trainY(:, 1),
W, b, lambda, 1e-6);
you can compute the gradients numerically for smaller inputs (the first
20 dimensions of the first training example)
[ngrad b, ngrad W] = ComputeGradsNumSlow(trainX(1:20, 1), trainY(:, 1),
W(:, 1:20), b, lambda, 1e-6);
You should then also compute your analytical gradients on this reduced
version of the input data with reduced dimensionality. This will speed
up computations and also reduce the risk of numerical precision issues
(very possible when the full W is initialized with very small numbers
and trainX also contains small numbers).
You could compare the numerically and analytically computed gradient vectors (matrices) by examining their absolute differences and
declaring, if all these absolutes difference are small (<1e-6), then they
have produced the same result. However, when the gradient vectors
have small values this approach may fail.
A more reliable method is to
compute the relative error between a numerically computed gradient
value gn and an analytically computed gradient value ga
|ga − gn|
max(eps, |ga| + |gn|)
where eps a very small positive number
and check this is small.
There are potentially more issues that can
plague numerical gradient checking (especially when you start to train
deep rectifier networks), so I suggest you read the relevant section of
the Additional material for lecture 3 from Standford’s course Convolutional Neural Networks for Visual Recognition for a more
thorough exposition especially for the subsequent assignments.
Do not continue with the rest of this assignment until you are sure
your analytic gradient code is correct. If you are having problems, set
the seed of the random number generator with the command rng to
ensure at each test W and b have the same values and double/triple
check that you have a correct implementation of the gradient equations
from the lecture notes.
7. Once you have the gradient computations debugged you are now ready
to write the code to perform the mini-batch gradient descent algorithm
to learn the network’s parameters where the updates are defined in
equations (8, 9).
You have a couple of parameters controlling the
learning algorithm (for this assignment you will just implement the
most vanilla version of the mini-batch gradient descent algorithm, with
no adaptive tuning of the learning rate or momentum terms):
• n batch the size of the mini-batches
• eta the learning rate
• n epochs the number of runs through the whole training set.
As the images in the CIFAR-10 dataset are in random order, the easiest to generate each mini-batch is to just run through the images
sequentially. Let n batch be the number of images in a mini-batch.
Then for one epoch (a complete run through all the training images),
you can generate the set of mini-batches with this snippet of code:
for j=1:N/n batch
j start = (j-1)*n batch + 1;
j end = j*n batch;
inds = j start:j end;
Xbatch = Xtrain(:, j start:j end);
Ybatch = Ytrain(:, j start:j end);
end
(A slight upgrade of this default implementation is to randomly shuffle
your training examples before each epoch. One efficient way to do this
is via the command randperm which when given the input N returns a
vector containing a random permutation of the integers 1:N.)
I suggest
the mini-batch learning function has these inputs and outputs
function [Wstar, bstar] = MiniBatchGD(X, Y, GDparams, W, b, lambda)
where X contains all the training images, Y the labels for the training
images, W, b are the initial values for the network’s parameters, lambda
is the regularization factor in the cost function and
• GDparams is an object containing the parameter values n batch, eta
and n epochs
For my initial experiments I set n batch=100, eta=.01, n epochs=20
and lambda=0. To help you debug I suggest that after each epoch you
compute the cost function and print it out (and save it) on all the
training data.
For these parameter settings you should see that the
training cost decreases for each epoch. After the first epoch my cost
score on all the training data was 2.0276 where I had set the random
number seed generator to rng(400) and I had initialized the weight
matrix before the bias vector. In figure 7 you can see the training cost
score when I run these parameter settings for 40 epochs. The cost
score on the validation set is plotted in red in the same figure.
(Note: in Tensorflow and other software packages they count in the
number of update steps as opposed to the number of training epochs.
Given N training images and mini-batches of size n batch then one
epoch will result in N/n batch update steps of the parameter values.
Obviously, the performance of the resulting network depends on the
number of update steps and how much training data is used. Therefore
to make fair comparisons one should make sure the number of update
steps is consistent across runs – for instance if you change the size of
the mini-batch, number of training images, etc.. you may need to run
more or fewer epochs of training.)
10 20 30 40
1.7
1.8
1.9
2
epoch
loss training loss
validation loss
Figure 3: The graph of the training and validation loss computed after every epoch.
The network was trained with the following parameter settings: n batch=100,
eta=.01, n epochs=40 and lambda=0.
When you have finished training you can compute the accuracy of
your learnt classifier on the test data. My network achieves (after
40 epochs) an accuracy of 36.69% (without a random shuffling of the
training example at the beginning of each epoch). Much better than
random but not great. Hopefully, we will achieve improved accuracies
in the assignments to come when we build and train more complex
notworks.
After training you can also visualization the weight matrix W as an image and see what class templates your network has learnt. Figure
4 shows the templates my network learnt. Here is a code snippet to
re-arrange each row of W (assuming W is a 10×d matrix) into a set of
images that can be displayed by Matlab.
for i=1:10
im = reshape(W(i, :), 32, 32, 3);
s im{i} = (im – min(im(:))) / (max(im(:)) – min(im(:)));
s im{i} = permute(s im{i}, [2, 1, 3]);
end
Then you can use either imshow, imagesc or montage to display the
images in the cell array s im.
Figure 4: The learnt W matrix visualized as class template images. The network was
trained with the following parameter settings: n batch=100, eta=.01, n epochs=40
and lambda=0.
To complete the assignment:
In the DD2424 Canvas Assignment 1 page upload:
• The final (and cleaned up) version of the code you have written for
the assignment. If you have written the code in multiple files please
place them in one file for the uploaded version. This will make the life
easier for me and the TAs! Note we will not run your code.
• A pdf document containing a short report. Please do not zip your
code and pdf document into on file and upload the zip file. Then the
graders have to download the zip file and uncompress it and this is time
consuming. Here you should state whether you successfully managed
to write the functions to correctly compute the gradient analytically,
what tests you ran to check against the numerically computed gradient and the results of these tests.
You should include the following
plots/figures
1. Graphs of the total loss and the cost function on the training data and
the validation data after each epoch of the mini-batch gradient descent
algorithm.
2. Images representing the learnt weight matrix after the completion of
training.
for the following parameter settings
– lambda=0, n epochs=40, n batch=100, eta=.1
– lambda=0, n epochs=40, n batch=100, eta=.01
– lambda=.1, n epochs=40, n batch=100, eta=.01
– lambda=1, n epochs=40, n batch=100, eta=.01
You should also report the final test accuracy your network achieves
after each of these training runs. You should also make a short comment on the effect of increasing the amount of regularization and the
importance of the correct learning rate.
Exercise 2: Optional for Bonus Points
1. Optimize the performance of the network It would be interesting to discover what is the best possible performance achievable by
Assignment 1’s network on CIFAR-10. Here are some tricks/avenues
you can explore to help bump up performance (most of the same tricks
you can use for more complicated networks and they would probably
have more of an effect for those networks):
(a) Use all the available training data for training (all five batches minus a small
subset of the training images for a validation set). Decrease the size of the
validation set down to ∼ 1000.
(b) Train for a longer time and use your validation set to make sure you don’t
overfit or to keep a record of the best model before you begin to overfit.
(c) Do a grid search to find good values for the amount of regularization, the
learning rate and the batch size. There is some significant empirical evidence
that training with smaller batch sizes when using SGD training leads to better
generalization.
(d) Play around with decaying the learning rate by a factor ∼ .9 after each epoch.
Or you can decay the learning rate by a factor of 10 after every nth epoch.
(e) Implement Xavier initialization and comment if it stabilizes training.
(f) Augment your training data by applying small random geometric and photometric jitter to the original training data. You can do this on the fly by
applying a random jitter to each image in the mini-batch before doing the
forward and backward pass.
(g) Shuffle the order of your training examples at the beginning of every epoch.
(h) Train several networks using different initializations of the network’s parameters. Then apply the ensemble to a test image and use the majority vote as
the prediction.
Bonus Points Available: 1 point (if you complete at least 3 improvements – you can follow my suggestions and/or think of your own or
some combination.)
2. Train network by minimizing the SVM multi-class loss instead
of the cross-entropy loss. Of course, you will have to write a new
function to compute the gradients.
Bonus Points Available: 2 points
To get the bonus point(s) you must upload the following to the Canvas page
Assignment 1 Bonus Points:
1. Your code.
2. A Pdf document which
– reports on your trained network with the best test accuracy,
what improvements you made and which ones brought the largest
gains. (Exercise 2.1)
– compares the test accuracy of the network trained with the SVM
loss compared to the cross-entropy loss (for several sensible training parameter settings). (Exercise 2.2)