Description
In this assignment you will train a ConvNet to predict the language of a
surname from its spelling. The inspiration for the assignment’s task is the
tutorial Classifying Names with a Character-Level RNN.
As in this tutorial
to make the task slightly harder I removed all accented letters, frequently
language specific, and replaced them with their non-accented version. We
are thus left with the English spelling of each name. As an example the
character S was replaced by S. To a native English speaker this seems per- ˇ
fectly reasonable, but I’m aware that this may seem odd to everybody else.
Apologies!
The inspiration for applying a ConvNet to a string of characters is the paper
Character-level Convolutional Networks for Text Classification and in this
assignment you will implement a simplified version of the network they use.
The final version of your code should contain these major components:
• Preparing Data: Read in the training data, determine the maximum
length of all the names in the dataset, determine the number of unique
characters in the text and set up mapping functions – one mapping
each character to a unique index and another mapping each index to
a character.
• Back-propagation: The forward and the backward pass of the backpropagation algorithm for a ConvNet (with multiple convolutional layers and one fully connected layer) to efficiently compute the gradients
for a mini-batch.
• Mini-batch gradient descent with momentum to update your
ConvNet’s parameters.
• Sampling of the training data during back-prop training to compensate for the fact that the training data contains unbalanced classes.
For example one language has over 9, 000 examples while others have
less than one hundred.
• Functions to compute the accuracy, loss and confusion matrix on
the validation set given the current values of the ConvNet’s parameters.
Background 1: A two layer ConvNet
The mathematical details of the ConvNet you will implement are now described. Each input in its original form is a fixed length sequence of nlen
letters. Each letter will be represented by a one-hot encoding vector of
length d. Each name, therefore will be encoded as a matrix of size d × nlen
corresponding to the concatenation of the one-hot (column) vectors for each
letter in the name.
(If a name has less then nlen letters then we will pad
the input matrix with column vectors of 0’s until the input matrix has nlen
columns.) Let X represent this input matrix.
The network function that
you will apply to the input X is:
x
(1)
i = max(0, X ∗ F1i) for i = 1, . . . , n1 (1)
X(1) =
x
(1)
1
, x
(1)
2
, . . . , x
(1)
n1
T
(2)
x
(2)
i = max(0, X(1) ∗ F2i) for i = 1, . . . , n2 (3)
X(2) =
x
(2)
1
, x
(2)
2
, . . . , x
(2)
n2
T
(4)
s = Wvec(X(2)) (5)
p = SoftMax(s) (6)
where
• The d × nlen input matrix X is sparse and has at most one non-zero
(=1) element per column.
• Each filter F1i has size d×k1 and is applied with stride 1 and no zeropadding. We denote the set of layer 1 filters as F1 = {F1i
, . . . F1,n1
}.
• Each x
(1)
i
has size nlen1 × 1 where nlen1 = nlen − k1 + 1 which =⇒
X(1) has size n1 ×nlen1
. Each row of X(1) corresponds to the output of
convolving the input with one of the filters. We use this arrangement
of the data to be consistent with Classifying Names with a CharacterLevel RNN.
• Each filter F2i has size n1×k2 and is applied with stride 1 and no zeropadding. We denote the set of layer 2 filters with F2 = {F2i
, . . . F2,n2
}.
• Each x
(2)
i
has size nlen2 × 1 where nlen2 = nlen1 − k2 + 1 which =⇒
X(2) has size n2 × nlen2
.
• W, the weight matrix for the fully connected layer, has size K×n2∗nlen2
=⇒ s has size K × 1.
• In the lecture notes I have the convention that the operator vec(·)
makes a vector by traversing the elements of the matrix row by row
from left to right.
However, in this assignment to make things consistent with Matlab’s (:) operation (and how it stores data matrix in
memory) we will assume that vec(·) makes a vector by traversing the
elements column by column from top to bottom. This simple example
illustrates:
X =
X11 X12
X21 X22
=⇒ vec(X) =
X11
X21
X12
X22
I have omitted all bias terms in the network for the sake of clarity and
simplicity of implementation.
Background 2: Writing the convolution as a matrix multiplication I
To make the back-propagation algorithm transparent and relatively efficient
(as we will be running this on CPU not a GPU and to take computational
advantage of the sparse input data) we will set up the convolutions performed as matrix multiplications.
In the following we will show how you
can set up the appropriate matrix based on the entries of the applied filter
for a small example. Let X be a 4 × 4 input matrix and F be a 4 × 2 filter:
X =
X11 X12 X13 X14
X21 X22 X23 X24
X31 X32 X33 X34
X41 X42 X43 X44
, F =
F11 F12
F21 F22
F31 F32
F41 F42
. (7)
When we convolve X with F (stride 1 and no zero-padding), the output
vector has size 3 × 1.
s = X ∗ F =⇒ s = Mfilter
F,4 vec(X) (8)
where Mfilter
F,4
is 3 × 16 and the subscript 4 denotes the number of rows in X.
The latter is needed to specify the number of columns in Mfilter
F,4
given the
width of F. Thus
Mfilter
F,4 =
F11 F21 F31 F41 F12 F22 F32 F42 0 0 0 0 0 0 0 0
0 0 0 0 F11 F21 F31 F41 F12 F22 F32 F42 0 0 0 0
0 0 0 0 0 0 0 0 F11 F21 F31 F41 F12 F22 F32 F42!
(9)
=
← v
T → 0
T
4 0
T
4
0
T
4 ← v
T → 0
T
4
0
T
4 0
T
4 ← v
T →
(10)
where v = vec(F). Hopefully you can glean the pattern for the construction
of Mfilter
F,nlen
when X has size d × nlen and F has size d × k (where k ≤ nlen).
In the assignment you will apply multiple filters at each convolutional layer.
Continuing with our simple example let’s say we apply 2 filters F1, F2:
si = X ∗ Fi for i = 2 (11)
and these are stacked to get the response map:
S =
s1, s2
T
. (12)
We can apply these two convolutions with this matrix multiplication
vec(S) =
← VF1,F2 → 02×4 02×4
02×4 ← VF1,F2 → 02×4
02×4 02×4 ← VF1,F2 →
= Mfilter
F1,F2,4vec(X) (13)
where VF1,F2
is 2 × 8 matrix with
VF1,F2 =
vec(F1)
T
vec(F2)
T
!
(14)
You can, of course, then apply nf filters to vec(X) with this matrix:
Mfilter
F,4 =
←− VF −→ 0nf ×4 0nf ×4
0nf ×4 ←− VF −→ 0nf ×4
0nf ×4 0nf ×4 ←− VF −→
(15)
where F = {F1, . . . , Fnf
} and VF is a nf × 8 matrix with
VF =
vec(F1)
T
.
.
.
vec(Fnf
)
T
(16)
Benefits of this formulation
Efficient forward pass of back-prop During training we need to construct Mfilter
Fi,nleni−1
for each layer i after each update of the parameters. However, once we have this matrix it is trivial to apply it to a mini-batch of
data.
For a mini-batch of input training data, X1, . . . , Xn then
vec(S1) . . . vec(Sn)
= Mfilter
Fi,nleni−1
vec(X1) . . . vec(Xn)
(17)
where each Sj is the output of convolving Xj with the filters Fi = {Fi1, . . . , Fi,ni
}.
Simple and efficient formulation to propagate gradient through a
convolution As was shown in the lectures if we write the convolution as
a matrix multiplication in the form just described then the gradient of the
loss w.r.t. X(1) can be written as
∂l
∂vec(X(1))
=
∂l
∂vec(S(2))
Mfilter
F2,nlen1
(18)
This matrix multiplication has to be performed for each input in the minibatch and is done so efficiently with this formulation.
Background 3: Writing the convolution as a matrix multiplication II
For the back-propagation algorithm we will also need to write the convolution as a matrix multiplication where the matrix is made up of the elements
of the input data. Let’s go back to our example where X is the 4 × 4 input
matrix and F is 4 × 2 filter (see equation (7)). When we convolve X with
F (stride 1 and no zero-padding) the output vector has size 3 × 1. We can
write this convolution as a matrix convolution:
s = X ∗ F =⇒ s = Minput
X,2
vec(F) (19)
where Minput
X,2
is a matrix of size 3 × 8 where the subscript 2 corresponds to
the width of the filter F and
Minput
X,2 =
X11 X21 X31 X41 X12 X22 X32 X42
X12 X22 X32 X42 X13 X23 X33 X43
X13 X23 X33 X43 X14 X24 X34 X44
(20)
=
vec(X:,1:2)
T
vec(X:,2:3)
T
vec(X:,3:4)
T
(21)
The notation X:,1:2 indicates the 4×2 sub-matrix of X corresponding to the
columns 1 and 2 of X. Conceptually the rows of Minput
X,2
correspond to the
all sub-blocks of X which are involved with a dot-product with F when the
convolution with stride 1 and no zero-padding is applied.
In the assignment you will apply multiple filters at each convolutional layer.
As in the previous section first consider the case where 2 filters F1 and F2
are applied (see equation (11)):
vec(S) =
vec(X:,1:2)
T 08
08 vec(X:,1:2)
T
vec(X:,2:3)
T 08
08 vec(X:,2:3)
T
vec(X:,3:4)
T 08
08 vec(X:,3:4)
T
vec(F1)
vec(F2)
!
= Minput
X,2,2
vec(F) (22)
5
where the third subscript in Minput
X,2,2
refers to the number of filters we want to
apply.
Note we can use the Kroneckor product to streamline the notation:
Minput
X,2,2 =
I2 ⊗ vec(X:,1:2)
T
I2 ⊗ vec(X:,2:3)
T
I2 ⊗ vec(X:,3:4)
T
(23)
However, at implementation time equation (23) is a very slow way to calculate Minput
X,ki,ni
as it introduces lots of unnecessary multiplications by 0.
Though it is a useful formulation as you are probably more likely to write
a bug free implementation with this formulation than with a more efficient
calculation of Minput
X,ki,ni
.
We can then apply 3 filters with this matrix:
Minput
X,2,3 =
vec(X:,1:2)
T 0
T
8 0
T
8
0
T
8
vec(X:,1:2)
T 0
T
8
0
T
8 0
T
8
vec(X:,1:2)
T
vec(X:,2:3)
T 0
T
8 0
T
8
0
T
8
vec(X:,2:3)
T 0
T
8
0
T
8 0
T
8
vec(X:,2:3)
T
vec(X:,3:4)
T 0
T
8 0
T
8
0
T
8
vec(X:,3:4)
T 0
T
8
0
T
8 0
T
8
vec(X:,3:4)
T
(24)
=
I3 ⊗ vec(X:,1:2)
T
I3 ⊗ vec(X:,2:3)
T
I3 ⊗ vec(X:,3:4)
T
(25)
Hopefully now you begin to see the pattern and you can extrapolate what
the matrix Minput
X,2,nf
will be.
The purpose of this formulation is that we can write the gradient of the loss
w.r.t. filters, vec(Fi), at layer i, in our two layer network simply as
∂L
∂vec(Fi)
=
∂L
∂vec(S(i))
Minput
X(i−1),ki,ni
(26)
Background 4: Equations for the back-propagation algorithm
As per usual let L represent the cost function for the mini-batch of data B
that is
L(B, F1, F2, W) = 1
|B|
X
(x,y)∈B
lcross(x, y, F1, F2, W) (27)
6
and to learn the parameters F1, F2, W we need to compute the gradient of
L w.r.t. each one.
In the description of the back-propagation algorithm we will assume that
the d × n matrix Xbatch contains the input vectors for each training example
in the mini-batch. Thus each column of Xbatch corresponds to the vectorised
version of the corresponding input vector.
The notation in the following is
slightly inconsistent at times (but hopefully understandable). I refer to the
convolution matrix Minput
X,k2,n2
w.r.t. the vectorized version of input matrices
instead of the 2D array and a couple of times I use Matlab notation to denote
the column of a matrix due to fact that I already have too many sub and
superscripts to start including more!
Forward Pass
• Construct Mfilter
F1,nlen
for the first convolutional layer and Mfilter
F2,nlen1
for
the second convolutional layer.
• Apply the first convolutional layer to the input data
X
(1)
batch = max(Mfilter
F1,nlenXbatch, 0) (28)
• Apply the second convolutional layer
X
(2)
batch = max(Mfilter
F2,nlen1
X(1)
, 0) (29)
• Apply the fully connected matrix
Sbatch = W X(2)
batch (30)
• Apply the SoftMax operator to each column of Sbatch to get Pbatch.
For the backward pass of the back-prop algorithm you will need to keep a
record of X
(1)
batch, X(2)
batch, Pbatch.
Backward Pass
• Set all the entries in the gradient matrix and vectors, ∂L
∂W ,
∂L
∂vec(F1)
and
∂L
∂vec(F2)
to zero.
• Let
Gbatch = −(Ybatch − Pbatch) (31)
7
• Then the gradient of l w.r.t. W is given by
∂L
∂W =
1
n
Xn
j=1
Gbatch(:, j)X
(2)
batch(:, j)
T =
1
n
GbatchX
(2)T
batch (32)
• Propagate the gradient through the fully connected layer and the second ReLu function.
Gbatch = WTGbatch (33)
Gbatch = Gbatch Ind(X
(2)
batch > 0) (34)
where represents element-by-element multiplication (i.e. .∗ in Matlab
notation) of two matrices.
• Compute the gradient w.r.t. the second layer convolutional filters:
for j = 1, . . . , n
gj = Gbatch(:, j) (35)
xj = X(1)(:, j) (36)
v = g
T
j Minput
xj ,k2,n2
(37)
∂L
∂vec(F2)
=
∂L
∂vec(F2)
+
1
n
v (38)
• Propagate the gradient to the previous layer through the second convolutional layer and first ReLu operation:
Gbatch =
Mfilter
F2,
T Gbatch (39)
Gbatch = Gbatch Ind(X
(1)
batch > 0) (40)
• Compute the gradient w.r.t. the first layer convolutional filters:
for j = 1, . . . , n
gj = Gbatch(:, j) (41)
xj = Xbatch(:, j) (42)
v = g
T
j Minput
xj ,k1,n1
(43)
∂L
∂vec(F1)
=
∂L
∂vec(F1)
+
1
n
v (44)
The description of how to implement the back-prop algorithm I have given
is somewhat optimized for implementation in Matlab. In many of the deep
learning software packages the back-propagation algorithm is frequently implemented via a formulation similar to this. Though of course there is a
trade-off between speed and memory consumption. For a certain size of
problem it may not be possible to perform the back-prop as described.
Background 5: Extra measures for a more efficient implementation
The matrix multiplications performed in equations (37) and (43) are quite
computationally wasteful because each Minput
xj ,ki,ni
has lots of zeros (see equation 24). Thus there are few options to make the computations more efficient
• Pre-compute for the first layer the Minput
xj ,k1,n1
’s and make each one
sparse. This is worthwhile because each Minput
xj ,k1,n1
only depends on
the input data and remains fixed throughout training. Also remember
each input data-point is encoded as a sparse matrix/vector so Mxj ,k1,n1
is really very sparse and there are big performance gains to be made
with this simple step.
• Unfortunately each Minput
xj ,k2,n2
cannot be pre-computed, made sparse
and then stored because each one is constructed from the responses
after applying the first convolution layer and this will change every
time the network’s parameters change. Also the vectors xj at layer 1
are not necessarily sparse.
However, do not despair you can still make
improvements. Instead of creating Minput
xj ,k2,n2
you should create Minput
xj ,k2
(the generalization of equation (21) to filters of width k2 and input of
size n1 × nlen1
) and then you can write
v = g
T
j Minput
xj ,k2,n2
(45)
equivalently as
V = MT
xj ,k2
gj,1 gj,2 · · · gj,n2
gj,1+n2 gj,2+n2
· · · gj,n2+n2
.
.
.
.
.
.
.
.
.
.
.
.
gj,1+(nlen2−1)n2
gj,2+(nlen2−1)n2
· · · gj,n2+(nlen2−1)n2
(46)
s.t. vec(V ) = v. It is left as an exercise to show that this is true.
Background 6: Compensating for unbalanced training data
In this assignment you encounter unbalanced training data for the first time.
If you do not compensate for this fact you run the risk that your ConvNet
will just learn to classify the dominating classes correctly. There are several
9
practical ways you can overcome this problem. One way is to define a new
loss function for a mini-batch:
Lupsampled(B, F1, F2, W) = 1
|B|
X
(X,y)∈B
py l(X, y, F1, F2, W) (47)
where py is a weight applied to the loss for each individual training example in the batch. The value py has a constant value for each class and is
commonly set to
py =
1
K
1
ny
(48)
where ny is the total number of training examples (in the whole training set)
with label y and K is the number of classes. If you use this loss function then
you effectively down-weight the loss from the popular classes much more
aggressively than those from the rare classes.
The updated loss function
then transfers to the gradient calculations in a simple way, for example
∂Lupsampled
∂F1
=
1
|B|
X
(X,y)∈B
py
∂l(X, y, F1, F2, W)
∂F1
(49)
If you use this loss function you will still use the same back-prop equations
described in equations (31) – (44) except in equations (32), (38) and (44)
where you sum up the individual gradients of the examples in the batch. For
these equations you will have to include the re-weighting factor associated
with each class.
The other option is that for each epoch of training you randomly sample the
same number (this number is usually set to the number of examples from
the smallest class) of examples from each class and this becomes the effective
training set for this epoch. This latter option is what I implemented for the
assignment and it worked fine. You are free to choose either approach.
Exercise 1: Implement and train a two-layer ConvNet
In the following I will sketch the different parts you will need to write to
complete the assignment. Note it is a guideline. You can, of course, have
a different design, but you should read the outline to help inform how different parameters and design choices are made and how I have gone about
optimizing calculations that need to be computed.
0.1 Read in the data & get it ready
First you need to read in the dataset from the text file ascii names.txt
available the Canvas website. To save you some time I have written Matlab
code (available from the Canvas page) that will read in the contents of this
text file and put all the names and their corresponding labels in the cell
array all names and a vector ys respectively. The code also converts all
uppercase letters to lowercase.
You have access to all the characters used to generate the names. To get a
vector containing the unique characters you can apply this Matlab code:
C = unique(cell2mat(all names));
d = numel(C);
Once you have this list then its length d corresponds to the dimensionality
of the one-hot vector used to encode each character (d in the Background
1 section). You can also examine the contents in all names to get the
maximum length of name in the dataset and set n len to this value. You
can also use the unique function on ys to get the number of classes K we
are trying to predict (it is 18).
To allow you to easily go between a character and its one-hot encoding, you
should initialize a map container of the form:
char to ind = containers.Map(’KeyType’,’char’,’ValueType’,’int32’);
Then for char to ind you should fill in the characters in your alphabet
as its keys and create an integer for its value (keep things simple and use
where the character appears in the vector C as its value).
You will use this
map container to convert a name into a matrix of size d × n len. The
ith column of this matrix should correspond to the one-hot encoding of the
ith letter in the name. If a name has length less then n len then the last
(n len – n name+1) columns of the encoding matrix should be set to all
zeros. While if a test name has length greater than n len you can choose
what to do.
The most obvious choices are to either use the first n len or
the last n len characters in the name and ignore the other ones.
Encode your input names I recommend that you convert each name
into its matrix encoding and then vectorize it, via the (:) operator, into a
vector of length d * n len. You can then store all the vectorized inputs
into a matrix X of size (d * n len) × N where N is the number of names
in the dataset. If you save X you can then load it quickly each time you run
your code.
For this assignment we will just partition the data into a training and validation set. (Not a great machine learning practice but the main purpose of this
assignment is get the training of a ConvNet up and running efficiently……)
You can download the indices of the inputs I used for the validation set,
Validation Inds.txt from Canvas webpage. I assume the same ordering
of the names in the original input file.
0.2 Set hyper-parameters & initialize the ConvNet’s parameters
The hyper-parameters you need to define the ConvNet’s architecture are
• the number of filters applied at layer 1: n1
• the width of the filters applied at layer 1: k1 (Remember the filter
applied will have size d × k1.)
• the number of filters applied at layer 2: n2
• the width of the filters applied at layer 2: k2 (Remember the filter
applied will have size n1 × k2.)
The other hyper-parameters you need to set are those associated with training and these are the learning rate eta and the momentum term rho. I used
the settings .001 and .9.
In my code I found it easiest to store the parameters of the model in an
object called ConvNet. In this object I defined a cell array containing the
weight/filter matrices for each layer. For example:
ConvNet.F{1} = randn(d, k1, n1)*sig1;
ConvNet.F{2} = randn(n1, k2, n2)*sig2;
ConvNet.W = randn(K, fsize)*sig3;
where fsize is the number of elements in X(2) in equation (4). I use He
initialization to set sig1, sig2 and sig3. However, you should be careful in
setting sig1 as the input vector is very sparse with the non-zero elements
equalling 1 and this changes the assumptions of He initialization. How
should you change the formula?
0.3 Construct the convolution matrices
Before you begin to write the code to compute the gradients you need
to write one function that will allow to construct the matrices Mfilter
F1,nlen
,
Mfilter
F2,nlen1
and another that will allow you to construct Minput
x,k1,n1
and Minput
x,k2,n2
.
For the later matrices we have assumed the input vector is already in its
vectorized form as this is the form you will have during training and testing. (Remember in Matlab you can transfer between the matrix and vector
encodings of an input example with
x input = X input(:)
X input = reshape(x input, [d, nlen]);
where X input has size d × nlen. Note if you have x input then you
either need to know d or nlen to do the reshaping as you can find the other
dimension of the 2D matrix by dividing the length of x input by the known
dimension.)
It is crucial that you get your implementation of these functions correct as
they are central for computing the gradients. Now for some hand-holding!
The first function you write (returning the matrix, based on the entries in
the convolutional filter, to perform all the convolutions at a given layer)
should take as input the 3D array containing the convolutional filters at one
layer and the number of columns of the input to that layer. So if it is layer 1
then this latter number is nlen. The declaration of the function could look
something like
function MF = MakeMFMatrix(F, nlen)
The function then outputs the matrix MF of size (nlen-k+1)*nf × nlen*dd
where
[dd, k, nf] = size(F);
The second function you write (returning the matrix, based on the entries
in the input vector, to perform all the convolutions at a given layer) should
take as input the vector of size × 1 corresponding the vectorized version
of the input to the convolutional layer, the height and width of each filter
applied and the number of filters that will be applied. So if it is layer 1 and
you will apply the 2D filters stored in ConvNet.F{1} then the numbers you
need are
[d, k, nf] = size(ConvNet.F{1})
The declaration of the function could look something like
function MX = MakeMXMatrix(x input, d, k, nf)
The function then outputs the matrix MX of size (nlen-k+1)*nf × k*nf*d.
The first debug check you can perform is that for the first convolutional
layer you compute MX for one training input x input and MF. Then you can
compute
s1 = MX * ConvNet.F{1}(:);
s2 = MF * x input;
and check s1 and s2 are the same.
To also help confirm you have a bugfree implementation, you can download from the Canvas webpage the file
DebugInfo.mat. From this Matlab data file you can upload the following
variables:
• name – the 1d array of characters containing the original name of the input.
• X input – the matrix numerical encoding of name of size 28 × nlen
• x input – the 28*nlen × 1 vector corresponding to the vectorised version of X input
• F – the 28 × 5 × 4 3D array corresponding to the 4 convolutional filters applied
to X input. The first filter corresponds to F(:, :, 1), the second to F(:, :, 2),
etc.
• vecF – the 28*5*4 × 1 vector corresponding to the vectorized version of F. Note
in Matlab the operation F(:) vectorises F by first flattening F(:, :, 1) then F(:,
:, 2) etc.
13
• S – the 4 × nlen-5+1 matrix corresponding to the output of convolving X input
with the filters stored in F.
• vecS – the 4*(nlen-4) × 1 vector corresponding to the vectorized version of S.
From x input and F you can use your newly written functions to compute
MX and MF and then check if applying these matrices results in the vector
vecS. You can also check that if you reshape your calculated version of vecS
you get S.
Once you have a bug-free version of MakeMXMatrix and MakeMFMatrix then
you can spend some time optimizing your implementation especially (if you
have used the function kron to build MX). Remember the profile command
is your friend when optimizing code in Matlab.
0.4 Implement the forward & backward pass of back-prop
Now you are ready to implement the back-propagation algorithm described
by the equations in the Background 4 section. I suggest you first start with
a mini-batch of size 1, then 2 and then 3 and also that you have a relatively
small number of filters per layer (say 5).
When I did the numerical checks
of my analytical gradients, I applied 5 filters of width 5 in both layers. To
perform the numerical checks of your gradient you will need to write the
function to compute the loss for a mini-batch of data, see equation (27).
function loss = ComputeLoss(X batch, Ys batch, MFs, W)
where X batch is matrix of size nlen*d × n that contains all the vectorised
input data (each column corresponds to one vectorized datapoint), Y batch
is the one-hot encoding of the labels of each example, MFs is a cell array
containing the MF matrix for each layer and W is the 2D weight matrix for
the last fully connected layer.
Here I suggest you use the MF matrices to
carry out the convolutions as this will be the most efficient way to compute
the loss on the validation set during training (if you don’t explicitly exploit
the sparse of X batch).
From the course Canvas page you can download my code for computing the
numerical gradients. You will, of course, have to adapt it to fit your code.
As per usual you should take care to make your numerical and analytical
gradients match before you move on to the next stage.
0.5 Train using mini-batch gradient descent with momentum
Once you have convinced yourself that your analytic gradient calculations
are correct, you are ready to write code to start training the parameters of
your network. I suggest that you begin with an implementation of ordinary
mini-batch gradient descent.
When you have implemented that then upgrade the optimizer to include a momentum term. Just one technical detail
in the back-prop equations you compute – the gradients you compute are
14
all vectors but you will probably have stored the convolutional filters as 3D
arrays. You can use the reshape function to
grad F1 = reshape(grad vecF1, [d, k1, n1]);
I suggest you get the training algorithm working for a small sized network
for example filters of size 5 or 3 at each layer and then ∼10 filters at each
layer and mini-batch size 100. After every nupdateth update step you should
compute the validation loss and accuracy. I set nupdate = 500. What I also
found useful was to also regularly compute and print out the confusion
matrix generated by the network when applied to the validation set. The
confusion matrix is the K × K matrix M where Mij is the number of examples with label i that are classified as class j by your classifier.
By examining
this matrix you can check whether the unbalanced data is causing a problem
such as all examples are classified to a small subset of the classes.
When you do the initial experiments with your small network you should
get sufficiently good results to indicate
1. the speed bottlenecks in your code when you use the profile command,
2. whether some form of useful learning is occurring and
3. whether the unbalanced dataset is causing a problem or not.
At this stage you can decide to implement the computational efficiencies
described in the Background 5 section especially if you want to train a
much bigger network for ∼100,000 update steps and to ensure it will not take
too long. Another tip is that Matlab makes copies of everything when you
pass variables to functions (there really is no concept of passing by reference).
Thus in the innermost loops it makes sense to cut-and-paste code from the
functions you call. Yes, it makes things much less readable but it in many
cases it saves lots of time. And as always in Matlab pre-allocate memory for
matrices whenever you can!
Once you have a relatively efficient implementation then you can start to
do some serious training and search for a good parameter setting. What I
found, in general, was that the more filters I applied at each layer the better
performance I got in a fewer number of update steps.
0.6 Account for the unbalanced dataset
Because the dataset is quite unbalanced it makes sense to compensate for
this fact, so if at test time the proportion of names from each class is drastically different then during training your network can cope. In the section
Background 6 I describe how to do this. Please implement one of the
alternatives and notice the qualitative differences between training and the
intermediary values of the validation loss, accuracy and the confusion matrices when you compensate and do not compensate for the imbalanced classes.
To complete the assignment:
To pass the assignment you need to upload to Canvas:
1. The code for this assignment.
2. A brief pdf report with the following content:
i) State how you checked your analytic gradient computations and
whether you think that your gradient computations are bug free
for your ConvNet.
ii) Comment on how you compensated for the imbalanced dataset
during training.
iii) Include a graph of the validation loss, equation (47), for a longish
training ≥ 20, 000 update steps for a network with k1 = 5, k2 = 3
and n1 = 20, n2 = 20 when you train without compensating for
the unbalanced dataset and when you do. Include also the final
confusion matrix on the validation set in both cases. Comment
on the qualitative differences between the two different training
regimes.
iv) Include the validation loss function, equation (47), for your best
performing ConvNet. State the final validation accuracy of this
network and its accuracy for each class. Detail the parameter settings and how trained the network and the training parameters.
v) State whether you implemented the efficiency gains in Background 5 and by how much they helped speed up training. You
can quote how long it took to perform a fixed number of update
steps (∼100) with mini-batches of size 100 before and after the
upgrade.
vi) Show the probability vector output by your best network when
applied to your surname and those of 5 of your friends. Comment
on the results!
Exercise 2: Optional for bonus points
We have only scratched the surface of all the tweaks one could apply to make
our ConvNet a better predictor for this problem. They are many options
for improvement
• Implement a max-pooling layer see the reference paper for how they
applied this problem. This will allow you to apply more filters at each
layer because if you reduce nleni
at each layer then you will have less
computational and memory requirements during training for a fixed
number of filters.
• Make your code generic so you can define networks with L convolutional layers. And then perform a search for a good number of layers
and the number of filters applied at each layer. I’m interested in knowing does adding more convolutional layers make things better or does
training become too hard.
• Make your convolution implementation more generic so you can apply
filters with arbitrary strides and zero-padding.
• Include the omitted bias terms at each layer.
• Apply dilated convolutions. You could then get long range correlations
but with the same number of parameters and computational effort.
The list is endless… I will award 2 bonus for each reported significant upgrade implemented. You max out after 6 bonus points. I’m aware, of course,
that an improved network does not necessarily correspond to improved performance, but as long as you implemented something reasonable…