## Description

In this assignment you will train a ConvNet to predict the language of a

surname from its spelling. The inspiration for the assignment’s task is the

tutorial Classifying Names with a Character-Level RNN.

As in this tutorial

to make the task slightly harder I removed all accented letters, frequently

language specific, and replaced them with their non-accented version. We

are thus left with the English spelling of each name. As an example the

character S was replaced by S. To a native English speaker this seems per- ˇ

fectly reasonable, but I’m aware that this may seem odd to everybody else.

Apologies!

The inspiration for applying a ConvNet to a string of characters is the paper

Character-level Convolutional Networks for Text Classification and in this

assignment you will implement a simplified version of the network they use.

The final version of your code should contain these major components:

• Preparing Data: Read in the training data, determine the maximum

length of all the names in the dataset, determine the number of unique

characters in the text and set up mapping functions – one mapping

each character to a unique index and another mapping each index to

a character.

• Back-propagation: The forward and the backward pass of the backpropagation algorithm for a ConvNet (with multiple convolutional layers and one fully connected layer) to efficiently compute the gradients

for a mini-batch.

• Mini-batch gradient descent with momentum to update your

ConvNet’s parameters.

• Sampling of the training data during back-prop training to compensate for the fact that the training data contains unbalanced classes.

For example one language has over 9, 000 examples while others have

less than one hundred.

• Functions to compute the accuracy, loss and confusion matrix on

the validation set given the current values of the ConvNet’s parameters.

## Background 1: A two layer ConvNet

The mathematical details of the ConvNet you will implement are now described. Each input in its original form is a fixed length sequence of nlen

letters. Each letter will be represented by a one-hot encoding vector of

length d. Each name, therefore will be encoded as a matrix of size d × nlen

corresponding to the concatenation of the one-hot (column) vectors for each

letter in the name.

(If a name has less then nlen letters then we will pad

the input matrix with column vectors of 0’s until the input matrix has nlen

columns.) Let X represent this input matrix.

The network function that

you will apply to the input X is:

x

(1)

i = max(0, X ∗ F1i) for i = 1, . . . , n1 (1)

X(1) =

x

(1)

1

, x

(1)

2

, . . . , x

(1)

n1

T

(2)

x

(2)

i = max(0, X(1) ∗ F2i) for i = 1, . . . , n2 (3)

X(2) =

x

(2)

1

, x

(2)

2

, . . . , x

(2)

n2

T

(4)

s = Wvec(X(2)) (5)

p = SoftMax(s) (6)

where

• The d × nlen input matrix X is sparse and has at most one non-zero

(=1) element per column.

• Each filter F1i has size d×k1 and is applied with stride 1 and no zeropadding. We denote the set of layer 1 filters as F1 = {F1i

, . . . F1,n1

}.

• Each x

(1)

i

has size nlen1 × 1 where nlen1 = nlen − k1 + 1 which =⇒

X(1) has size n1 ×nlen1

. Each row of X(1) corresponds to the output of

convolving the input with one of the filters. We use this arrangement

of the data to be consistent with Classifying Names with a CharacterLevel RNN.

• Each filter F2i has size n1×k2 and is applied with stride 1 and no zeropadding. We denote the set of layer 2 filters with F2 = {F2i

, . . . F2,n2

}.

• Each x

(2)

i

has size nlen2 × 1 where nlen2 = nlen1 − k2 + 1 which =⇒

X(2) has size n2 × nlen2

.

• W, the weight matrix for the fully connected layer, has size K×n2∗nlen2

=⇒ s has size K × 1.

• In the lecture notes I have the convention that the operator vec(·)

makes a vector by traversing the elements of the matrix row by row

from left to right.

However, in this assignment to make things consistent with Matlab’s (:) operation (and how it stores data matrix in

memory) we will assume that vec(·) makes a vector by traversing the

elements column by column from top to bottom. This simple example

illustrates:

X =

X11 X12

X21 X22

=⇒ vec(X) =

X11

X21

X12

X22

I have omitted all bias terms in the network for the sake of clarity and

simplicity of implementation.

## Background 2: Writing the convolution as a matrix multiplication I

To make the back-propagation algorithm transparent and relatively efficient

(as we will be running this on CPU not a GPU and to take computational

advantage of the sparse input data) we will set up the convolutions performed as matrix multiplications.

In the following we will show how you

can set up the appropriate matrix based on the entries of the applied filter

for a small example. Let X be a 4 × 4 input matrix and F be a 4 × 2 filter:

X =

X11 X12 X13 X14

X21 X22 X23 X24

X31 X32 X33 X34

X41 X42 X43 X44

, F =

F11 F12

F21 F22

F31 F32

F41 F42

. (7)

When we convolve X with F (stride 1 and no zero-padding), the output

vector has size 3 × 1.

s = X ∗ F =⇒ s = Mfilter

F,4 vec(X) (8)

where Mfilter

F,4

is 3 × 16 and the subscript 4 denotes the number of rows in X.

The latter is needed to specify the number of columns in Mfilter

F,4

given the

width of F. Thus

Mfilter

F,4 =

F11 F21 F31 F41 F12 F22 F32 F42 0 0 0 0 0 0 0 0

0 0 0 0 F11 F21 F31 F41 F12 F22 F32 F42 0 0 0 0

0 0 0 0 0 0 0 0 F11 F21 F31 F41 F12 F22 F32 F42!

(9)

=

← v

T → 0

T

4 0

T

4

0

T

4 ← v

T → 0

T

4

0

T

4 0

T

4 ← v

T →

(10)

where v = vec(F). Hopefully you can glean the pattern for the construction

of Mfilter

F,nlen

when X has size d × nlen and F has size d × k (where k ≤ nlen).

In the assignment you will apply multiple filters at each convolutional layer.

Continuing with our simple example let’s say we apply 2 filters F1, F2:

si = X ∗ Fi for i = 2 (11)

and these are stacked to get the response map:

S =

s1, s2

T

. (12)

We can apply these two convolutions with this matrix multiplication

vec(S) =

← VF1,F2 → 02×4 02×4

02×4 ← VF1,F2 → 02×4

02×4 02×4 ← VF1,F2 →

= Mfilter

F1,F2,4vec(X) (13)

where VF1,F2

is 2 × 8 matrix with

VF1,F2 =

vec(F1)

T

vec(F2)

T

!

(14)

You can, of course, then apply nf filters to vec(X) with this matrix:

Mfilter

F,4 =

←− VF −→ 0nf ×4 0nf ×4

0nf ×4 ←− VF −→ 0nf ×4

0nf ×4 0nf ×4 ←− VF −→

(15)

where F = {F1, . . . , Fnf

} and VF is a nf × 8 matrix with

VF =

vec(F1)

T

.

.

.

vec(Fnf

)

T

(16)

Benefits of this formulation

Efficient forward pass of back-prop During training we need to construct Mfilter

Fi,nleni−1

for each layer i after each update of the parameters. However, once we have this matrix it is trivial to apply it to a mini-batch of

data.

For a mini-batch of input training data, X1, . . . , Xn then

vec(S1) . . . vec(Sn)

= Mfilter

Fi,nleni−1

vec(X1) . . . vec(Xn)

(17)

where each Sj is the output of convolving Xj with the filters Fi = {Fi1, . . . , Fi,ni

}.

Simple and efficient formulation to propagate gradient through a

convolution As was shown in the lectures if we write the convolution as

a matrix multiplication in the form just described then the gradient of the

loss w.r.t. X(1) can be written as

∂l

∂vec(X(1))

=

∂l

∂vec(S(2))

Mfilter

F2,nlen1

(18)

This matrix multiplication has to be performed for each input in the minibatch and is done so efficiently with this formulation.

## Background 3: Writing the convolution as a matrix multiplication II

For the back-propagation algorithm we will also need to write the convolution as a matrix multiplication where the matrix is made up of the elements

of the input data. Let’s go back to our example where X is the 4 × 4 input

matrix and F is 4 × 2 filter (see equation (7)). When we convolve X with

F (stride 1 and no zero-padding) the output vector has size 3 × 1. We can

write this convolution as a matrix convolution:

s = X ∗ F =⇒ s = Minput

X,2

vec(F) (19)

where Minput

X,2

is a matrix of size 3 × 8 where the subscript 2 corresponds to

the width of the filter F and

Minput

X,2 =

X11 X21 X31 X41 X12 X22 X32 X42

X12 X22 X32 X42 X13 X23 X33 X43

X13 X23 X33 X43 X14 X24 X34 X44

(20)

=

vec(X:,1:2)

T

vec(X:,2:3)

T

vec(X:,3:4)

T

(21)

The notation X:,1:2 indicates the 4×2 sub-matrix of X corresponding to the

columns 1 and 2 of X. Conceptually the rows of Minput

X,2

correspond to the

all sub-blocks of X which are involved with a dot-product with F when the

convolution with stride 1 and no zero-padding is applied.

In the assignment you will apply multiple filters at each convolutional layer.

As in the previous section first consider the case where 2 filters F1 and F2

are applied (see equation (11)):

vec(S) =

vec(X:,1:2)

T 08

08 vec(X:,1:2)

T

vec(X:,2:3)

T 08

08 vec(X:,2:3)

T

vec(X:,3:4)

T 08

08 vec(X:,3:4)

T

vec(F1)

vec(F2)

!

= Minput

X,2,2

vec(F) (22)

5

where the third subscript in Minput

X,2,2

refers to the number of filters we want to

apply.

Note we can use the Kroneckor product to streamline the notation:

Minput

X,2,2 =

I2 ⊗ vec(X:,1:2)

T

I2 ⊗ vec(X:,2:3)

T

I2 ⊗ vec(X:,3:4)

T

(23)

However, at implementation time equation (23) is a very slow way to calculate Minput

X,ki,ni

as it introduces lots of unnecessary multiplications by 0.

Though it is a useful formulation as you are probably more likely to write

a bug free implementation with this formulation than with a more efficient

calculation of Minput

X,ki,ni

.

We can then apply 3 filters with this matrix:

Minput

X,2,3 =

vec(X:,1:2)

T 0

T

8 0

T

8

0

T

8

vec(X:,1:2)

T 0

T

8

0

T

8 0

T

8

vec(X:,1:2)

T

vec(X:,2:3)

T 0

T

8 0

T

8

0

T

8

vec(X:,2:3)

T 0

T

8

0

T

8 0

T

8

vec(X:,2:3)

T

vec(X:,3:4)

T 0

T

8 0

T

8

0

T

8

vec(X:,3:4)

T 0

T

8

0

T

8 0

T

8

vec(X:,3:4)

T

(24)

=

I3 ⊗ vec(X:,1:2)

T

I3 ⊗ vec(X:,2:3)

T

I3 ⊗ vec(X:,3:4)

T

(25)

Hopefully now you begin to see the pattern and you can extrapolate what

the matrix Minput

X,2,nf

will be.

The purpose of this formulation is that we can write the gradient of the loss

w.r.t. filters, vec(Fi), at layer i, in our two layer network simply as

∂L

∂vec(Fi)

=

∂L

∂vec(S(i))

Minput

X(i−1),ki,ni

(26)

## Background 4: Equations for the back-propagation algorithm

As per usual let L represent the cost function for the mini-batch of data B

that is

L(B, F1, F2, W) = 1

|B|

X

(x,y)∈B

lcross(x, y, F1, F2, W) (27)

6

and to learn the parameters F1, F2, W we need to compute the gradient of

L w.r.t. each one.

In the description of the back-propagation algorithm we will assume that

the d × n matrix Xbatch contains the input vectors for each training example

in the mini-batch. Thus each column of Xbatch corresponds to the vectorised

version of the corresponding input vector.

The notation in the following is

slightly inconsistent at times (but hopefully understandable). I refer to the

convolution matrix Minput

X,k2,n2

w.r.t. the vectorized version of input matrices

instead of the 2D array and a couple of times I use Matlab notation to denote

the column of a matrix due to fact that I already have too many sub and

superscripts to start including more!

Forward Pass

• Construct Mfilter

F1,nlen

for the first convolutional layer and Mfilter

F2,nlen1

for

the second convolutional layer.

• Apply the first convolutional layer to the input data

X

(1)

batch = max(Mfilter

F1,nlenXbatch, 0) (28)

• Apply the second convolutional layer

X

(2)

batch = max(Mfilter

F2,nlen1

X(1)

, 0) (29)

• Apply the fully connected matrix

Sbatch = W X(2)

batch (30)

• Apply the SoftMax operator to each column of Sbatch to get Pbatch.

For the backward pass of the back-prop algorithm you will need to keep a

record of X

(1)

batch, X(2)

batch, Pbatch.

Backward Pass

• Set all the entries in the gradient matrix and vectors, ∂L

∂W ,

∂L

∂vec(F1)

and

∂L

∂vec(F2)

to zero.

• Let

Gbatch = −(Ybatch − Pbatch) (31)

7

• Then the gradient of l w.r.t. W is given by

∂L

∂W =

1

n

Xn

j=1

Gbatch(:, j)X

(2)

batch(:, j)

T =

1

n

GbatchX

(2)T

batch (32)

• Propagate the gradient through the fully connected layer and the second ReLu function.

Gbatch = WTGbatch (33)

Gbatch = Gbatch Ind(X

(2)

batch > 0) (34)

where represents element-by-element multiplication (i.e. .∗ in Matlab

notation) of two matrices.

• Compute the gradient w.r.t. the second layer convolutional filters:

for j = 1, . . . , n

gj = Gbatch(:, j) (35)

xj = X(1)(:, j) (36)

v = g

T

j Minput

xj ,k2,n2

(37)

∂L

∂vec(F2)

=

∂L

∂vec(F2)

+

1

n

v (38)

• Propagate the gradient to the previous layer through the second convolutional layer and first ReLu operation:

Gbatch =

Mfilter

F2,

T Gbatch (39)

Gbatch = Gbatch Ind(X

(1)

batch > 0) (40)

• Compute the gradient w.r.t. the first layer convolutional filters:

for j = 1, . . . , n

gj = Gbatch(:, j) (41)

xj = Xbatch(:, j) (42)

v = g

T

j Minput

xj ,k1,n1

(43)

∂L

∂vec(F1)

=

∂L

∂vec(F1)

+

1

n

v (44)

The description of how to implement the back-prop algorithm I have given

is somewhat optimized for implementation in Matlab. In many of the deep

learning software packages the back-propagation algorithm is frequently implemented via a formulation similar to this. Though of course there is a

trade-off between speed and memory consumption. For a certain size of

problem it may not be possible to perform the back-prop as described.

## Background 5: Extra measures for a more efficient implementation

The matrix multiplications performed in equations (37) and (43) are quite

computationally wasteful because each Minput

xj ,ki,ni

has lots of zeros (see equation 24). Thus there are few options to make the computations more efficient

• Pre-compute for the first layer the Minput

xj ,k1,n1

’s and make each one

sparse. This is worthwhile because each Minput

xj ,k1,n1

only depends on

the input data and remains fixed throughout training. Also remember

each input data-point is encoded as a sparse matrix/vector so Mxj ,k1,n1

is really very sparse and there are big performance gains to be made

with this simple step.

• Unfortunately each Minput

xj ,k2,n2

cannot be pre-computed, made sparse

and then stored because each one is constructed from the responses

after applying the first convolution layer and this will change every

time the network’s parameters change. Also the vectors xj at layer 1

are not necessarily sparse.

However, do not despair you can still make

improvements. Instead of creating Minput

xj ,k2,n2

you should create Minput

xj ,k2

(the generalization of equation (21) to filters of width k2 and input of

size n1 × nlen1

) and then you can write

v = g

T

j Minput

xj ,k2,n2

(45)

equivalently as

V = MT

xj ,k2

gj,1 gj,2 · · · gj,n2

gj,1+n2 gj,2+n2

· · · gj,n2+n2

.

.

.

.

.

.

.

.

.

.

.

.

gj,1+(nlen2−1)n2

gj,2+(nlen2−1)n2

· · · gj,n2+(nlen2−1)n2

(46)

s.t. vec(V ) = v. It is left as an exercise to show that this is true.

## Background 6: Compensating for unbalanced training data

In this assignment you encounter unbalanced training data for the first time.

If you do not compensate for this fact you run the risk that your ConvNet

will just learn to classify the dominating classes correctly. There are several

9

practical ways you can overcome this problem. One way is to define a new

loss function for a mini-batch:

Lupsampled(B, F1, F2, W) = 1

|B|

X

(X,y)∈B

py l(X, y, F1, F2, W) (47)

where py is a weight applied to the loss for each individual training example in the batch. The value py has a constant value for each class and is

commonly set to

py =

1

K

1

ny

(48)

where ny is the total number of training examples (in the whole training set)

with label y and K is the number of classes. If you use this loss function then

you effectively down-weight the loss from the popular classes much more

aggressively than those from the rare classes.

The updated loss function

then transfers to the gradient calculations in a simple way, for example

∂Lupsampled

∂F1

=

1

|B|

X

(X,y)∈B

py

∂l(X, y, F1, F2, W)

∂F1

(49)

If you use this loss function you will still use the same back-prop equations

described in equations (31) – (44) except in equations (32), (38) and (44)

where you sum up the individual gradients of the examples in the batch. For

these equations you will have to include the re-weighting factor associated

with each class.

The other option is that for each epoch of training you randomly sample the

same number (this number is usually set to the number of examples from

the smallest class) of examples from each class and this becomes the effective

training set for this epoch. This latter option is what I implemented for the

assignment and it worked fine. You are free to choose either approach.

## Exercise 1: Implement and train a two-layer ConvNet

In the following I will sketch the different parts you will need to write to

complete the assignment. Note it is a guideline. You can, of course, have

a different design, but you should read the outline to help inform how different parameters and design choices are made and how I have gone about

optimizing calculations that need to be computed.

0.1 Read in the data & get it ready

First you need to read in the dataset from the text file ascii names.txt

available the Canvas website. To save you some time I have written Matlab

code (available from the Canvas page) that will read in the contents of this

text file and put all the names and their corresponding labels in the cell

array all names and a vector ys respectively. The code also converts all

uppercase letters to lowercase.

You have access to all the characters used to generate the names. To get a

vector containing the unique characters you can apply this Matlab code:

C = unique(cell2mat(all names));

d = numel(C);

Once you have this list then its length d corresponds to the dimensionality

of the one-hot vector used to encode each character (d in the Background

1 section). You can also examine the contents in all names to get the

maximum length of name in the dataset and set n len to this value. You

can also use the unique function on ys to get the number of classes K we

are trying to predict (it is 18).

To allow you to easily go between a character and its one-hot encoding, you

should initialize a map container of the form:

char to ind = containers.Map(’KeyType’,’char’,’ValueType’,’int32’);

Then for char to ind you should fill in the characters in your alphabet

as its keys and create an integer for its value (keep things simple and use

where the character appears in the vector C as its value).

You will use this

map container to convert a name into a matrix of size d × n len. The

ith column of this matrix should correspond to the one-hot encoding of the

ith letter in the name. If a name has length less then n len then the last

(n len – n name+1) columns of the encoding matrix should be set to all

zeros. While if a test name has length greater than n len you can choose

what to do.

The most obvious choices are to either use the first n len or

the last n len characters in the name and ignore the other ones.

Encode your input names I recommend that you convert each name

into its matrix encoding and then vectorize it, via the (:) operator, into a

vector of length d * n len. You can then store all the vectorized inputs

into a matrix X of size (d * n len) × N where N is the number of names

in the dataset. If you save X you can then load it quickly each time you run

your code.

For this assignment we will just partition the data into a training and validation set. (Not a great machine learning practice but the main purpose of this

assignment is get the training of a ConvNet up and running efficiently……)

You can download the indices of the inputs I used for the validation set,

Validation Inds.txt from Canvas webpage. I assume the same ordering

of the names in the original input file.

0.2 Set hyper-parameters & initialize the ConvNet’s parameters

The hyper-parameters you need to define the ConvNet’s architecture are

• the number of filters applied at layer 1: n1

• the width of the filters applied at layer 1: k1 (Remember the filter

applied will have size d × k1.)

• the number of filters applied at layer 2: n2

• the width of the filters applied at layer 2: k2 (Remember the filter

applied will have size n1 × k2.)

The other hyper-parameters you need to set are those associated with training and these are the learning rate eta and the momentum term rho. I used

the settings .001 and .9.

In my code I found it easiest to store the parameters of the model in an

object called ConvNet. In this object I defined a cell array containing the

weight/filter matrices for each layer. For example:

ConvNet.F{1} = randn(d, k1, n1)*sig1;

ConvNet.F{2} = randn(n1, k2, n2)*sig2;

ConvNet.W = randn(K, fsize)*sig3;

where fsize is the number of elements in X(2) in equation (4). I use He

initialization to set sig1, sig2 and sig3. However, you should be careful in

setting sig1 as the input vector is very sparse with the non-zero elements

equalling 1 and this changes the assumptions of He initialization. How

should you change the formula?

0.3 Construct the convolution matrices

Before you begin to write the code to compute the gradients you need

to write one function that will allow to construct the matrices Mfilter

F1,nlen

,

Mfilter

F2,nlen1

and another that will allow you to construct Minput

x,k1,n1

and Minput

x,k2,n2

.

For the later matrices we have assumed the input vector is already in its

vectorized form as this is the form you will have during training and testing. (Remember in Matlab you can transfer between the matrix and vector

encodings of an input example with

x input = X input(:)

X input = reshape(x input, [d, nlen]);

where X input has size d × nlen. Note if you have x input then you

either need to know d or nlen to do the reshaping as you can find the other

dimension of the 2D matrix by dividing the length of x input by the known

dimension.)

It is crucial that you get your implementation of these functions correct as

they are central for computing the gradients. Now for some hand-holding!

The first function you write (returning the matrix, based on the entries in

the convolutional filter, to perform all the convolutions at a given layer)

should take as input the 3D array containing the convolutional filters at one

layer and the number of columns of the input to that layer. So if it is layer 1

then this latter number is nlen. The declaration of the function could look

something like

function MF = MakeMFMatrix(F, nlen)

The function then outputs the matrix MF of size (nlen-k+1)*nf × nlen*dd

where

[dd, k, nf] = size(F);

The second function you write (returning the matrix, based on the entries

in the input vector, to perform all the convolutions at a given layer) should

take as input the vector of size × 1 corresponding the vectorized version

of the input to the convolutional layer, the height and width of each filter

applied and the number of filters that will be applied. So if it is layer 1 and

you will apply the 2D filters stored in ConvNet.F{1} then the numbers you

need are

[d, k, nf] = size(ConvNet.F{1})

The declaration of the function could look something like

function MX = MakeMXMatrix(x input, d, k, nf)

The function then outputs the matrix MX of size (nlen-k+1)*nf × k*nf*d.

The first debug check you can perform is that for the first convolutional

layer you compute MX for one training input x input and MF. Then you can

compute

s1 = MX * ConvNet.F{1}(:);

s2 = MF * x input;

and check s1 and s2 are the same.

To also help confirm you have a bugfree implementation, you can download from the Canvas webpage the file

DebugInfo.mat. From this Matlab data file you can upload the following

variables:

• name – the 1d array of characters containing the original name of the input.

• X input – the matrix numerical encoding of name of size 28 × nlen

• x input – the 28*nlen × 1 vector corresponding to the vectorised version of X input

• F – the 28 × 5 × 4 3D array corresponding to the 4 convolutional filters applied

to X input. The first filter corresponds to F(:, :, 1), the second to F(:, :, 2),

etc.

• vecF – the 28*5*4 × 1 vector corresponding to the vectorized version of F. Note

in Matlab the operation F(:) vectorises F by first flattening F(:, :, 1) then F(:,

:, 2) etc.

13

• S – the 4 × nlen-5+1 matrix corresponding to the output of convolving X input

with the filters stored in F.

• vecS – the 4*(nlen-4) × 1 vector corresponding to the vectorized version of S.

From x input and F you can use your newly written functions to compute

MX and MF and then check if applying these matrices results in the vector

vecS. You can also check that if you reshape your calculated version of vecS

you get S.

Once you have a bug-free version of MakeMXMatrix and MakeMFMatrix then

you can spend some time optimizing your implementation especially (if you

have used the function kron to build MX). Remember the profile command

is your friend when optimizing code in Matlab.

0.4 Implement the forward & backward pass of back-prop

Now you are ready to implement the back-propagation algorithm described

by the equations in the Background 4 section. I suggest you first start with

a mini-batch of size 1, then 2 and then 3 and also that you have a relatively

small number of filters per layer (say 5).

When I did the numerical checks

of my analytical gradients, I applied 5 filters of width 5 in both layers. To

perform the numerical checks of your gradient you will need to write the

function to compute the loss for a mini-batch of data, see equation (27).

function loss = ComputeLoss(X batch, Ys batch, MFs, W)

where X batch is matrix of size nlen*d × n that contains all the vectorised

input data (each column corresponds to one vectorized datapoint), Y batch

is the one-hot encoding of the labels of each example, MFs is a cell array

containing the MF matrix for each layer and W is the 2D weight matrix for

the last fully connected layer.

Here I suggest you use the MF matrices to

carry out the convolutions as this will be the most efficient way to compute

the loss on the validation set during training (if you don’t explicitly exploit

the sparse of X batch).

From the course Canvas page you can download my code for computing the

numerical gradients. You will, of course, have to adapt it to fit your code.

As per usual you should take care to make your numerical and analytical

gradients match before you move on to the next stage.

0.5 Train using mini-batch gradient descent with momentum

Once you have convinced yourself that your analytic gradient calculations

are correct, you are ready to write code to start training the parameters of

your network. I suggest that you begin with an implementation of ordinary

mini-batch gradient descent.

When you have implemented that then upgrade the optimizer to include a momentum term. Just one technical detail

in the back-prop equations you compute – the gradients you compute are

14

all vectors but you will probably have stored the convolutional filters as 3D

arrays. You can use the reshape function to

grad F1 = reshape(grad vecF1, [d, k1, n1]);

I suggest you get the training algorithm working for a small sized network

for example filters of size 5 or 3 at each layer and then ∼10 filters at each

layer and mini-batch size 100. After every nupdateth update step you should

compute the validation loss and accuracy. I set nupdate = 500. What I also

found useful was to also regularly compute and print out the confusion

matrix generated by the network when applied to the validation set. The

confusion matrix is the K × K matrix M where Mij is the number of examples with label i that are classified as class j by your classifier.

By examining

this matrix you can check whether the unbalanced data is causing a problem

such as all examples are classified to a small subset of the classes.

When you do the initial experiments with your small network you should

get sufficiently good results to indicate

1. the speed bottlenecks in your code when you use the profile command,

2. whether some form of useful learning is occurring and

3. whether the unbalanced dataset is causing a problem or not.

At this stage you can decide to implement the computational efficiencies

described in the Background 5 section especially if you want to train a

much bigger network for ∼100,000 update steps and to ensure it will not take

too long. Another tip is that Matlab makes copies of everything when you

pass variables to functions (there really is no concept of passing by reference).

Thus in the innermost loops it makes sense to cut-and-paste code from the

functions you call. Yes, it makes things much less readable but it in many

cases it saves lots of time. And as always in Matlab pre-allocate memory for

matrices whenever you can!

Once you have a relatively efficient implementation then you can start to

do some serious training and search for a good parameter setting. What I

found, in general, was that the more filters I applied at each layer the better

performance I got in a fewer number of update steps.

0.6 Account for the unbalanced dataset

Because the dataset is quite unbalanced it makes sense to compensate for

this fact, so if at test time the proportion of names from each class is drastically different then during training your network can cope. In the section

Background 6 I describe how to do this. Please implement one of the

alternatives and notice the qualitative differences between training and the

intermediary values of the validation loss, accuracy and the confusion matrices when you compensate and do not compensate for the imbalanced classes.

To complete the assignment:

To pass the assignment you need to upload to Canvas:

1. The code for this assignment.

2. A brief pdf report with the following content:

i) State how you checked your analytic gradient computations and

whether you think that your gradient computations are bug free

for your ConvNet.

ii) Comment on how you compensated for the imbalanced dataset

during training.

iii) Include a graph of the validation loss, equation (47), for a longish

training ≥ 20, 000 update steps for a network with k1 = 5, k2 = 3

and n1 = 20, n2 = 20 when you train without compensating for

the unbalanced dataset and when you do. Include also the final

confusion matrix on the validation set in both cases. Comment

on the qualitative differences between the two different training

regimes.

iv) Include the validation loss function, equation (47), for your best

performing ConvNet. State the final validation accuracy of this

network and its accuracy for each class. Detail the parameter settings and how trained the network and the training parameters.

v) State whether you implemented the efficiency gains in Background 5 and by how much they helped speed up training. You

can quote how long it took to perform a fixed number of update

steps (∼100) with mini-batches of size 100 before and after the

upgrade.

vi) Show the probability vector output by your best network when

applied to your surname and those of 5 of your friends. Comment

on the results!

### Exercise 2: Optional for bonus points

We have only scratched the surface of all the tweaks one could apply to make

our ConvNet a better predictor for this problem. They are many options

for improvement

• Implement a max-pooling layer see the reference paper for how they

applied this problem. This will allow you to apply more filters at each

layer because if you reduce nleni

at each layer then you will have less

computational and memory requirements during training for a fixed

number of filters.

• Make your code generic so you can define networks with L convolutional layers. And then perform a search for a good number of layers

and the number of filters applied at each layer. I’m interested in knowing does adding more convolutional layers make things better or does

training become too hard.

• Make your convolution implementation more generic so you can apply

filters with arbitrary strides and zero-padding.

• Include the omitted bias terms at each layer.

• Apply dilated convolutions. You could then get long range correlations

but with the same number of parameters and computational effort.

The list is endless… I will award 2 bonus for each reported significant upgrade implemented. You max out after 6 bonus points. I’m aware, of course,

that an improved network does not necessarily correspond to improved performance, but as long as you implemented something reasonable…