Description
1 [25 points] Neural network layer implementation.
In this problem, you will implement various layers. $X$ and $Y$ represent the input and the output of a layer, respectively. We use row-major notation here (i.e., each row of $X$ and $Y$ corresponds to one data sample) to be compatible with Python-style coding. $L$ is a scalar-valued function of $Y$ (i.e., the loss function).
For each layer, your derivation of the gradient should be included in your written solution. In addition to the following derivations, implement the corresponding forward and backward passes in layers.py. The file autograder.py can help you check the correctness of your implementation. However, note that for each function the autograder only checks the result on one specific example, so passing these tests does not necessarily mean that your implementation is 100% correct.
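If you would like an additional check beyond the autograder's single example, a numerical gradient check is a simple way to compare a backward pass against finite differences. The sketch below is only illustrative; the names fc_forward/fc_backward and their (output, cache) return convention are assumptions and may not match the actual interfaces in layers.py.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of dL/dx for a scalar-valued f(x)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + eps
        plus = f(x)
        x[idx] = old - eps
        minus = f(x)
        x[idx] = old            # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
        it.iternext()
    return grad

# Example usage with hypothetical layers.py names (fc_forward / fc_backward):
# X, W, b = np.random.randn(4, 5), np.random.randn(5, 3), np.random.randn(3)
# dY = np.random.randn(4, 3)                       # random upstream gradient dL/dY
# Y, cache = fc_forward(X, W, b)
# dX, dW, db = fc_backward(dY, cache)
# num_dX = numerical_gradient(lambda X_: np.sum(fc_forward(X_, W, b)[0] * dY), X)
# print(np.max(np.abs(num_dX - dX)))               # should be ~1e-8 or smaller
```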
(a) [9 points] Fully-connected layer
Let $X \in \mathbb{R}^{N \times D_{in}}$, where $N$ is the number of samples in a batch. Consider a dense layer with parameters $W \in \mathbb{R}^{D_{in} \times D_{out}}$ and $b \in \mathbb{R}^{D_{out}}$. The layer outputs the matrix $Y \in \mathbb{R}^{N \times D_{out}}$ with $Y = XW + B$, where $B \in \mathbb{R}^{N \times D_{out}}$ and each row of $B$ is the vector $b$. Equivalently, for every $1 \le i \le N$, the sample $x^{(i)} \in \mathbb{R}^{D_{in}}$ is fed into this layer and the output is $y^{(i)} = W^{T} x^{(i)} + b$, where $y^{(i)} \in \mathbb{R}^{D_{out}}$.
Compute the partial derivatives $\frac{\partial L}{\partial W}$, $\frac{\partial L}{\partial b}$, and $\frac{\partial L}{\partial X}$ in terms of $\frac{\partial L}{\partial Y}$ and/or $\frac{\partial L}{\partial Y^{(i)}_{j}}$, where $\frac{\partial L}{\partial Y^{(i)}_{j}}$ is the gradient with respect to the $j$-th element of the $i$-th sample in $Y$. Please note that $\frac{\partial L}{\partial W}$ is a matrix in $\mathbb{R}^{D_{in} \times D_{out}}$ whose element in the $i$-th row and $j$-th column is $\frac{\partial L}{\partial W_{i,j}}$, where $W_{i,j}$ is the element in the $i$-th row and $j$-th column of $W$. Similarly, $\frac{\partial L}{\partial b} \in \mathbb{R}^{D_{out}}$, $\frac{\partial L}{\partial X} \in \mathbb{R}^{N \times D_{in}}$, and $\frac{\partial L}{\partial Y} \in \mathbb{R}^{N \times D_{out}}$. [Hint: you may first calculate the gradient $\frac{\partial L}{\partial W_{i,j}}$ using the formula
\[
\frac{\partial L}{\partial W_{i,j}} = \sum_{n=1}^{N} \sum_{m=1}^{D_{out}} \frac{\partial L}{\partial Y^{(n)}_{m}} \frac{\partial Y^{(n)}_{m}}{\partial W_{i,j}}.
\]
]
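For reference, a minimal NumPy sketch of how the resulting expressions are typically turned into code is given below. The function names and the (output, cache) convention are assumptions and may differ from the interface expected by layers.py; the derivation itself still has to appear in your written solution.

```python
import numpy as np

def fc_forward(X, W, b):
    """Forward pass of a dense layer: Y = XW + b, with b broadcast over rows.
    Assumed shapes: X (N, D_in), W (D_in, D_out), b (D_out,)."""
    Y = X @ W + b
    cache = (X, W, b)
    return Y, cache

def fc_backward(dY, cache):
    """Backward pass given the upstream gradient dY = dL/dY of shape (N, D_out)."""
    X, W, b = cache
    dX = dY @ W.T          # (N, D_in)
    dW = X.T @ dY          # (D_in, D_out)
    db = dY.sum(axis=0)    # (D_out,): b is shared across all rows of the batch
    return dX, dW, db
```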
(b) [5 points] ReLU
Let $X$ be a tensor and $Y = \mathrm{ReLU}(X)$. ReLU is applied to $X$ in an elementwise way. For an element $x \in X$, the corresponding output is $y = \mathrm{ReLU}(x) = \max(0, x)$. $Y$ has the same shape as $X$. Express $\frac{\partial L}{\partial X}$ in terms of $\frac{\partial L}{\partial Y}$, where $\frac{\partial L}{\partial X}$ has the same shape as $X$.
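A small NumPy sketch of the elementwise ReLU and one common way to write its backward pass is shown below; the function names are illustrative and may differ from those used in layers.py.

```python
import numpy as np

def relu_forward(X):
    """Elementwise Y = max(0, X); Y has the same shape as X."""
    Y = np.maximum(0, X)
    cache = X
    return Y, cache

def relu_backward(dY, cache):
    """dL/dX passes dL/dY through where X > 0 and is zero elsewhere."""
    X = cache
    dX = dY * (X > 0)
    return dX
```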
(c) [11 points] Convolution
Note: In this question, any indices involved (i, j, etc.) start from 1, and not 0.
Given 2-d tensors $a \in \mathbb{R}^{H_a \times W_a}$ and $b \in \mathbb{R}^{H_b \times W_b}$, we define the valid convolution and full convolution operations as follows:
\[
(a *_{valid} b)_{i,j} = \sum_{m=i}^{i+H_b-1} \sum_{n=j}^{j+W_b-1} a_{m,n}\, b_{i-m+H_b,\, j-n+W_b}
\]
\[
(a *_{full} b)_{i,j} = \sum_{m=i-H_b+1}^{i} \sum_{n=j-W_b+1}^{j} a_{m,n}\, b_{i-m+1,\, j-n+1}
\]
The convolution operation we discussed in class is valid convolution, and does not involve any zero padding. This operation produces an output of size $(H_a - H_b + 1) \times (W_a - W_b + 1)$.
Full convolution can be thought of as zero-padding $a$ on all sides with width and height one less than the size of the kernel (i.e., $H_b - 1$ vertically and $W_b - 1$ horizontally) and then performing valid convolution. In the definition of full convolution, $a_{m,n} = 0$ if $m < 1$ or $n < 1$ (and likewise if $m > H_a$ or $n > W_a$). This operation produces an output of size $(H_a + H_b - 1) \times (W_a + W_b - 1)$ (verify this).
It is also useful to consider the filtering operation $*_{filt}$, defined by
\[
(a *_{filt} b)_{i,j} = \sum_{m=i}^{i+H_b-1} \sum_{n=j}^{j+W_b-1} a_{m,n}\, b_{m-i+1,\, n-j+1} = \sum_{p=1}^{H_b} \sum_{q=1}^{W_b} a_{i+p-1,\, j+q-1}\, b_{p,q}
\]
The filtering operation is similar to the valid convolution, except that the filter is not flipped when
computing the weighted sum.
Assume the input to the layer is given by $X \in \mathbb{R}^{N \times C \times H \times W}$, where $N$ is the number of sample images, $C$ is the number of channels, $H$ is the height of a sample image, and $W$ is the width of a sample image. Consider a convolutional kernel $W \in \mathbb{R}^{F \times C \times H' \times W'}$. The output of this layer is given by $Y \in \mathbb{R}^{N \times F \times H'' \times W''}$, where $H'' = H - H' + 1$ and $W'' = W - W' + 1$.
The layer produces $F$ output feature maps defined by $Y_{n,f} = \sum_{c=1}^{C} X_{n,c} *_{valid} \widetilde{W}_{f,c}$, where $\widetilde{W}_{f,c}$ represents the flipped kernel (i.e., for a kernel $K$, the flipped kernel is defined by $\widetilde{K}_{i,j} = K_{H'+1-i,\, W'+1-j}$). Note that $Y_{n,f} = \sum_{c=1}^{C} X_{n,c} *_{valid} \widetilde{W}_{f,c} = \sum_{c=1}^{C} X_{n,c} *_{filt} W_{f,c}$ (so you may implement $Y_{n,f} = \sum_{c=1}^{C} X_{n,c} *_{filt} W_{f,c}$ in the conv_forward function in layers.py).
Show that
\[
\frac{\partial L}{\partial X_{n,c}} = \sum_{f=1}^{F} W_{f,c} *_{full} \frac{\partial L}{\partial Y_{n,f}}
\qquad \text{and} \qquad
\frac{\partial L}{\partial W_{f,c}} = \sum_{n=1}^{N} X_{n,c} *_{filt} \frac{\partial L}{\partial Y_{n,f}}
\]
Please note that the gradient $\frac{\partial L}{\partial X_{n,c}} \in \mathbb{R}^{H \times W}$, where the element of $\frac{\partial L}{\partial X_{n,c}}$ in the $i$-th row and $j$-th column is $\frac{\partial L}{\partial X_{n,c,i,j}}$, and $X_{n,c,i,j}$ is the scalar value in the $i$-th row, $j$-th column, $c$-th channel of the $n$-th sample image. Similarly, $\frac{\partial L}{\partial W_{f,c}} \in \mathbb{R}^{H' \times W'}$. [Hint: you may first derive the gradients $\frac{\partial L}{\partial X_{n,c,i,j}}$ and $\frac{\partial L}{\partial W_{f,c,i,j}}$ using the chain rule. Then it is easy to show $\frac{\partial L}{\partial X_{n,c}}$ and $\frac{\partial L}{\partial W_{f,c}}$.]
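As an indexing reference for the $*_{filt}$ form of the forward pass, a naive loop-based sketch is given below. It is not the required conv_forward implementation (the real function will likely need a cache for the backward pass and a different signature); it only illustrates how the window indices line up (0-based in code, versus 1-based in the formulas above).

```python
import numpy as np

def conv_forward_naive(X, W):
    """Direct evaluation of Y[n, f] = sum_c  X[n, c] *_filt W[f, c].
    Assumed shapes: X (N, C, H, W_img), W (F, C, HH, WW); no padding, stride 1."""
    N, C, H, W_img = X.shape
    F, _, HH, WW = W.shape
    H_out, W_out = H - HH + 1, W_img - WW + 1
    Y = np.zeros((N, F, H_out, W_out))
    for n in range(N):
        for f in range(F):
            for i in range(H_out):
                for j in range(W_out):
                    # filtering: no kernel flip, window starts at (i, j)
                    Y[n, f, i, j] = np.sum(X[n, :, i:i+HH, j:j+WW] * W[f])
    return Y
```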
To answer questions 2-4, please read through solver.py and familiarize yourself with its API. To build the models, please use the intermediate layers you implemented in question 1. After doing so, use a Solver instance to train the models.
2 [20 points] Softmax regression and beyond: multi-class classification with a softmax output layer
(a) Implement the softmax loss layer softmax_loss in layers.py (a rough sketch appears at the end of this question).
(b) Implement softmax multi-class classification using the starter code provided in softmax.py.
(c) Train a 2-layer neural network with softmax loss as the output layer on the MNIST dataset with digit_classification.py. Identify and report an appropriate number of hidden units based on the validation set. Report the best test accuracy for your best model.
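For item (a) above, a numerically stable sketch of a softmax cross-entropy loss is shown below. The label encoding and the exact signature expected by layers.py are assumptions here; adapt as needed.

```python
import numpy as np

def softmax_loss(scores, y):
    """scores: (N, K) class scores; y: (N,) integer labels in [0, K).
    Returns the mean cross-entropy loss and dL/dscores."""
    N = scores.shape[0]
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()
    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1
    dscores /= N
    return loss, dscores
```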
3 [20 points] Convolutional Neural Network for multi-class classification
(a) Implement the forward and backward passes of the max-pooling layer in layers.py (a rough sketch appears at the end of this question).
(b) Implement CNN for softmax multi-class classification using the starter code provided in cnn.py.
(c) Train the CNN multi-class classifier on the MNIST dataset with digit_classification.py. Identify and report an appropriate filter size for the convolutional layer and an appropriate number of hidden units in the fully-connected layer based on the validation set. Report the best test accuracy for your best model.
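For item (a) above, a loop-based sketch of 2×2 max pooling with stride 2 (the configuration mentioned in the update log) is shown below; the parameterization of the real layers.py functions may differ.

```python
import numpy as np

def max_pool_forward(X, pool=2, stride=2):
    """X: (N, C, H, W). Returns the pooled output and a cache for the backward pass."""
    N, C, H, W = X.shape
    H_out, W_out = (H - pool) // stride + 1, (W - pool) // stride + 1
    Y = np.zeros((N, C, H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            window = X[:, :, i*stride:i*stride+pool, j*stride:j*stride+pool]
            Y[:, :, i, j] = window.max(axis=(2, 3))
    return Y, (X, pool, stride)

def max_pool_backward(dY, cache):
    """Routes each upstream gradient entry to the argmax position of its window
    (ties send gradient to every maximal entry in that window)."""
    X, pool, stride = cache
    dX = np.zeros_like(X)
    N, C, H_out, W_out = dY.shape
    for i in range(H_out):
        for j in range(W_out):
            window = X[:, :, i*stride:i*stride+pool, j*stride:j*stride+pool]
            mask = (window == window.max(axis=(2, 3), keepdims=True))
            dX[:, :, i*stride:i*stride+pool, j*stride:j*stride+pool] += \
                mask * dY[:, :, i, j][:, :, None, None]
    return dX
```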
4 [20 points] Application to Image Captioning
In this problem, you will apply the RNN module you implemented to build an image captioning model.
Please unzip hymenoptera_data.zip to get the data.
(a) At every timestep we use a fully-connected layer to transform the RNN hidden vector at that timestep into scores for each word in the vocabulary. This is very similar to the fully-connected layer that you implemented in Q1. Implement the forward pass in the temporal_fc_forward function and the backward pass in the temporal_fc_backward function in rnn_layers.py. autograder.py can also help with debugging.
(b) In an RNN language model, at every timestep we produce a score for each word in the vocabulary. We
know the ground-truth word at each timestep, so we use a softmax loss function to compute loss and
gradient at each timestep. We sum the losses over time and average them over the minibatch. Since we
operate over minibatches and different captions may have different lengths, we append NULL tokens to
the end of each caption so they all have the same length. We don’t want these NULL tokens to count
toward the loss or gradient, so in addition to scores and ground-truth labels our loss function also accepts
a mask array that tells it which elements of the scores count towards the loss. This is very similar to
the softmax loss layer that you implemented in Q2. Implement the temporal_softmax_loss function in rnn_layers.py (a rough sketch appears at the end of this question). autograder.py can also help with debugging.
(c) Now that you have the necessary layers in rnn_layers.py, you can combine them to build an image captioning model. Implement the forward and backward passes of the model in the loss function of the RNN model, and the test-time forward pass in the sample function in rnn.py.
(d) With the RNN, run the script image_captioning.py to obtain learning curves of the training loss and the captions generated for sample images. Report the learning curves and the caption samples produced by your well-trained network.
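For part (b) above, the sketch below illustrates one way to write a masked temporal softmax loss that sums over time and averages over the minibatch. The argument order and shapes are assumptions and may not match the exact temporal_softmax_loss signature in rnn_layers.py.

```python
import numpy as np

def temporal_softmax_loss(scores, y, mask):
    """scores: (N, T, V) vocabulary scores; y: (N, T) ground-truth word indices;
    mask: (N, T) boolean, True where the position counts toward the loss.
    Sums the per-timestep losses over time and averages over the N samples."""
    N, T, V = scores.shape
    flat = scores.reshape(N * T, V)
    flat_y = y.reshape(N * T)
    flat_mask = mask.reshape(N * T)

    shifted = flat - flat.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    loss = -np.sum(flat_mask * np.log(probs[np.arange(N * T), flat_y])) / N

    dflat = probs.copy()
    dflat[np.arange(N * T), flat_y] -= 1
    dflat *= flat_mask[:, None] / N        # masked (NULL) positions get zero gradient
    return loss, dflat.reshape(N, T, V)
```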
5 [15 points] Transfer learning
In this problem, you will become more familiar with PyTorch and run experiments for two major transfer learning scenarios in transfer_learning.py.
(a) Fill in the blank in the train_model function, which is a general function for model training.
(b) Fill in the blank in the visualize_model function to briefly visualize how the trained model performs on validation images.
(c) Fill in the blank in the finetune function. Instead of random initialization, we initialize the network with a pre-trained network; the rest of the training looks as usual.
(d) Fill in the blank in the freeze function. We will freeze the weights for all of the network except those of the final fully connected layer. This last fully connected layer is replaced with a new one with random weights, and only this layer is trained. (A rough sketch of scenarios (c) and (d) appears after this list.)
(e) Run the script and report the accuracy on the validation dataset for these two scenarios.
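For parts (c) and (d) above, the snippet below sketches the two scenarios using torchvision's ResNet-18 as an example backbone; the actual model, the number of classes, and the variable names used in transfer_learning.py are assumptions here.

```python
import torch.nn as nn
from torchvision import models

# Scenario 1 (finetune): start from a pre-trained network, replace the final
# fully connected layer, and train all parameters as usual.
model_ft = models.resnet18(pretrained=True)
model_ft.fc = nn.Linear(model_ft.fc.in_features, 2)      # 2 = assumed number of classes

# Scenario 2 (freeze): freeze every pre-trained weight, then replace the final
# fully connected layer; only the new layer's randomly initialized weights train.
model_conv = models.resnet18(pretrained=True)
for param in model_conv.parameters():
    param.requires_grad = False
model_conv.fc = nn.Linear(model_conv.fc.in_features, 2)  # new layer has requires_grad=True
```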
Update log
• (Feb/27th/16:00): clarify the architecture of neural networks for Q2 and Q3 in the comments of the
starter code. In softmax.py, the architecture of the neural network is “fc – relu – fc – softmax”. In
cnn.py, the architecture of the neural network is “conv – relu – 2×2 max pool – fc – relu – fc – softmax”.
• (Mar/12th/19:00): in Q1(a), the partial gradient can be expressed in terms of “$\frac{\partial L}{\partial Y}$ and/or $\frac{\partial L}{\partial Y^{(i)}_{j}}$”.
• (Mar/12th/19:00): in Q1(a), provide the hint “you may first calculate the gradient $\frac{\partial L}{\partial W_{i,j}}$ using the formula $\frac{\partial L}{\partial W_{i,j}} = \sum_{n=1}^{N} \sum_{m=1}^{D_{out}} \frac{\partial L}{\partial Y^{(n)}_{m}} \frac{\partial Y^{(n)}_{m}}{\partial W_{i,j}}$.”
• (Mar/12th/19:00): for Q1, emphasize that “your derivation of the gradient should be included in your
written solution”.
• (Mar/15th/9:00): In the formula of valid convolution in Q1(c), $H'$ is changed to $H_b$ and $W'$ is changed to $W_b$.
Credits
Some questions are adopted or adapted from https://cs231n.stanford.edu/.