Description
1 Theory (50 pts)
1.1 Convolutional Neural Networks (15 pts)
(a) (1 pt) Given an input image of dimension 12 × 21, what will be the output dimension after applying a convolution with a 5 × 4 kernel, stride of 4, and no padding?
(b) (2 pts) Given an input of dimension C × H × W, what will be the dimension of the output of a convolutional layer with kernel of size K × K, padding P, stride S, dilation D, and F filters? Assume that H ≥ K, W ≥ K.
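If you want to sanity-check your answers to (a) and (b) empirically, a small PyTorch snippet like the sketch below prints the resulting shapes. This is only an illustrative aid; the question still asks for the closed-form answer, and the concrete values for (b) are placeholders.

```python
# Illustrative sketch only: empirically checking convolution output shapes with PyTorch.
import torch
import torch.nn as nn

# Setting from part (a): 12 x 21 input, 5 x 4 kernel, stride 4, no padding.
x = torch.randn(1, 1, 12, 21)                      # (batch, channels, H, W)
conv_a = nn.Conv2d(1, 1, kernel_size=(5, 4), stride=4, padding=0)
print(conv_a(x).shape)

# Setting from part (b); C, K, P, S, D, F below are arbitrary placeholder values.
C, H, W, K, P, S, D, F = 3, 32, 32, 3, 1, 2, 2, 8
conv_b = nn.Conv2d(C, F, kernel_size=K, padding=P, stride=S, dilation=D)
print(conv_b(torch.randn(1, C, H, W)).shape)       # compare with your formula
```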
(c) (12 pts) In this section, we are going to work with 1-dimensional convolutions. The discrete convolution of a 1-dimensional input x[n] and kernel k[n] is defined as follows:

s[n] = (x ∗ k)[n] = ∑_m x[n − m] k[m]
However, in machine learning, convolution is usually implemented as cross-correlation, which is defined as follows:

s[n] = (x ∗ k)[n] = ∑_m x[n + m] k[m]
Note the difference in signs, which will lead the network to learn a “flipped” kernel. In general it doesn’t change much, but it’s important to keep it in mind. In convolutional neural networks, the kernel k[n] is usually 0 everywhere except a few values near 0: k[n] = 0 for all |n| > M. Then, the formula becomes:
s[n] = (x ∗ k)[n] = ∑_{m=−M}^{M} x[n + m] k[m]
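As a concrete illustration of the sign difference (an optional sketch, not part of the assignment), numpy provides both operations directly:

```python
# Optional sketch: true convolution vs. the cross-correlation used in ML libraries.
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
k = np.array([1., 0., -1.])

conv = np.convolve(x, k, mode="valid")       # flips the kernel:  sum_m x[n-m] k[m]
xcorr = np.correlate(x, k, mode="valid")     # no flip:           sum_m x[n+m] k[m]
print(conv, xcorr)

# Cross-correlation equals convolution with the kernel reversed.
assert np.allclose(xcorr, np.convolve(x, k[::-1], mode="valid"))
```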
Let’s consider an input x[n] ∈ R^5, with 1 ≤ n ≤ 7, i.e. it is a length-7 sequence with 5 channels. We consider the convolutional layer f_W with one filter, with kernel size 3, stride of 2, no dilation, and no padding. The only parameter of the convolutional layer is the weight W ∈ R^{1×5×3}; there is no bias and no non-linearity.
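To make the setup concrete, here is an optional PyTorch sketch of this layer (Conv1d implements the cross-correlation defined above); it is not a substitute for the analytical answers below:

```python
# Optional sketch of the layer f_W described above.
import torch
import torch.nn as nn

x = torch.randn(1, 5, 7)                               # (batch, channels, length), 1 <= n <= 7
f_W = nn.Conv1d(in_channels=5, out_channels=1,
                kernel_size=3, stride=2, bias=False)   # weight W has shape (1, 5, 3)
print(f_W.weight.shape, f_W(x).shape)
```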
(i) (1 pt) What is the dimension of the output f_W(x)? Provide an expression for the value of the elements of the convolutional layer output f_W(x). Example answer format here and in the following sub-problems: f_W(x) ∈ R^{42×42×42}, f_W(x)[i, j, k] = 42.
(ii) (3 pts) What is the dimension of ∂f_W(x)/∂W? Provide an expression for the values of the derivative ∂f_W(x)/∂W.
(iii) (3 pts) What is the dimension of ∂f_W(x)/∂x? Provide an expression for the values of the derivative ∂f_W(x)/∂x.
(iv) (5 pts) Now, suppose you are given the gradient of the loss ℓ w.r.t. the output of the convolutional layer f_W(x), i.e. ∂ℓ/∂f_W(x). What is the dimension of ∂ℓ/∂W? Provide an expression for ∂ℓ/∂W. Explain the similarities and differences between this expression and the expression in (i).
1.2 Recurrent Neural Networks (30 pts)
1.2.1 Part 1
In this section we consider a simple recurrent neural network defined as follows:
c[t] = σ(W_c x[t] + W_h h[t−1])                        (1)
h[t] = c[t] ⊙ h[t−1] + (1 − c[t]) ⊙ W_x x[t]           (2)

where σ is the element-wise sigmoid, x[t] ∈ R^n, h[t] ∈ R^m, W_c ∈ R^{m×n}, W_h ∈ R^{m×m}, W_x ∈ R^{m×n}, ⊙ is the Hadamard product, and h[0] ≐ 0.
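For concreteness, an optional sketch of one step of this recurrence (with assumed sizes n and m and random placeholder parameters) could look as follows:

```python
# Optional sketch of one step of the recurrence in equations (1)-(2).
import torch

n, m = 3, 4
Wc, Wh, Wx = torch.randn(m, n), torch.randn(m, m), torch.randn(m, n)

def step(x_t, h_prev):
    c_t = torch.sigmoid(Wc @ x_t + Wh @ h_prev)        # eq. (1)
    h_t = c_t * h_prev + (1 - c_t) * (Wx @ x_t)        # eq. (2), Hadamard products
    return h_t

h = torch.zeros(m)                                     # h[0] = 0
for x_t in torch.randn(5, n):                          # a length-5 input sequence
    h = step(x_t, h)
print(h.shape)
```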
(a) (4 pts) Draw a diagram for this recurrent neural network, similar to the RNN diagrams we had in class. We suggest using diagrams.net.
(b) (1 pt) What is the dimension of c[t]?
(c) (5 pts) Suppose that we run the RNN to get a sequence of h[t] for t from 1 to K. Assuming we know the derivative ∂ℓ/∂h[t], provide the dimension of, and an expression for the values of, ∂ℓ/∂W_x. What are the similarities between the backward pass and the forward pass in this RNN?
(d) (2 pts) Can this network be subject to vanishing or exploding gradients? Why?
1.2.2 Part 2
We define an AttentionRNN(2) as
q_0[t], q_1[t], q_2[t] = Q_0 x[t], Q_1 h[t−1], Q_2 h[t−2]        (3)
k_0[t], k_1[t], k_2[t] = K_0 x[t], K_1 h[t−1], K_2 h[t−2]        (4)
v_0[t], v_1[t], v_2[t] = V_0 x[t], V_1 h[t−1], V_2 h[t−2]        (5)
w_i[t] = q_i[t]^⊤ k_i[t]                                         (6)
a[t] = softargmax([w_0[t], w_1[t], w_2[t]])                      (7)
h[t] = ∑_{i=0}^{2} a_i[t] v_i[t]                                 (8)

where x[t], h[t] ∈ R^n, and Q_i, K_i, V_i ∈ R^{n×n}. We define h[t] = 0 for t < 1. You may safely ignore these base cases in the following questions.
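For concreteness, an optional sketch of one AttentionRNN(2) step with random placeholder parameters (not required for the answers) could look as follows:

```python
# Optional sketch of one AttentionRNN(2) step, following equations (3)-(8).
import torch

n = 4
Q = [torch.randn(n, n) for _ in range(3)]
K = [torch.randn(n, n) for _ in range(3)]
V = [torch.randn(n, n) for _ in range(3)]

def step(x_t, h_prev1, h_prev2):
    inputs = [x_t, h_prev1, h_prev2]
    q = [Q[i] @ inputs[i] for i in range(3)]                 # eq. (3)
    k = [K[i] @ inputs[i] for i in range(3)]                 # eq. (4)
    v = [V[i] @ inputs[i] for i in range(3)]                 # eq. (5)
    w = torch.stack([q[i] @ k[i] for i in range(3)])         # eq. (6): scalar scores
    a = torch.softmax(w, dim=0)                              # eq. (7): softargmax
    return sum(a[i] * v[i] for i in range(3))                # eq. (8)

h1 = h2 = torch.zeros(n)                                     # h[t] = 0 for t < 1
for x_t in torch.randn(6, n):
    h1, h2 = step(x_t, h1, h2), h1
print(h1.shape)
```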
(a) (4 pts) Draw a diagram for this recurrent neural network.
(b) (1 pt) What is the dimension of a[t]?
(c) (3 pts) Extend this to AttentionRNN(k), a network that uses the last k state vectors h. Write out the system of equations that defines it. You may use set notation or ellipses (…) in your definition.
(d) (3 pts) Modify the above network to produce AttentionRNN(∞), a network
that uses every past state vector. Write out the system of equations that defines it. You may use set notation or ellipses (…) in your definition. HINT:
We can do this by tying together some set of parameters, e.g. weight sharing.
(e) (5 pts) Suppose the loss ℓ is computed. Please write down the expression for ∂h[t]/∂h[t−1] for AttentionRNN(2).
(f) (2 pts) Suppose we know the derivative ∂h[t]/∂h[T] and ∂ℓ/∂h[t] for all t > T. Please write down the expression for ∂ℓ/∂h[T] for AttentionRNN(k).
1.3 Debugging loss curves (5 pts)
When working with the notebook 08-seq_classification, we saw RNN training curves. In Section 8, “Visualize LSTM”, we observed some “kinks” in the loss curve.
[Figure: training curves from Section 8 of 08-seq_classification. Left panel: loss vs. epoch (Train, Test), loss ranging from 0 to about 4. Right panel: accuracy (acc) vs. epoch (Train, Test), from 0 to 100.]
1. (1 pt) What caused the spikes on the left?
2. (1 pt) How can they be higher than the initial value of the loss?
3. (1 pt) What are some ways to fix them?
4. (2 pts) Explain why the loss and accuracy are at these values before training starts. You may need to check the task definition in the notebook.
2 Implementation (50 pts + 5 pts extra credit)
There are three notebooks in the practical part:
• (25 pts) Convolutional Neural Networks notebook: hw2_cnn.ipynb
• (20 pts) Recurrent Neural Networks notebook: hw2_rnn.ipynb
• (5 pts + 5 pts extra credit): This builds on Section 1.3 of the theoretical part.
– (5 pts) Change the model training procedure of Section 8 in 08-seq_classification so that the training curves have no spikes. You should only change the training of the model, and not the model itself or the random seed.
– (5 pts extra credit) Visualize the gradients and weights throughout training, before and after you fix the training procedure (one possible logging sketch is shown after this list).
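For the extra credit, one possible (assumed) logging pattern is sketched below; `model` and the training loop come from the notebook and are only referenced here as placeholders:

```python
# One possible (assumed) way to record gradient and weight norms during training;
# `model` is the notebook's model, and log_norms is a hypothetical helper.
import torch

grad_norms, weight_norms = [], []

def log_norms(model):
    with torch.no_grad():
        g2 = sum(p.grad.norm() ** 2 for p in model.parameters() if p.grad is not None)
        w2 = sum(p.norm() ** 2 for p in model.parameters())
        grad_norms.append(float(g2) ** 0.5)    # overall gradient L2 norm
        weight_norms.append(float(w2) ** 0.5)  # overall weight L2 norm

# Call log_norms(model) right after loss.backward() in the training loop,
# then plot grad_norms and weight_norms over iterations before/after the fix.
```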
Please use your NYU Google Drive account to access the notebooks. The first two notebooks contain parts marked as TODO, where you should put your code. These notebooks are Google Colab notebooks; you should copy them to your Drive, add your solutions, and then download and submit them to NYU Brightspace. The notebook from the class, if needed, can be uploaded to Colab as well.