Description
1. [Backpropagation] We want to train a simple deep neural network $f_w(x)$ with $w = (w_1, w_2, w_3)^\top \in \mathbb{R}^3$ and $x \in \mathbb{R}$, defined as:
$$f_w(x) := w_3\,\sigma_2(w_2\,\sigma_1(w_1 x)),$$
where $\sigma_1(u) = \sigma_2(u) = \frac{1}{1+\exp(-u)}$, i.e., the sigmoid activation. You may denote $x_1 := w_1 x$ and $x_2 := w_2\,\sigma_1(x_1)$ for notational convenience.
(a) [1pt] Illustrate a directed acyclic graph corresponding to the computation of $f_w(x)$.
(b) [2pt] Compute $\frac{\partial \sigma_1}{\partial u}$ and provide the answer in two different forms: (i) using only $u$ and exponential functions; and (ii) using only $\sigma_1(u)$.
(c) [2pt] Describe briefly what is meant by a forward pass and a backward pass.
(d) [2pt] Compute $\frac{\partial f_w}{\partial w_3}$.
(e) [4pt] Compute $\frac{\partial f_w}{\partial w_2}$ and $\frac{\partial f_w}{\partial w_1}$ using the second option in Problem 1b.
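Chain-rule derivatives of this form are easy to sanity-check with finite differences. A minimal sketch (names like `grad_f` are my assumptions) that hand-codes the gradients using $\sigma' = \sigma(1-\sigma)$:

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def f(w, x):
    w1, w2, w3 = w
    return w3 * sigmoid(w2 * sigmoid(w1 * x))

def grad_f(w, x):
    # hand-derived chain-rule gradients, using sigma' = sigma * (1 - sigma)
    w1, w2, w3 = w
    a1 = sigmoid(w1 * x)        # sigma1(x1)
    a2 = sigmoid(w2 * a1)       # sigma2(x2)
    df_dw3 = a2
    df_dw2 = w3 * a2 * (1 - a2) * a1
    df_dw1 = w3 * a2 * (1 - a2) * w2 * a1 * (1 - a1) * x
    return [df_dw1, df_dw2, df_dw3]

# finite-difference check of all three partials
w, x, h = [0.3, -0.5, 0.8], 1.2, 1e-6
g = grad_f(w, x)
for i in range(3):
    wp = list(w); wp[i] += h
    wm = list(w); wm[i] -= h
    fd = (f(wp, x) - f(wm, x)) / (2 * h)
    assert abs(fd - g[i]) < 1e-6
```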
(f) [4pt] Consider the mean squared error (MSE) $L(w)$ for a given $D = \{(x_1, y_1), (x_2, y_2)\}$:
$$L(w) := \sum_{(x,y)\in D} (y - f_w(x))^2.$$
Derive a gradient descent step with learning rate $\alpha > 0$ to train $w$ with the above MSE as loss function, using the second option in Problem 1b.
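A derived step of the form $w \leftarrow w - \alpha \nabla L(w)$, with $\nabla L(w) = \sum_{(x,y)\in D} -2\,(y - f_w(x))\,\nabla f_w(x)$, can be sketched as follows (a sketch only; `f_and_grad` and `gd_step` are hypothetical helper names):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def f_and_grad(w, x):
    # forward pass plus hand-derived gradient of f_w(x)
    w1, w2, w3 = w
    a1 = sigmoid(w1 * x)
    a2 = sigmoid(w2 * a1)
    s2 = a2 * (1 - a2)          # sigma2'(x2) via sigma * (1 - sigma)
    grad = [w3 * s2 * w2 * a1 * (1 - a1) * x,   # df/dw1
            w3 * s2 * a1,                        # df/dw2
            a2]                                  # df/dw3
    return w3 * a2, grad

def gd_step(w, data, alpha):
    # one gradient-descent step on L(w) = sum_{(x,y)} (y - f_w(x))^2
    g = [0.0, 0.0, 0.0]
    for x, y in data:
        fx, gf = f_and_grad(w, x)
        for i in range(3):
            g[i] += -2.0 * (y - fx) * gf[i]      # dL/dw_i contribution
    return [w[i] - alpha * g[i] for i in range(3)]

# with a small learning rate, one step should not increase the loss
data = [(0.5, 0.2), (-1.0, 0.7)]
w0 = [0.1, 0.2, 0.3]
loss = lambda w: sum((y - f_and_grad(w, x)[0]) ** 2 for x, y in data)
assert loss(gd_step(w0, data, alpha=0.01)) <= loss(w0)
```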
2. [MNIST] We want to train a convolutional neural net for 10-class classification of MNIST images, which are of size 28 × 28. The MNIST dataset is a collection of handwritten digits from 0 to 9, commonly used for training and testing in the field of machine learning [1]. To begin with, you need to install pytorch and torchvision [2]. It would be helpful to check the pytorch documentation [3] while solving this problem.
(a) [2pt] As a first layer, we use a 2D convolution with 20 output channels, a filter size of 5 × 5, a stride of 1, and a padding of 0. What is the output dimension after this layer? Subsequently, we apply max-pooling with a size of 2 × 2. What is the output dimension after this layer?
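The spatial dimensions follow the standard output-size formula, $\lfloor (n + 2p - k)/s \rfloor + 1$. A quick arithmetic sketch (`conv_out` is a hypothetical helper, not part of the provided code):

```python
def conv_out(size, kernel, stride=1, padding=0):
    # standard convolution/pooling output-size formula
    return (size + 2 * padding - kernel) // stride + 1

# first pair: 5x5 conv (stride 1, padding 0), then 2x2 max-pool (stride 2)
after_conv = conv_out(28, 5)             # 28 -> 24
after_pool = conv_out(after_conv, 2, 2)  # 24 -> 12
assert (after_conv, after_pool) == (24, 12)
```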
(b) [4pt] After the pair of convolution and pooling layers designed in Problem 2a,
we want to use a second pair (convolution + max-pooling). The max-pooling
operation has a filter size of 2 × 2. The desired output should have 50 channels
and should be of size 4 × 4. What are the filter size, the stride, and the channel
dimension of the second convolution operation, assuming that padding is 0?
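Working backwards from the required 4 × 4 output, the constraints can be checked with the same output-size formula (a sketch; `conv_out` is a hypothetical helper):

```python
def conv_out(size, kernel, stride=1, padding=0):
    # standard convolution/pooling output-size formula
    return (size + 2 * padding - kernel) // stride + 1

# the input to the second pair is 12x12 (from Problem 2a); a 5x5 conv with
# stride 1 and padding 0 gives 8x8, and 2x2 max-pooling then gives 4x4
assert conv_out(12, 5, 1) == 8
assert conv_out(8, 2, 2) == 4
```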
(c) [7pt] Complete DeepMNIST.py by implementing the neural network model described in what follows. (Provide your entire code pertaining to the "Net" class here.) Given an image x, we apply (i) the first convolution-pooling pair in Problem 2a, with ReLU activations after the convolution; (ii) the second convolution-pooling pair in Problem 2b, with ReLU activations after the convolution; and (iii) two linear layers such that the first one, with ReLU activations, maps from a 50 × 4 × 4 dimensional space to a 500 dimensional one, and the second one, with no activations, maps from a 500 dimensional space to a 10 dimensional one as a score vector on the 10 classes. What is the best test set accuracy that you observed during training with this architecture? How many parameters does your network have (including biases)?
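The architecture in (i)–(iii) can be sketched in PyTorch as follows. This is a minimal sketch under my own naming conventions; the layer names and use of `torch.nn.functional` are assumptions, not taken from the provided DeepMNIST.py skeleton:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # (i)-(ii): two convolution layers as specified in Problems 2a and 2b
        self.conv1 = nn.Conv2d(1, 20, kernel_size=5, stride=1, padding=0)
        self.conv2 = nn.Conv2d(20, 50, kernel_size=5, stride=1, padding=0)
        # (iii): two linear layers, 50*4*4 -> 500 -> 10
        self.fc1 = nn.Linear(50 * 4 * 4, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 28x28 -> 24x24 -> 12x12
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 12x12 -> 8x8 -> 4x4
        x = x.flatten(1)                            # -> 50*4*4 = 800 features
        x = F.relu(self.fc1(x))
        return self.fc2(x)                          # score vector on 10 classes

net = Net()
n_params = sum(p.numel() for p in net.parameters())
out = net(torch.zeros(1, 1, 28, 28))
assert out.shape == (1, 10)
```

The parameter count (including biases) can be read off per layer as `numel(weight) + numel(bias)` and summed, which is a useful cross-check against the hand count.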
[1] https://yann.lecun.com/exdb/mnist/
[2] https://pytorch.org/get-started/locally/
[3] https://pytorch.org/docs/stable/index.html