## Description

1. (10 points) Exercise 7.7 (e-Chap:7-11) in the book “Learning from Data”.

2. (10 points) Exercise 7.8 (e-Chap:7-15) in the book “Learning from Data”.

3. (20 points) Consider the standard residual block and the bottleneck block in the case where inputs and outputs have the same dimension (e.g., Figure 5 in [1]). In other words, the residual connection is an identity connection. For the standard residual block, compute the number of training parameters when the dimension of inputs and outputs is 128 × 16 × 16 × 32. Here, 128 is the batch size, 16 × 16 is the spatial size of the feature maps, and 32 is the number of channels. For the bottleneck block, compute the number of training parameters when the dimension of inputs and outputs is 128 × 16 × 16 × 128. Compare the two results and explain the advantages and disadvantages of the bottleneck block.
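A quick way to sanity-check counts of this kind is a small helper. The sketch below assumes bias-free convolutions, ignores any batch-normalization parameters, and uses the 4× channel reduction of the bottleneck design in [1]; the function names are illustrative. Note that the batch size never affects the parameter count.

```python
def conv_params(k, c_in, c_out, bias=False):
    """Parameters in a k x k convolution layer (no bias by default)."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def standard_block_params(c):
    # Standard residual block: two 3x3 convolutions, channels unchanged.
    return conv_params(3, c, c) + conv_params(3, c, c)

def bottleneck_block_params(c, reduction=4):
    # Bottleneck block: 1x1 reduce -> 3x3 -> 1x1 expand, as in [1].
    m = c // reduction
    return conv_params(1, c, m) + conv_params(3, m, m) + conv_params(1, m, c)

print(standard_block_params(32))     # two 3x3 convs over 32 channels
print(bottleneck_block_params(128))  # 1x1/3x3/1x1 convs over 128 -> 32 -> 128 channels
```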

4. (20 points) Using batch normalization in training requires computing the mean and variance of a tensor.


   (a) (8 points) Suppose the tensor x is the output of a fully-connected layer and we want to perform batch normalization on it. The training batch size is N and the fully-connected layer has C output nodes. Therefore, the shape of x is N × C. What are the shapes of the mean and variance computed in batch normalization, respectively?

   (b) (12 points) Now suppose the tensor x is the output of a 2D convolution and has shape N × H × W × C. What are the shapes of the mean and variance computed in batch normalization, respectively?
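As a reference point for how standard batch normalization aggregates statistics, here is a minimal numpy sketch (the shapes below are illustrative, not the ones from the problem): batch normalization averages over every axis except the channel axis.

```python
import numpy as np

# Fully-connected output: shape (N, C); statistics are taken over the batch axis.
x_fc = np.random.randn(8, 5)              # N=8, C=5
mean_fc = x_fc.mean(axis=0)
var_fc = x_fc.var(axis=0)

# 2D-convolution output: shape (N, H, W, C); statistics are taken over
# the batch and both spatial axes.
x_conv = np.random.randn(8, 4, 4, 5)      # N=8, H=W=4, C=5
mean_conv = x_conv.mean(axis=(0, 1, 2))
var_conv = x_conv.var(axis=(0, 1, 2))

print(mean_fc.shape, mean_conv.shape)
```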

5. (50 points) We investigate the back-propagation of the convolution using a simple example. In this problem, we focus on the convolution operation without any normalization or activation function. For simplicity, we consider the convolution in the 1D case. Given 1D inputs with a spatial size of 4 and 2 channels, i.e.,

   $$
   X = \begin{bmatrix} x_{11} & x_{12} & x_{13} & x_{14} \\ x_{21} & x_{22} & x_{23} & x_{24} \end{bmatrix} \in \mathbb{R}^{2 \times 4}, \tag{1}
   $$

   we perform a 1D convolution with a kernel size of 3 to produce output Y with 2 channels. No padding is involved. It is easy to see

   $$
   Y = \begin{bmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \end{bmatrix} \in \mathbb{R}^{2 \times 2}, \tag{2}
   $$

   where each row corresponds to a channel. There are 12 training parameters involved in this convolution, forming 4 different kernels of size 3:

   $$
   W_{ij} = [w^{ij}_{1}, w^{ij}_{2}, w^{ij}_{3}], \quad i = 1, 2, \quad j = 1, 2, \tag{3}
   $$

   where $W_{ij}$ scans the i-th channel of the inputs and contributes to the j-th channel of the outputs. Note that the notation here might be slightly different in that one kernel/filter here connects ONE input feature map (instead of ALL input feature maps) to ONE output feature map.

   (a) (15 points) Now we flatten X and Y to vectors as
   $$
   \tilde{X} = [x_{11}, x_{12}, x_{13}, x_{14}, x_{21}, x_{22}, x_{23}, x_{24}]^T, \quad \tilde{Y} = [y_{11}, y_{12}, y_{21}, y_{22}]^T.
   $$

   Please write the convolution in the form of a fully connected layer as $\tilde{Y} = A\tilde{X}$ using the notations above. You can assume there is no bias term.

   Hint: Note that we discussed how to view convolution layers as fully connected layers in the case of single input and output feature maps. This example asks you to extend that to the case of multiple input and output feature maps.
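The single-feature-map case from the hint can be checked numerically. The sketch below builds the banded matrix for a valid 1D convolution (cross-correlation, as is conventional in deep learning) with kernel size 3 on a length-4, single-channel input; the function name and the sample values are illustrative.

```python
import numpy as np

def conv1d_matrix(w, n):
    """Return the matrix A such that A @ x equals the valid 1D
    convolution (cross-correlation) of a length-n input x with kernel w."""
    k = len(w)
    rows = n - k + 1
    A = np.zeros((rows, n))
    for r in range(rows):
        A[r, r:r + k] = w  # each output taps a sliding window of the input
    return A

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0, 7.0])
A = conv1d_matrix(w, 4)
direct = np.array([w @ x[i:i + 3] for i in range(2)])  # sliding-window result
print(np.allclose(A @ x, direct))  # True
```

For the multi-channel problem above, the full A stacks one such band per (input channel, output channel) pair.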

   (b) (15 points) Next, for the back-propagation, assume we have already computed the gradients of the loss L with respect to $\tilde{Y}$:
   $$
   \frac{\partial L}{\partial \tilde{Y}} = \left[ \frac{\partial L}{\partial y_{11}}, \frac{\partial L}{\partial y_{12}}, \frac{\partial L}{\partial y_{21}}, \frac{\partial L}{\partial y_{22}} \right]^T, \tag{4}
   $$

   Please write the back-propagation step of the convolution in the form of $\frac{\partial L}{\partial \tilde{X}} = B \frac{\partial L}{\partial \tilde{Y}}$. Explain the relationship between $A$ and $B$.


   (c) (20 points) While the forward propagation of the convolution from X to Y can be written as $\tilde{Y} = A\tilde{X}$, can you figure out whether $\frac{\partial L}{\partial \tilde{X}} = B \frac{\partial L}{\partial \tilde{Y}}$ also corresponds to a convolution from $\frac{\partial L}{\partial Y}$ to $\frac{\partial L}{\partial X}$? If yes, write down the kernels for this convolution. If no, explain why.
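Whatever form you derive for parts (b) and (c) can be validated with a finite-difference gradient check. The sketch below compares an analytic matrix-form gradient against central differences for a toy linear loss; all names and sizes here are illustrative stand-ins, not the matrices from the problem.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at point x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))   # stand-in for a conv-as-matrix A
c = rng.standard_normal(4)        # fixed coefficients defining a toy loss
loss = lambda x: c @ (A @ x)      # L = c^T A x, so analytically dL/dx = A^T c

x = rng.standard_normal(8)
analytic = A.T @ c
print(np.allclose(numerical_grad(loss, x), analytic))  # True
```

The same check works with an actual convolution in place of the matrix product, which is a practical way to test a derived backward-pass kernel.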

6. (90 points) (Coding Task) Deep Residual Networks for CIFAR-10 Image Classification: In this assignment, you will implement advanced convolutional neural networks on CIFAR-10 using TensorFlow. In this classification task, models take a 32 × 32 image with RGB channels as input and classify the image into one of ten pre-defined classes. The “ResNet” folder provides the starter code. You must implement the model using the starter code. In this assignment, you must use a GPU.

   Requirements: Python 3.6, TensorFlow 1.10 (make sure you use this particular version and its installation instructions and documentation!), tqdm, numpy

   (a) (10 points) Download the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html) and complete “DataReader.py”. You can download any version of the dataset, but make sure you write the corresponding code in “DataReader.py” to read it.

   (b) (10 points) Implement data augmentation. To complete “ImageUtils.py”, you will implement the augmentation process for a single image using numpy. The corresponding TensorFlow functions are given.

   (c) (30 points) Complete “Network.py”. Read the required materials carefully before this step. You are asked to implement two versions of ResNet: version 1 uses original residual blocks (Figure 4(a) in [2]) and version 2 uses full pre-activation residual blocks (Figure 4(e) in [2]). In particular, for version 2, implement bottleneck blocks instead of standard residual blocks. In this step, only basic TensorFlow APIs in tf.layers and tf.nn are allowed.

   (d) (20 points) Complete “Model.py”. Note: for this step and the previous step, pay attention to how to use batch normalization.

   (e) (20 points) Tune all the hyperparameters in “main.py” and report your final test accuracy.
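For part (b), a common CIFAR-10 augmentation is a random crop after zero-padding plus a random horizontal flip. The numpy sketch below illustrates that recipe under those assumptions; the function name and the pad size of 4 are illustrative choices, not dictated by the starter code.

```python
import numpy as np

def augment(image, pad=4, rng=None):
    """Random crop (after zero-padding) and random horizontal flip
    of a single HxWxC image; output shape matches the input shape."""
    rng = rng or np.random.default_rng()
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)    # random crop offsets
    left = rng.integers(0, 2 * pad + 1)
    out = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        out = out[:, ::-1]                # horizontal flip
    return out

img = np.ones((32, 32, 3), dtype=np.float32)
print(augment(img).shape)  # (32, 32, 3)
```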
