Homework 1: Backpropagation
CSCI-GA 2572 Deep Learning, Fall 2025

The goal of homework 1 is to help you understand the common techniques used
in Deep Learning and how to update network parameters using the backpropagation algorithm.
Sub-parts 1.1 through 1.3 mainly deal with the theory of the backpropagation algorithm, whereas the remaining sub-parts test conceptual knowledge of deep learning.
For parts 1.2 and 1.3, you need to answer the questions with mathematical equations. Put all your answers in a PDF file; we will not accept any
scanned hand-written answers. It is recommended that you use LaTeX.
For part 2, you need to program in Python. It requires you to implement your
own forward and backward pass without using autograd. You need to submit
your mlp.py file for this part.
The due date of homework 1 is 23:55 ET on 09/20. Submit the following files in
a zip file named your_net_id.zip through the Brightspace course page:
• theory.pdf
• mlp.py
• gd.py
The following behaviors will result in a penalty to your final score:
1. 5% penalty for submitting your files in the wrong format (including naming the zip file, PDF file, or Python file incorrectly, or adding extra
files to the zip folder, such as the testing scripts from part 2).
2. 20% penalty for late submission within the first 24 hours. We will not
accept any late submission after the first 24 hours.
3. 20% penalty for a code submission that cannot be executed using the steps
we mentioned in part 2, so please test your code before submitting it.
1 Theory (50pt)
To answer the questions in this part, you need some basic knowledge of linear algebra
and matrix calculus. Also, you need to follow these instructions:
1. Every provided vector is treated as a column vector.
2. IMPORTANT: You need to use the numerator-layout notation for matrix
calculus. Please refer to Wikipedia for the notation. Specifically, when y is a scalar and x is a column vector, ∂y/∂x is a row vector, whereas when y is a column vector and x is a scalar, ∂y/∂x is a column vector.
3. You are only allowed to use vectors and matrices. You cannot use tensors in
any of your answers.
4. A missing transpose is considered a wrong answer.
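For example (this example is illustrative and not part of the assignment): for a scalar s = aᵀx with a, x ∈ R^n, numerator layout gives ∂s/∂x = aᵀ, a 1 × n row vector; for y = Wx with W ∈ R^{m×n}, it gives ∂y/∂x = W, an m × n matrix.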
1.1 Two-Layer Neural Nets
You are given the following neural net architecture:
Linear₁ → f → Linear₂ → g
where Linearᵢ(x) = W^{(i)}x + b^{(i)} is the i-th affine transformation, and f, g are
element-wise nonlinear activation functions. When an input x ∈ R^n is fed to the
network, the output ŷ ∈ R^K is obtained.
1.2 Regression Task
We would like to perform a regression task. We choose f(·) = 5(·)⁺ = 5 ReLU(·) and g
to be the identity function. To train this network, we choose the MSE loss function
ℓ_MSE(ŷ, y) = ∥ŷ − y∥², where y is the target output.
(a) (1pt) Name and mathematically describe the 5 programming steps you
would take to train this model with PyTorch using SGD on a single batch
of data.
(b) (4pt) For a single data point (x, y), write down all inputs and outputs of the forward pass for each layer. You may only use the variables x, y, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)} in your answer. (Note that Linearᵢ(x) = W^{(i)}x + b^{(i)}.)
(c) (6pt) Write down the gradients calculated from the backward pass. You may
only use the following variables in your answer: x, y, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, ∂ℓ/∂ŷ, ∂z₂/∂z₁, ∂ŷ/∂z₃, where z₁, z₂, z₃, ŷ are the outputs of Linear₁, f, Linear₂, g.
(d) (2pt) Show the elements of ∂z₂/∂z₁, ∂ŷ/∂z₃, and ∂ℓ/∂ŷ (be careful about the dimensionality).
1.3 Classification Task
We would like to perform a multi-class classification task, so we set f = tanh and
g = σ, the logistic sigmoid function σ(z) := (1 + exp(−z))⁻¹.
(a) (2pt + 3pt + 1pt) If you want to train this network, what do you need to
change in the equations of (b), (c), and (d), assuming we are using the same
MSE loss function?
(b) (2pt + 3pt + 1pt) Now you think you can do a better job by using a Binary Cross Entropy (BCE) loss function ℓ_BCE(ŷ, y) = −(1/K) Σ_{i=1}^{K} [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]. What do you need to change in the equations of (b), (c), and (d)?
(c) (1pt) Things are getting better. You realize that not all intermediate hidden
activations need to be binary (or a soft version of binary). You decide to use
f(·) = (·)⁺ but keep g as tanh. Explain why this choice of f can be beneficial
for training a (deeper) network.
1.4 Deriving Loss Functions
Derive the loss function for the following algorithms based on their common update rule wᵢ ← wᵢ + η(y − ŷ)xᵢ. Show the steps of the derivation given the following
inference rules (simply stating the final loss function will receive no points).
1. (4 points) Perceptron: ŷ = sign(b + Σ_{i=1}^{d} wᵢxᵢ)
2. (4 points) Adaline / Least Mean Squares: ŷ = b + Σ_{i=1}^{d} wᵢxᵢ
3. (4 points) Logistic Regression: ŷ = tanh(b + Σ_{i=1}^{d} wᵢxᵢ)
1.5 Conceptual Questions
(a) (1pt) Why is softmax actually softargmax?
(b) (3pt) Draw the computational graph defined by this function, with inputs
x, y, z ∈ R and output w ∈ R. You may use the symbols x, y, z, o, and the operators
∗, + in your solution. Be sure to use the correct shapes for symbols and
operators as shown in class.
a = x ∗ y + z
b = (x + x) ∗ a
w = a ∗ b
(c) (2pt) Draw the graphs of the derivatives of the following functions:
• ReLU()
• LeakyReLU(negative_slope=0.01)
• Softplus(beta=1)
• GELU()
(d) (3pt) What are 4 different types of linear transformations? What are the
roles of linear transformations and nonlinear transformations in a neural
network?
(e) (3pt) Given a neural network F parameterized by parameters θ, denoted
F_θ, a dataset D = {x_1, x_2, …, x_N}, and labels Y = {y_1, y_2, …, y_N}, write down the
mathematical definition of training a neural network with the MSE loss
function ℓ_MSE(ŷ, y) = ∥ŷ − y∥².
2 Implementation (50pt)
2.1 Backpropagation (35pt)
You need to implement the forward pass and backward pass for Linear, ReLU,
Sigmoid, MSE loss, and BCE loss in the attached mlp.py file. We provide three
example test cases: test1.py, test2.py, and test3.py. We will test your implementation with other hidden test cases, so please create your own test cases to make
sure your implementation is correct (one way to do this is sketched after the extra instructions below).
Recommendation: go through the PyTorch tutorial on Tensors to get a thorough understanding of them.
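To make the "without autograd" requirement concrete, here is a minimal sketch of a hand-written forward and backward pass for a single linear layer. The class name and interface below are illustrative assumptions and do not match the skeleton in the attached mlp.py, which you should follow for your submission.

    import torch

    class LinearSketch:
        """Illustrative linear layer y = x W^T + b with hand-written gradients."""

        def __init__(self, in_features, out_features):
            # Small random parameters; the initialization here is arbitrary.
            self.W = 0.1 * torch.randn(out_features, in_features)
            self.b = torch.zeros(out_features)
            self.cache = None

        def forward(self, x):
            # x has shape (batch, in_features); save it for the backward pass.
            self.cache = x
            return x @ self.W.T + self.b

        def backward(self, grad_output):
            # grad_output = dL/dy, shape (batch, out_features).
            x = self.cache
            self.dW = grad_output.T @ x        # dL/dW, shape (out_features, in_features)
            self.db = grad_output.sum(dim=0)   # dL/db, shape (out_features,)
            return grad_output @ self.W        # dL/dx, shape (batch, in_features)

Chaining several layers then amounts to calling each layer's backward in reverse order, passing along the returned dL/dx.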
Extra instructions:
1. Please use Python version ≥ 3.7 and PyTorch version 1.7.1. We recommend
using Miniconda to manage your virtual environment.
2. We will put your mlp.py file in the same directory as the hidden test
scripts and use the command python hiddenTestScriptName.py to check
your implementation. So please make sure the file name is mlp.py and that it
can be executed with the example test scripts we provided.
3. You are not allowed to use PyTorch autograd functionality in your implementation.
4. Be careful about the dimensionality of vectors and matrices in PyTorch. It
does not necessarily follow the math you derived in part 1.
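One way to build your own test case: autograd is forbidden inside mlp.py, but nothing stops you from using it in a separate test script to verify your hand-written gradients. The following self-contained sketch checks the linear-layer gradient formulas against torch.autograd on random inputs; adapt the idea to whatever interface mlp.py actually exposes.

    import torch

    def check_linear_gradients(batch=4, n_in=3, n_out=2, tol=1e-5):
        # Random parameters, input, and an arbitrary upstream gradient dL/dy.
        W = torch.randn(n_out, n_in)
        b = torch.randn(n_out)
        x = torch.randn(batch, n_in)
        grad_out = torch.randn(batch, n_out)

        # Hand-written gradients for y = x W^T + b.
        dW_manual = grad_out.T @ x
        db_manual = grad_out.sum(dim=0)
        dx_manual = grad_out @ W

        # Reference gradients from autograd on the same computation.
        x_ref = x.clone().requires_grad_(True)
        W_ref = W.clone().requires_grad_(True)
        b_ref = b.clone().requires_grad_(True)
        y = x_ref @ W_ref.T + b_ref
        y.backward(grad_out)  # inject the same upstream gradient

        assert torch.allclose(dx_manual, x_ref.grad, atol=tol)
        assert torch.allclose(dW_manual, W_ref.grad, atol=tol)
        assert torch.allclose(db_manual, b_ref.grad, atol=tol)
        print("manual gradients match autograd")

    check_linear_gradients()

A check like this catches shape and transposition mistakes early, which is exactly what the dimensionality caveat above warns about.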
2.2 Gradient Descent (15pt + 5pt)
In DeepDream, the paper claims that you can follow the gradient to maximize an
energy with respect to the input in order to visualize the input (see Simonyan et al., 2013, and Mordvintsev, Olah, and Tyka, 2015). We provide some
code to do this. Given an image classifier, implement a function that performs optimization on the input (the image) to find the image that most highly represents
the class. You will need to implement the gradient_descent function in gd.py.
You will be graded on how well the model optimizes the input with respect to the
labels.
Extra hints:
1. We try to minimize the energy of the class, i.e., maximize the class logit.
Make sure you are following the gradient in the right direction (a rough sketch of the update loop follows this list).
2. A reasonable starting learning rate to try is 0.01, but depending on your
implementation, make sure to sweep across a few orders of magnitude.
3. Make sure you use normalize_and_jitter, since the neural network expects a normalized input. Jittering produces more visually pleasing results.
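For reference, the optimization loop might look roughly like the sketch below. The function name, signature, and the choice to skip normalize_and_jitter are assumptions made to keep the example self-contained; the actual gd.py skeleton defines the interface you must implement.

    import torch

    def gradient_ascent_sketch(model, image, target_class, steps=200, lr=0.01):
        """Adjust the input image so the model's logit for target_class increases."""
        image = image.detach().clone().requires_grad_(True)
        for _ in range(steps):
            # In the assignment, pass the image through normalize_and_jitter before
            # the model; it is omitted here so the sketch stands on its own.
            logits = model(image)
            score = logits[0, target_class]   # the class logit we want to maximize
            model.zero_grad()
            if image.grad is not None:
                image.grad.zero_()
            score.backward()                  # gradient of the logit w.r.t. the pixels
            with torch.no_grad():
                image += lr * image.grad      # step *up* the gradient (ascent)
                image.clamp_(0, 1)            # keep pixels in a displayable range
        return image.detach()

The clamp to [0, 1] anticipates one of the extra-credit tricks below; the essential part is that the update moves along, not against, the gradient of the class logit.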
You may notice that the images you generate are very messy and full of
high-frequency noise. Extra credit (5 points) can be earned by generating visually
pleasing images and experimenting with visualizing the middle layers of the
network. There are some tricks to this:
1. Blur the image at each iteration, which reduces high-frequency noise.
2. Clamp the pixel values between 0 and 1.
3. Implement weight decay.
4. Blur the gradients at each iteration.
5. Implement gradient descent at multiple scales, scaling up every so often.
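As one concrete way to apply trick 1 (or trick 4, applied to the gradients instead of the image), the image can be lightly Gaussian-blurred between optimization steps. The snippet below assumes torchvision is available; the kernel size and sigma are arbitrary starting points.

    import torch
    import torchvision.transforms as T

    # Gaussian blur to soften high-frequency noise between optimization steps.
    blur = T.GaussianBlur(kernel_size=5, sigma=1.0)

    image = torch.rand(1, 3, 224, 224)   # stand-in for the image being optimized
    smoothed = blur(image)               # apply to the image (or to image.grad, for trick 4)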