Description
Clarifications
All the modifications to the original assignment will be marked with red in the main body of the
assignment. A complete list of the modifications can be found below:
• Question 1: There was a missing square in the first term of the loss function. This missing
square was added.
• Question 1: The target variable t was referred to as Y once – this was corrected.
• Question 2: The y in the generalization error is supposed to be the noisy target rather than
y∗(x), but this should not affect how you solve the question, since the right-hand side (the
bias and variance terms) was correct.
• Question 3: The regression coefficients w1 and w2 were mistakenly referred to as alphas
in the hint of Q3.1.1. This was corrected.
• Question 4: A new hint was added.
1 https://markus.teach.cs.toronto.edu/csc413-2020-01
2 https://csc413-2020.github.io/assets/misc/syllabus.pdf
3 https://piazza.com/class/k58ktbdnt0h1wx?cid=1
1 Weight Decay
Here, we will develop further intuition on how adding weight decay can influence the solution
space. For a refresher on generalization, please refer to: https://csc413-2020.github.io/assets/readings/L07.pdf. Consider the following linear regression model with weight decay.
\[
\mathcal{J}(\hat{\mathbf{w}}) = \frac{1}{2n}\,\|X\hat{\mathbf{w}} - \mathbf{t}\|_2^2 + \frac{\lambda}{2}\,\hat{\mathbf{w}}^{\top}\hat{\mathbf{w}}
\]
where $X \in \mathbb{R}^{n \times d}$, $\mathbf{t} \in \mathbb{R}^{n}$, and $\hat{\mathbf{w}} \in \mathbb{R}^{d}$. Here $n$ is the number of data points and $d$ is the data dimension. $X$ is the design matrix from HW1.
1.1 Underparameterized Model [0pt]
First consider the underparameterized d ≤ n case. Write down the solution obtained by gradient
descent assuming training converges. Is the solution unique? If the solution involves inverting
matrices, explain why it is invertible.
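(Not required.) If you want to sanity-check whatever closed form you derive, one option is a numerical comparison against gradient descent. The sketch below assumes, as a candidate, the minimizer obtained by setting the gradient of J to zero, namely ŵ = (XᵀX + nλI)⁻¹Xᵀt; the data and hyperparameters are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 5, 0.1                 # underparameterized: d <= n
X = rng.standard_normal((n, d))
t = rng.standard_normal(n)

# Candidate closed form from setting (1/n) * X^T (Xw - t) + lam * w = 0.
w_closed = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ t)

# Plain gradient descent on the same objective.
w = np.zeros(d)
lr = 0.05
for _ in range(20000):
    w -= lr * (X.T @ (X @ w - t) / n + lam * w)

print(np.allclose(w, w_closed, atol=1e-6))  # expect True at convergence
```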
1.2 Overparameterized Model
1.2.1 Warmup: Visualizing Weight Decay [1pt]
Now consider the overparameterized case $d > n$. We start with a 2D example from HW1: a
single training example with $\mathbf{x}_1 = [2, 1]$ and $t_1 = 2$. First, 1) draw the solution space of the squared
error on a 2D plane. Then, 2) draw the contour plot of the weight decay term $\frac{\lambda}{2}\hat{\mathbf{w}}^{\top}\hat{\mathbf{w}}$.
Include the plot in the report. Also indicate on the plot where the gradient descent solutions
are with and without weight decay. (Precise drawings are not required for the full mark.)
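(Not required.) If you would like to check your hand drawing, a minimal matplotlib sketch along the following lines (the grid range and the value of λ are arbitrary choices, not specified by the assignment) plots the zero-error line 2ŵ₁ + ŵ₂ = 2 together with the circular contours of the weight decay term.

```python
import numpy as np
import matplotlib.pyplot as plt

lam = 1.0  # arbitrary value, only for visualization
w1, w2 = np.meshgrid(np.linspace(-1, 3, 200), np.linspace(-1, 3, 200))

# Contours of the weight decay term (lambda/2) * ||w||^2: circles centered at the origin.
plt.contour(w1, w2, 0.5 * lam * (w1**2 + w2**2), levels=10, cmap="viridis")

# Solution space of the squared error for x1 = [2, 1], t1 = 2: the line 2*w1 + w2 = 2.
w1_line = np.linspace(-1, 3, 200)
plt.plot(w1_line, 2 - 2 * w1_line, "r", label="zero squared error: 2*w1 + w2 = 2")

plt.xlabel("w1"); plt.ylabel("w2"); plt.legend(); plt.gca().set_aspect("equal")
plt.show()
```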
1.2.2 Gradient Descent and Weight Decay [0pt]
Derive the solution obtained by gradient descent at convergence in the overparameterized case. Is
this the same solution as in Homework 1, Question 3.4.1?
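(Not required.) A quick numerical contrast on the 2D example above: the sketch below compares the minimum-norm least-squares solution (via the pseudoinverse, as in HW1) with the solution gradient descent reaches on the weight-decay objective; whether these agree is exactly what the question asks you to argue. The step size, step count, and λ are arbitrary illustrative choices.

```python
import numpy as np

X = np.array([[2.0, 1.0]])   # single example x1 = [2, 1]
t = np.array([2.0])
lam = 0.1                    # arbitrary illustrative value

# Minimum-norm least-squares solution (no weight decay), via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ t

# Gradient descent on the weight-decay objective, started from zero.
w = np.zeros(2)
for _ in range(50000):
    w -= 0.05 * (X.T @ (X @ w - t) / len(t) + lam * w)

print("min-norm solution:    ", w_min_norm)
print("weight-decay solution:", w)
```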
1.3 Adaptive Optimizers and Weight Decay [1pt]
In HW2 Section 1.2, we saw that per-parameter adaptive methods, such as AdaGrad and Adam, do
not converge to the least-norm solution because they move out of the row space of our design matrix
X.
Assume AdaGrad converges to an optimum of the training objective. Does weight decay help
AdaGrad converge to a solution in the row space? Give a brief justification.
(Hint: build intuition from the 2-D toy example.)
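(Not required.) To build that intuition concretely, the minimal AdaGrad sketch below runs on the same 2D toy example, with and without weight decay; the learning rate, number of steps, and λ are arbitrary illustrative choices. A solution in the row space of X is a scalar multiple of [2, 1], so the printed ratio w1/w2 indicates where each run ends up relative to the row space. The justification the question asks for should still be analytical.

```python
import numpy as np

X = np.array([[2.0, 1.0]])
t = np.array([2.0])

def adagrad(lam, steps=20000, lr=0.5, eps=1e-8):
    w = np.zeros(2)
    G = np.zeros(2)                        # running sum of squared gradients
    for _ in range(steps):
        g = X.T @ (X @ w - t) / len(t) + lam * w
        G += g**2
        w -= lr * g / (np.sqrt(G) + eps)   # per-parameter step sizes
    return w

for lam in (0.0, 0.1):
    w = adagrad(lam)
    print(f"lambda={lam}: w={w}, w1/w2={w[0] / w[1]:.3f}  (row-space ratio would be 2.0)")
```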
2 Ensembles and Bias-variance Decomposition
In the prerequisite CSC311 https://amfarahmand.github.io/csc311/lectures/lec04.pdf, we
have seen the bias-variance decomposition. The following question uses the same notation as taught