Description
1. On initialization.

Consider a 2-layer network
\[
  f(x; W, v) = \sum_{j=1}^{m} v_j \, \sigma\bigl(\langle w_j, x \rangle\bigr),
\]
where $x \in \mathbb{R}^d$, $W \in \mathbb{R}^{m \times d}$ with rows $w_j^\top$, and $v \in \mathbb{R}^m$. For simplicity, the network has a single output, and bias terms are omitted.

Given a data example $(x, y)$ and a loss function $\ell$, consider the empirical risk
\[
  \widehat{R}(W, v) = \ell\bigl(f(x; W, v), y\bigr).
\]
Only a single data example will be considered in this problem; the same analysis extends to multiple examples by taking averages.
(a) [hw3] For each $1 \le j \le m$, derive $\partial \widehat{R}/\partial v_j$ and $\partial \widehat{R}/\partial w_j$.
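Although not part of the required solution, one way to sanity-check the hand-derived formulas is to compare them against automatic differentiation for a concrete choice of loss. The sketch below assumes the squared loss $\ell(\hat{y}, y) = \tfrac12(\hat{y} - y)^2$ and the ReLU activation; these choices, and all numerical values, are illustrative assumptions only.

# Minimal numerical check of the gradients in (a), assuming squared loss and ReLU;
# these concrete choices are illustrative assumptions, not part of the problem.
import torch

m, d = 5, 3
torch.manual_seed(0)
W = torch.randn(m, d, requires_grad=True)
v = torch.randn(m, requires_grad=True)
x = torch.randn(d)
y = torch.tensor(1.0)

sigma = torch.relu                      # activation
pre = W @ x                             # pre-activations <w_j, x>
f = v @ sigma(pre)                      # f(x; W, v) = sum_j v_j * sigma(<w_j, x>)
R = 0.5 * (f - y) ** 2                  # empirical risk for a single example
R.backward()

# Hand-derived gradients for this choice of loss and activation:
#   dR/dv_j = (f - y) * sigma(<w_j, x>)
#   dR/dw_j = (f - y) * v_j * sigma'(<w_j, x>) * x
grad_v = (f - y) * sigma(pre)
grad_W = ((f - y) * v * (pre > 0).float()).unsqueeze(1) * x

print(torch.allclose(v.grad, grad_v.detach()), torch.allclose(W.grad, grad_W.detach()))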
(b) [hw3] Consider gradient descent which starts from some $W^{(0)}$ and $v^{(0)}$, and at step $t \ge 0$, updates the weights for each $1 \le j \le m$ as follows:
\[
  w_j^{(t+1)} = w_j^{(t)} - \eta \, \frac{\partial \widehat{R}}{\partial w_j^{(t)}},
  \qquad\text{and}\qquad
  v_j^{(t+1)} = v_j^{(t)} - \eta \, \frac{\partial \widehat{R}}{\partial v_j^{(t)}}.
\]
Suppose there exist two hidden units $p, q \in \{1, 2, \ldots, m\}$ such that $w_p^{(0)} = w_q^{(0)}$ and $v_p^{(0)} = v_q^{(0)}$. Prove by induction that for any step $t \ge 0$, it holds that $w_p^{(t)} = w_q^{(t)}$ and $v_p^{(t)} = v_q^{(t)}$.
Remark: as a result, if the neural network is initialized symmetrically, then such a symmetry may
persist during gradient descent, and thus the representation power of the network will be limited.
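As an aside, this symmetry is easy to observe numerically. The sketch below uses a squared loss, ReLU activation, and made-up data (none of which are specified by the problem) and checks that two hidden units initialized identically (rows 0 and 1 in the code) remain identical throughout gradient descent.

# Illustration of part (b): two units that start identical stay identical under GD.
# The loss, data, width, and step size below are arbitrary assumptions for the demo.
import torch

m, d, eta = 4, 3, 0.1
torch.manual_seed(0)
W = torch.randn(m, d)
v = torch.randn(m)
W[1] = W[0]          # w_q^(0) = w_p^(0)
v[1] = v[0]          # v_q^(0) = v_p^(0)
W.requires_grad_(), v.requires_grad_()
x, y = torch.randn(d), torch.tensor(1.0)

for t in range(100):
    R = 0.5 * (v @ torch.relu(W @ x) - y) ** 2
    gW, gv = torch.autograd.grad(R, (W, v))
    with torch.no_grad():
        W -= eta * gW
        v -= eta * gv

print(torch.allclose(W[0], W[1]), torch.allclose(v[0], v[1]))  # True True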
(c) [hw3] Random initialization is a good way to break symmetry. Moreover, proper random initialization also preserves the squared norm of the input, as formalized below.

First consider the identity activation $\sigma(z) = z$. For each $1 \le j \le m$ and $1 \le k \le d$, initialize $w_{j,k}^{(0)} \sim \mathcal{N}(0, 1/m)$ (i.e., normal distribution with mean $\mu = 0$ and variance $\sigma^2 = 1/m$). Prove that
\[
  \mathbb{E}\Bigl[\bigl\| W^{(0)} x \bigr\|_2^2\Bigr] = \|x\|_2^2.
\]
Remark: This is similar to torch.nn.init.kaiming_normal_().
Next consider the ReLU activation $\sigma_r(z) = \max\{0, z\}$. For each $1 \le j \le m$ and $1 \le k \le d$, initialize $w_{j,k}^{(0)} \sim \mathcal{N}(0, 2/m)$. Prove that
\[
  \mathbb{E}\Bigl[\bigl\| \sigma_r\bigl(W^{(0)} x\bigr) \bigr\|_2^2\Bigr] = \|x\|_2^2.
\]
Hint: linear combinations of Gaussians are again Gaussian! For the second part (with ReLU), consider
the symmetry of a Gaussian around 0.
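Although not a substitute for the proofs, both identities can be checked by Monte Carlo simulation; the dimensions and number of trials in the sketch below are arbitrary choices.

# Monte Carlo sanity check of the two initialization claims in (c).
# The dimensions and number of trials are arbitrary choices for illustration.
import torch

d, m, trials = 20, 50, 20000
x = torch.randn(d)
sq_norm = (x ** 2).sum()

# Identity activation with w_{j,k} ~ N(0, 1/m): E ||W x||^2 = ||x||^2.
W = torch.randn(trials, m, d) / m ** 0.5
est_identity = ((W @ x) ** 2).sum(dim=1).mean()

# ReLU activation with w_{j,k} ~ N(0, 2/m): E ||relu(W x)||^2 = ||x||^2.
W = torch.randn(trials, m, d) * (2.0 / m) ** 0.5
est_relu = (torch.relu(W @ x) ** 2).sum(dim=1).mean()

print(sq_norm.item(), est_identity.item(), est_relu.item())  # all roughly equal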
Solution.
2. ResNet.
In this problem, you will implement a simplified ResNet. You do not need to change arguments which are
not mentioned here (but you of course could try and see what happens).
(a) [hw3code] Implement a class Block, which is a building block of ResNet. It is described in Figure 2 of (He et al., 2016).
The input to Block is of shape (N, C, H, W), where N denotes the batch size, C denotes the number of
channels, and H and W are the height and width of each channel. For each data example x with shape
(C, H, W), the output of Block is
\[
  \mathrm{Block}(x) = \sigma_r\bigl(x + f(x)\bigr),
\]
where $\sigma_r$ denotes the ReLU activation, and f(x) also has shape (C, H, W) and thus can be added to x.
In detail, f contains the following layers.
i. A Conv2d with C input channels, C output channels, kernel size 3, stride 1, padding 1, and no bias
term.
ii. A BatchNorm2d with C features.
iii. A ReLU layer.
iv. Another Conv2d with the same arguments as i above.
v. Another BatchNorm2d with C features.
Because 3 × 3 kernels and padding 1 are used, the convolutional layers do not change the shape of each channel. Moreover, the number of channels is also kept unchanged. Therefore f(x) does have the same shape as x.
Also, implement the option to use SiLU instead of ReLU, and LayerNorm instead of BatchNorm2d.
Additional instructions are given in the docstrings in hw3.py.
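For orientation only, here is a rough sketch of such a block in PyTorch, assuming the plain ReLU/BatchNorm2d configuration; the exact interface (constructor arguments, the SiLU/LayerNorm options, etc.) is specified by the docstrings in hw3.py and may differ from this sketch.

# A rough sketch of a residual block matching the description in (a); the
# ReLU/BatchNorm2d configuration is assumed, and the interface required by
# hw3.py may differ.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, num_channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(num_channels, num_channels, kernel_size=3, stride=1,
                      padding=1, bias=False),
            nn.BatchNorm2d(num_channels),
            nn.ReLU(),
            nn.Conv2d(num_channels, num_channels, kernel_size=3, stride=1,
                      padding=1, bias=False),
            nn.BatchNorm2d(num_channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Block(x) = relu(x + f(x)); x and f(x) have the same shape, so the sum is valid.
        return self.relu(x + self.f(x))

For example, Block(16) maps a tensor of shape (8, 16, 28, 28) to a tensor of the same shape.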
(b) [hw3] Explain why a Conv2d layer does not need a bias term if it is followed by a BatchNorm2d layer.
(c) [hw3code] Implement a (shallow) ResNet that consists of the following parts:
i. A Conv2d with 1 input channel, C output channels, kernel size 3, stride 2, padding 1, and no bias
term.
ii. A BatchNorm2d with C features.
iii. A ReLU layer.
iv. A MaxPool2d with kernel size 2.
v. A Block with C channels.
vi. An AdaptiveAvgPool2d which for each channel takes the average of all elements.
vii. A Linear with C inputs and 10 outputs.
Also, implement the option to use SiLU instead of ReLU, and LayerNorm instead of BatchNorm2d.
Additional instructions are given in the docstrings in hw3.py.
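Again for orientation only, here is a rough sketch of how the parts in (c) might be assembled, assuming ReLU/BatchNorm2d and the Block sketch from (a); the interface required by hw3.py may differ.

# A rough sketch of the shallow ResNet in (c), assuming ReLU/BatchNorm2d and a
# Block class as in the sketch for (a); hw3.py may require a different interface.
import torch
import torch.nn as nn

class ResNet(nn.Module):
    def __init__(self, num_channels: int, num_classes: int = 10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, num_channels, kernel_size=3, stride=2, padding=1,
                      bias=False),
            nn.BatchNorm2d(num_channels),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            Block(num_channels),           # residual block from part (a)
            nn.AdaptiveAvgPool2d(1),       # average over each channel
            nn.Flatten(),                  # (N, C, 1, 1) -> (N, C)
            nn.Linear(num_channels, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

For instance, on a batch of 8x8 single-channel images, ResNet(16)(torch.randn(4, 1, 8, 8)) has shape (4, 10).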
(d) [hw3code] Implement fit and validate for use in the next part. Please do not shuffle the inputs when batching in this part! The utility function loss_batch will be useful. See the docstrings in hw3.py and hw3_utils.py for details.
Remark: be careful to invoke net.train() and net.eval() in the correct places.
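The skeleton below illustrates one common placement of net.train() and net.eval(); the function signatures are illustrative assumptions and need not match those required by hw3.py (in particular, loss_batch is not used here).

# Skeleton showing typical net.train()/net.eval() placement; the signatures are
# illustrative assumptions and need not match those required by hw3.py.
import torch

def fit(net, loss_fn, optimizer, train_batches):
    net.train()                      # enable batch-norm updates (and dropout, if any)
    total, count = 0.0, 0
    for xb, yb in train_batches:
        optimizer.zero_grad()
        loss = loss_fn(net(xb), yb)
        loss.backward()
        optimizer.step()
        total, count = total + loss.item() * len(xb), count + len(xb)
    return total / count

@torch.no_grad()
def validate(net, loss_fn, val_batches):
    net.eval()                       # freeze batch-norm statistics, no dropout
    total, count = 0.0, 0
    for xb, yb in val_batches:
        total += loss_fn(net(xb), yb).item() * len(xb)
        count += len(xb)
    return total / count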
(e) [hw3] Using fit and validate, train a ResNet with 16 channels on the data given by hw3_utils.torch_digits(), using the cross entropy loss and SGD with learning rate 0.005 and batch size 16, for 30 epochs. Plot the training and validation cross entropy losses versus the epoch. Since results vary somewhat due to random initialization, do 3 runs and produce 3 plots. Repeat this for each combination of ReLU/SiLU and BatchNorm2d/LayerNorm, for a total of 12 plots. Include these 12 plots in your written submission. Do you notice any significant differences/improvements between the different combinations of activation functions and normalization layers? Include at least one observation in your written submission.
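One possible driver for a single run of this experiment is sketched below; the fit/validate signatures, the ResNet constructor, and the return format of hw3_utils.torch_digits() are all assumptions here and may not match the actual course code.

# One possible driver for a single run of (e); the fit/validate signatures and
# the return format of hw3_utils.torch_digits() are assumptions and may differ.
import torch
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader

# Assumed (hypothetical): torch_digits() returns (train_dataset, val_dataset),
# and ResNet/fit/validate are as in the sketches above.
# train_ds, val_ds = hw3_utils.torch_digits()
# net = ResNet(16)

def run(net, train_ds, val_ds, epochs=30, lr=0.005, batch_size=16):
    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=False)
    val_dl = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
    train_losses, val_losses = [], []
    for _ in range(epochs):
        train_losses.append(fit(net, loss_fn, optimizer, train_dl))
        val_losses.append(validate(net, loss_fn, val_dl))
    plt.plot(train_losses, label="train")
    plt.plot(val_losses, label="validation")
    plt.xlabel("epoch"); plt.ylabel("cross entropy loss"); plt.legend()
    plt.show()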
Solution.
3. RBF kernel and nearest neighbors.
(a) [hw3] Recall that given data examples $((x_i, y_i))_{i=1}^n$ and an optimal dual solution $(\hat{\alpha}_i)_{i=1}^n$, the RBF kernel SVM makes a prediction as follows:
\[
  f_\sigma(x)
  = \sum_{i=1}^{n} \hat{\alpha}_i y_i \exp\!\left( -\frac{\|x - x_i\|_2^2}{2\sigma^2} \right)
  = \sum_{i \in S} \hat{\alpha}_i y_i \exp\!\left( -\frac{\|x - x_i\|_2^2}{2\sigma^2} \right),
\]
where S ⊂ {1, 2, . . . , n} is the set of indices of support vectors.
Given an input $x$, let $T := \arg\min_{i \in S} \|x - x_i\|_2$ denote the set of closest support vectors to $x$, and let $\rho := \min_{i \in S} \|x - x_i\|_2$ denote this smallest distance. (In other words, $T := \{ i \in S : \|x - x_i\|_2 = \rho \}$.)
Prove that
\[
  \lim_{\sigma \to 0} \frac{f_\sigma(x)}{\exp\!\left( -\rho^2 / (2\sigma^2) \right)}
  = \sum_{i \in T} \hat{\alpha}_i y_i.
\]
Remark: in other words, when the bandwidth σ becomes small enough, RBF kernel SVM is almost the
1-nearest neighbor predictor with the set of support vectors as the training set.
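A small numerical illustration of this limit is given below; the support vectors, dual values, and query point are made up for the demo.

# Numerical illustration of the limit in (a); support vectors, dual variables,
# and the query point are arbitrary made-up values.
import numpy as np

xs = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])   # support vectors
ys = np.array([1.0, -1.0, 1.0])
alphas = np.array([0.5, 0.7, 0.3])
x = np.array([0.4, 0.1])

dists = np.linalg.norm(xs - x, axis=1)
rho = dists.min()
T = dists == rho                                       # closest support vectors

for sigma in [1.0, 0.5, 0.2, 0.1, 0.05]:
    f = np.sum(alphas * ys * np.exp(-dists**2 / (2 * sigma**2)))
    ratio = f / np.exp(-rho**2 / (2 * sigma**2))
    print(sigma, ratio)                                # approaches sum over T of alpha_i y_i

print(np.sum(alphas[T] * ys[T]))                       # the limit value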
(b) [hw3] Consider the XOR dataset:
\[
\begin{aligned}
  x_1 &= (+1, +1), & y_1 &= +1, \\
  x_2 &= (-1, +1), & y_2 &= -1, \\
  x_3 &= (-1, -1), & y_3 &= +1, \\
  x_4 &= (+1, -1), & y_4 &= -1.
\end{aligned}
\]
Verify that $\hat{\alpha} = (1/\alpha, 1/\alpha, 1/\alpha, 1/\alpha)$ is an optimal dual solution to the RBF kernel SVM, where
\[
  \alpha
  = \left( 1 - \exp\!\left( -\frac{\|x_1 - x_2\|_2^2}{2\sigma^2} \right) \right)^{\!2}
  = \left( 1 - \exp\!\left( -\frac{2}{\sigma^2} \right) \right)^{\!2}
  > 0.
\]
Hint: prove that the gradient of the dual function is 0 at $\hat{\alpha}$. Since the dual function is concave, and $\hat{\alpha} > 0$, it follows that $\hat{\alpha}$ is an optimal dual solution.
Remark: in other words, all four data examples are mapped to support vectors in the reproducing
kernel Hilbert space. In light of (a), when σ is small enough, fσ(x) is almost the 1-nearest neighbor
predictor on the XOR dataset. In fact, it is also true for large σ, due to the symmetry of the XOR data.
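The claim can also be checked numerically by evaluating the gradient of the dual objective at $\hat{\alpha}$, assuming the bias-free kernel SVM dual $g(\alpha) = \sum_i \alpha_i - \tfrac12 \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$ suggested by the hint; the bandwidth below is an arbitrary choice.

# Numerical check of (b): at alpha_hat = (1/alpha, ..., 1/alpha) the gradient of
# the (bias-free) SVM dual g(a) = sum_i a_i - (1/2) sum_{i,j} a_i a_j y_i y_j k(x_i, x_j)
# is zero, as the hint requires. The bandwidth is an arbitrary choice.
import numpy as np

sigma = 0.8
X = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
y = np.array([1.0, -1.0, 1.0, -1.0])

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))           # RBF kernel matrix

alpha = (1 - np.exp(-2 / sigma ** 2)) ** 2
a_hat = np.full(4, 1 / alpha)

grad = 1 - y * (K @ (a_hat * y))                   # gradient of the dual at a_hat
print(grad)                                        # ~ [0, 0, 0, 0]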
Solution.
4. LLM Use and Other Sources.
[hw3] Please document, in detail, all your sources, including LLMs, friends, internet resources, etc. For
example:
1a. I asked my friend, then I found a different way to derive the same solution.
1b. ChatGPT 4o solved the problem in one shot, but then I rewrote it once on paper, and a few days later
tried to re-derive an answer from scratch.
1c. I accidentally found this via a google search, and had trouble forgetting the answer I found, but still
typed it from scratch without copy-paste.
1d. ...
4. I used my solution to hw1 problem 5 to write this answer.
Solution.
References
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

