Description
1. Neural Networks
Consider a basic multi-layer perceptron (MLP) with three layers, characterized by the following weight matrices:
• W0 ∈ R^{2×2}
• W1 ∈ R^{3×2}
• W2 ∈ R^{1×3}
Answer the following questions regarding the structure and behavior of this neural network.
1.1 Basic MLP
a) How many weights are there in this neural network? Show the calculation.
b) Compute the output of the network ŷ for the input x = (1, 1)⊤. Use the weights:

W_0 = \begin{pmatrix} 0.1 & 0.4 \\ -0.5 & 0.6 \end{pmatrix}, \quad W_1 = \begin{pmatrix} 0.2 & -0.1 \\ 0.5 & 0.4 \\ 0.3 & -0.6 \end{pmatrix}, \quad W_2 = \begin{pmatrix} 0.7 & -0.4 & 0.9 \end{pmatrix}

Assume ReLU activation and show the calculation steps.
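As a sanity check on the hand calculation, the forward pass can be scripted. Below is a minimal NumPy sketch, assuming no bias terms and a linear output layer (the question only states ReLU, so the linear output is an assumption here):

    import numpy as np

    # Weights from part (b); no bias terms are given, so none are used.
    W0 = np.array([[0.1, 0.4], [-0.5, 0.6]])
    W1 = np.array([[0.2, -0.1], [0.5, 0.4], [0.3, -0.6]])
    W2 = np.array([[0.7, -0.4, 0.9]])

    def relu(z):
        return np.maximum(z, 0.0)

    x = np.array([1.0, 1.0])
    a0 = relu(W0 @ x)      # layer 0: z0 = W0 x, then ReLU
    a1 = relu(W1 @ a0)     # layer 1: z1 = W1 a0, then ReLU
    y_hat = W2 @ a1        # output layer assumed linear here
    print(y_hat)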
c) Compare the effect of using ReLU versus sigmoid activation in layer 1 only. Compute the output for the
same network and input x, and summarize the differences.
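Reusing the definitions from the sketch above, only the layer-1 activation changes for part (c):

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    a1_sig = sigmoid(W1 @ a0)   # layer 1 swapped to sigmoid; layer 0 keeps ReLU
    y_hat_sig = W2 @ a1_sig
    print(y_hat_sig)

Note that sigmoid outputs are never exactly zero, so all three hidden units contribute to ŷ regardless of the sign of z1.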
d) Compute ∂L/∂W2, assuming L = ½(ŷ − y)². Show its shape and the intermediate steps for the forward and backward passes.
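Since ŷ = W2 a1, the chain rule gives ∂L/∂W2 = (ŷ − y) a1⊤, which has the same 1×3 shape as W2. A short PyTorch sketch to cross-check a hand computation; the target y = 1.0 is an arbitrary illustrative choice, not a value given in the assignment:

    import torch

    W0 = torch.tensor([[0.1, 0.4], [-0.5, 0.6]])
    W1 = torch.tensor([[0.2, -0.1], [0.5, 0.4], [0.3, -0.6]])
    W2 = torch.tensor([[0.7, -0.4, 0.9]], requires_grad=True)

    x = torch.tensor([1.0, 1.0])
    y = torch.tensor(1.0)  # illustrative target; substitute the assignment's value

    a1 = torch.relu(W1 @ torch.relu(W0 @ x))
    y_hat = (W2 @ a1).squeeze()
    loss = 0.5 * (y_hat - y) ** 2
    loss.backward()
    print(W2.grad)         # should match (y_hat - y) * a1, shape (1, 3)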
1.2 Revised MLP-1: MLP with Missing Weights
a) For the following network structure, calculate the output ŷ for x = (1, 1)⊤:

W_0 = \begin{pmatrix} 0.2 & 0.3 \\ ⋆ & -0.1 \end{pmatrix}, \quad W_1 = \begin{pmatrix} 0.4 & ⋆ \\ ⋆ & 0.6 \\ -0.2 & 0.1 \end{pmatrix}, \quad W_2 = \begin{pmatrix} 0.3 & 0.5 & ⋆ \end{pmatrix}

Missing entries (denoted ⋆) are zero. Use ReLU activation.
b) Assume layer-wise normalization (mean subtraction and variance normalization) is applied to z1 before the activation. How does it affect the output? Compute it for the network above.
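A minimal sketch of that normalization, with a small ε added for numerical stability (the ε is an assumption, not part of the question):

    import numpy as np

    def normalize(z, eps=1e-5):
        # subtract the mean and divide by the standard deviation of z's entries
        return (z - z.mean()) / np.sqrt(z.var() + eps)

    # applied before the activation: a1 = relu(normalize(z1))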
c) Debugging and Improving a Neural Network with PyTorch
The code file mlp_weights.py provides a PyTorch template for experimenting with the multi-layer perceptron (MLP) from part (b), which contains missing weights. The network is implemented with explicit forward passes for greater transparency during training.
Task: Debug and modify the given neural network so that it converges effectively during training. Initially, the
missing weights are set to zero, which prevents the network from learning effectively. Your goal is to make
modifications that enable successful convergence and analyze the outcomes.
Constraints:
• Use stochastic gradient descent (SGD) as the optimizer.
• Maintain the three linear transformations defined by weight matrices W0, W1, and W2. Do not add or
remove layers.
• You may modify elements within each layer (e.g., changing activation functions or adding bias terms), but keep the given structure of the weight matrices intact.
Steps to Follow:
i) Initial Run: Run the provided network as-is. Document the loss curve and explain why the network fails to
converge.
ii) Modify and Observe: Make one modification at a time. Document each noteworthy change, including plots
showing the effects on loss and parameter evolution.
iii) Analyze: Compare the different modifications. Which ones improved convergence? Discuss how the initial
zero values for weights led to degenerate learning, and explain how your modifications addressed these
issues.
Note: Since we have only a single input and output, the "stochastic" element of SGD is not present here; it effectively functions as standard gradient descent.
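Since mlp_weights.py itself is not reproduced here, the following is only a generic sketch of the training setup the task describes; TinyMLP is a hypothetical stand-in with the same 2→2→3→1 structure, and it does not reproduce the template's zero-initialized missing entries:

    import torch
    import torch.nn as nn

    class TinyMLP(nn.Module):
        # Three linear maps matching W0 (2x2), W1 (3x2), W2 (1x3); no biases.
        def __init__(self):
            super().__init__()
            self.l0 = nn.Linear(2, 2, bias=False)
            self.l1 = nn.Linear(2, 3, bias=False)
            self.l2 = nn.Linear(3, 1, bias=False)

        def forward(self, x):
            h = torch.relu(self.l0(x))
            h = torch.relu(self.l1(h))
            return self.l2(h)

    model = TinyMLP()
    x = torch.tensor([[1.0, 1.0]])
    y = torch.tensor([[1.0]])  # illustrative single training pair

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # SGD per the constraints
    criterion = nn.MSELoss()

    losses = []
    for step in range(500):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())  # record for the loss-curve plot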
In Your Report:
• Describe the initial issues observed.
• Summarize the modifications you made and why.
• Include plots for the loss curve and parameter evolution.
• Analyze the impact of different changes, focusing particularly on the role of initialization in achieving effective learning.
Time Frame: Spend approximately 30-60 minutes experimenting until you achieve a satisfactory solution.
d) Exploring the Role of Stochasticity in Overcoming Initialization Issues
Consider the problem of zero initialization in neural networks. Could true stochastic gradient descent (SGD) overcome the issue of zero-initialized weights? To better understand this, answer the multiple-choice questions by editing the text file sgd_mc_answers.txt located in the A3 folder. Do not answer them in the report you submit.
Below are the multiple-choice questions included in the file:
a) Why is stochastic gradient descent (SGD) called "stochastic"?
(a) Because it uses random mini-batches of data to approximate the gradient.
(b) Because it computes the gradient over the entire dataset.
(c) Because the learning rate changes at each step.
(d) Because it always converges faster.
b) What is the effect of initializing all weights to zero in a neural network?
(a) All neurons in a layer receive identical gradients and learn the same features.
(b) The network will converge faster due to symmetry.
(c) Stochastic updates will make neurons learn distinct features.
(d) Zero initialization always helps prevent overfitting.
c) Could true stochastic gradient descent overcome the problem of zero-initialized weights?
(a) Yes, because random mini-batches would introduce sufficient variability.
(b) No, because all neurons have identical weights and thus receive identical gradients, regardless of
mini-batch variability.
(c) Yes, because SGD inherently breaks symmetry.
(d) No, but adding bias terms would always fix it.
2. SVM
Support Vector Machines (SVM) Assignment
This assignment will focus on applying Support Vector Machines (SVM) to a dataset representing simplified
customer behavior metrics. The objective is to train an SVM classifier to distinguish between two customer
segments based on their engagement level with a service.
The dataset, D, contains the following features for each customer:
• x1: Average number of visits to the service per week.
• x2: Average spending per visit (in tens of dollars).
The labels are defined as follows:
• +1: High-engagement customer segment.
• −1: Low-engagement customer segment.
The given dataset consists of the following samples:
D = {((2, 6), 1), ((6, 2), −1), ((4, 6), 1), ((5, 3), −1)}
Recall the soft-margin SVM formulation:

\min_{w,b} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i

subject to:

y_i(w^\top x_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \forall i
a) Feasibility of (w1, b1)
Given w1 = (0.5, −0.5)⊤ and b1 = 1, determine whether (w1, b1) is a feasible solution to the soft-margin SVM problem, i.e., verify whether all data points satisfy the constraint for the given w1 and b1.
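A minimal NumPy check of the constraints, evaluating y_i(w⊤x_i − b) for every sample:

    import numpy as np

    X = np.array([[2, 6], [6, 2], [4, 6], [5, 3]], dtype=float)
    y = np.array([1, -1, 1, -1], dtype=float)

    w1 = np.array([0.5, -0.5])
    b1 = 1.0

    margins = y * (X @ w1 - b1)
    print(margins)  # entries >= 1 satisfy the constraint with xi_i = 0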
b) Determine the Optimal Solution (w*, b*)
Determine the optimal solution (w*, b*) for the soft-margin SVM problem on the given dataset. Use a linear kernel for simplicity. Create a coordinate system showing the data points (marking positive and negative examples), the optimal decision boundary w*⊤x − b* = 0, and the margin boundaries parallel to the decision boundary at distance 1/∥w*∥ from it.
• Assume C = 1. Explain the impact of different values of C on the decision boundary.
• Calculate the objective value of the primal solution.
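One way to sanity-check a hand-derived solution (not a substitute for the requested derivation and plot) is scikit-learn's linear SVC, if it is available in your environment; note that scikit-learn uses the convention w⊤x + b, whereas this assignment writes w⊤x − b:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2, 6], [6, 2], [4, 6], [5, 3]], dtype=float)
    y = np.array([1, -1, 1, -1])

    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y)
    print(clf.coef_, clf.intercept_)  # w and +b in sklearn's sign convention
    print(clf.support_vectors_)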
c) Support Vectors and Their Role
Identify the support vectors for the given dataset and explain their role in defining the decision boundary. Which
data points are support vectors in this case, and why?
d) Exploring the Dual SVM Solution
Determine the optimal objective value for the dual SVM problem and find the corresponding dual variables (λ1, λ2, ...). Use the generalized Lagrangian approach to derive the weight vector, and explain how the dual formulation relates to the primal SVM solution.
Use the provided Python script (solve_lambdas.py) to determine the optimal objective value for the dual SVM problem. Complete the TODO in the code to define the Gram (kernel) matrix K. Once you have filled this in, run the code to compute the dual variables (λ1, λ2, ...).
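The exact form of the TODO depends on how solve_lambdas.py sets up the dual, but with a linear kernel the Gram matrix is the matrix of pairwise inner products, which some formulations pre-multiply by the labels; a sketch of both variants:

    import numpy as np

    X = np.array([[2, 6], [6, 2], [4, 6], [5, 3]], dtype=float)
    y = np.array([1, -1, 1, -1], dtype=float)

    K = X @ X.T             # linear kernel: K[i, j] = x_i . x_j
    H = np.outer(y, y) * K  # label-scaled variant: H[i, j] = y_i y_j x_i . x_j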
Tasks:
a) Fill in the Kernel Computation: Complete the computation of the Gram matrix K in solve_lambdas.py.
b) Run the Code: Execute the script to determine the optimal λi values and identify the support vectors.
c) Derive the Weight Vector: Using the generalized Lagrangian for the SVM, take the partial derivative with respect to the weight vector w and set it to zero to derive the weight vector. Refer to the lecture slides on the SVM dual for guidance on the critical-point condition involving w and λi; this step links the dual variables λi to the weight vector (see the sketch after this list).
d) Verification Task: Use the λi values computed by the script to manually verify the optimal weight vector w* via the derived formula. Compare your calculation to the value computed by the code.
e) Relationship Between Primal and Dual: Explain how the derived w* relates to the primal SVM problem and how the support vectors influence the solution.
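For task c), setting the derivative of the Lagrangian with respect to w to zero yields the standard stationarity condition w* = Σ_i λi yi xi. A minimal verification sketch for task d); the lambdas array is a placeholder to be replaced with the values printed by solve_lambdas.py:

    import numpy as np

    X = np.array([[2, 6], [6, 2], [4, 6], [5, 3]], dtype=float)
    y = np.array([1, -1, 1, -1], dtype=float)

    # Placeholder: substitute the lambda_i values printed by solve_lambdas.py.
    lambdas = np.array([0.0, 0.0, 0.0, 0.0])

    w_star = (lambdas * y) @ X  # w* = sum_i lambda_i y_i x_i
    print(w_star)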
f) Considerations for Scaling to Larger Datasets
Discuss how the analysis methods used in this assignment would change if the dataset were significantly larger,
for instance, involving 100,000 customers. Consider aspects such as computational efficiency, model complexity,
and practical implementation challenges.