CS/DS541 Homework 1 to 4 solution

$90.00


Download Details:

  • Name: HomeWorks-crthn2.zip
  • Type: zip
  • Size: 18.54 MB


Description


Homework 1 – Deep Learning CS/DS541

1. Python and Numpy Warm-up Exercises [20 pts]: This part of the homework is intended to help
you practice your linear algebra and how to implement linear algebraic and statistical operations in
Python using numpy (to which we refer in the code below as np). For each of the problems below, write
a method (e.g., problem 1a) that returns the answer for the corresponding problem.
In all problems, you may assume that the dimensions of the matrices and/or vectors that are given
as input are compatible for the requested mathematical operations. You do not need to perform
error-checking.
Note 1: In mathematical notation we usually start indices with j = 1. However, in numpy (and many
other programming settings), it is more natural to use 0-based array indexing. When answering the
questions below, do not worry about “translating” from 1-based to 0-based indexes. For example, if
the (i, j)th element of some matrix is requested, you can simply write A[i,j].
Note 2: To represent and manipulate vectors and matrices, please use numpy’s array class (not the
matrix class).
Note 3: While the difference between a row vector and a column vector is important when doing math,
numpy does not care about this difference as long as the array is 1-D. This means, for example, that
if you want to compute the inner product between two vectors x and y, you can just write x.dot(y)
without needing to transpose the x. If x and y are 2-D arrays, however, then it does matter whether
they are row-vectors or column-vectors, and hence you might need to transpose accordingly.
(a) Given two matrices A and B, compute and return an expression for A + B. [ 0 pts ]
Answer : While it is completely valid to use np.add(A, B), this is unnecessarily verbose; you really
should make use of the “syntactic sugar” provided by Python’s/numpy’s operator overloading and
just write: A + B. Similarly, you should use the more compact (and arguably more elegant)
notation for the rest of the questions as well.
(b) Given matrices A, B, and C, compute and return AB − C (i.e., right-multiply matrix A by
matrix B, and then subtract C). Use dot or np.dot. [ 1 pts ]
(c) Given matrices A, B, and C, return A⊙B+C⊤, where ⊙ represents the element-wise (Hadamard)
product and ⊤ represents matrix transpose. In numpy, the element-wise product is obtained simply
with *. [ 1 pts ]
(d) Given column vectors x and y, compute the inner product of x and y (i.e., x⊤y). [ 1 pts ]
(e) Given matrix A and integer i, return the sum of all the entries in the ith row whose column index is even, i.e., Σ_{j : j is even} A_ij. Do not use a loop, which in Python can be very slow. Instead use the np.sum function. [ 2 pts ]
(f) Given matrix A and scalars c, d, compute the arithmetic mean over all entries of A that are between c and d (inclusive). In other words, if S = {(i, j) : c ≤ A_ij ≤ d}, then compute (1/|S|) Σ_{(i,j)∈S} A_ij. Use np.nonzero along with np.mean. [ 2 pts ]
(g) Given an (n×n) matrix A and integer k, return an (n×k) matrix containing the right-eigenvectors
of A corresponding to the k eigenvalues of A with the largest absolute value. Use np.linalg.eig.
[ 2 pts ]
(h) Given a column vector (with n components) x, an integer k, and positive scalars m, s, return an (n × k) matrix, each of whose columns is a sample from the multidimensional Gaussian distribution N(x + mz, sI), where z is a column vector (with n components) containing all ones and I is the identity matrix. Use either np.random.multivariate_normal or np.random.randn. [ 2 pts ]
(i) Given a matrix A with n rows, return a matrix that results from randomly permuting the
columns (but not the rows) in A. [ 2 pts]
(j) Z-scoring: Given a vector x, return a vector y such that each y_i = (x_i − x̄)/σ, where x̄ is the mean (use np.mean) of the elements of x and σ is the standard deviation (use np.std). [ 2 pts ]
(k) Given an n-vector x and a non-negative integer k, return an n × k matrix consisting of k copies of x. You can use numpy methods such as np.newaxis, np.atleast_2d, and/or np.repeat. [ 2 pts ]
(l) Given a k × n matrix X = [x^(1) . . . x^(n)] and a k × m matrix Y = [y^(1) . . . y^(m)], compute an n × m matrix D whose (i, j)th entry is the pairwise L2 distance d_ij = ∥x^(i) − y^(j)∥_2. In this problem you may not use loops. Instead, you can avail yourself of numpy objects & methods such as np.newaxis, np.atleast_3d, np.repeat, np.swapaxes, etc. (There are various ways of solving it.) Hint: from X (resp. Y), construct a 3-d matrix that contains multiple copies of each of the vectors in X (resp. Y); then subtract these 3-d matrices. [ 3 pts ]
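A minimal numpy sketch of a few of the trickier warm-up parts is given below; the function names are just illustrative, and part (e) assumes the 0-based reading of "even column index" mentioned in Note 1.

    import numpy as np

    def problem_1e(A, i):
        # Sum of the ith row of A over even column indices, without a Python loop
        return np.sum(A[i, ::2])

    def problem_1f(A, c, d):
        # Mean over all entries of A with c <= A_ij <= d, via np.nonzero and np.mean
        idx = np.nonzero((A >= c) & (A <= d))
        return np.mean(A[idx])

    def problem_1l(X, Y):
        # All pairwise L2 distances between the columns of X (k x n) and Y (k x m);
        # the broadcasted difference has shape (k, n, m), and the result is (n, m).
        diff = X[:, :, np.newaxis] - Y[:, np.newaxis, :]
        return np.sqrt(np.sum(diff ** 2, axis=0))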
2. Training 2-Layer Linear Neural Networks with Stochastic Gradient Descent [25 pts]:
(a) Train an age regressor that analyzes a (48 × 48 = 2304)-pixel grayscale face image and outputs a real number ŷ that estimates how old the person is (in years). The training and testing data are available here:
• https://s3.amazonaws.com/jrwprojects/age_regression_Xtr.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_ytr.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_Xte.npy
• https://s3.amazonaws.com/jrwprojects/age_regression_yte.npy
Your prediction model g should be a 2-layer linear neural network that computes ŷ = g(x; w) = x⊤w + b, where w is the vector of weights and b is the bias term. The cost function you should optimize is

f_MSE(w, b) = (1/(2n)) Σ_{i=1}^n (ŷ^(i) − y^(i))²

where n is the number of examples in the training set D_tr = {(x^(1), y^(1)), . . . , (x^(n), y^(n))}, each x^(i) ∈ R^2304 and each y^(i) ∈ R. To optimize the weights, you should implement stochastic gradient descent (SGD).
Note: you must complete this problem using only linear algebraic operations in numpy – you may not use any off-the-shelf linear regression or neural network software, as that would defeat the purpose.
There are several different hyperparameters that you will need to optimize:
• Mini-batch size ñ.
• Learning rate ϵ.
• Number of epochs.
In order not to cheat (in the machine learning sense) – and thus overestimate the performance of the network – it is crucial to optimize the hyperparameters only on a validation set. (The training set would also be acceptable but typically leads to worse performance.) To create a validation set, simply set aside a fraction (e.g., 20%) of the age_regression_Xtr.npy and age_regression_ytr.npy files to be the validation set; the remainder (80%) of these data files will constitute the “actual” training data. While there are fancier strategies (e.g., Bayesian optimization) that can be used for hyperparameter optimization, it’s often effective to just use a grid search over a few values for each hyperparameter. In this problem, you are required to explore systematically (e.g., using nested for loops) at least 2 different values for each hyperparameter.
Performance evaluation: Once you have tuned the hyperparameters and optimized the weights and bias term so as to minimize the cost on the validation set, then: (1) stop training the network and (2) evaluate the network on the test set. Report both the training and the test f_MSE in the PDF document you submit, as well as the training cost values for the last 10 iterations of gradient descent (just to show that you actually executed your code).
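A minimal SGD sketch for this model is shown below, assuming X has shape (n, 2304) and y has shape (n,); the hyperparameter values are placeholders to be tuned via the validation-set grid search described above, not recommended settings.

    import numpy as np

    def train_age_regressor(X, y, lr=1e-3, batch_size=64, n_epochs=50):
        n, d = X.shape
        w = 0.01 * np.random.randn(d)             # small random initial weights
        b = 0.0
        for epoch in range(n_epochs):
            perm = np.random.permutation(n)
            for start in range(0, n, batch_size):
                idx = perm[start:start + batch_size]
                Xb, yb = X[idx], y[idx]
                resid = Xb.dot(w) + b - yb            # (ŷ − y) on the mini-batch
                w -= lr * Xb.T.dot(resid) / len(idx)  # gradient of f_MSE w.r.t. w on the mini-batch
                b -= lr * np.mean(resid)              # gradient of f_MSE w.r.t. b on the mini-batch
        return w, b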
3. Gradient descent: what can go wrong? [30 pts] Please enter your code in a file named homework1_problem3.py, with one Python function (e.g., problem3a) for each subproblem.
(a) [10 pts]: The graph below plots the function f(x) that is defined piece-wise as:

f(x) =
  −x³                             : x < −0.1
  −3x/100 − 1/500                 : −0.1 ≤ x < 3
  −(x − 31/10)³ − 23/250          : 3 ≤ x < 5
  (1083/200)(x − 6)² − 6183/500   : x ≥ 5

[Figure: plot of f(x) for x roughly in [−4, 10], with f(x) ranging from 0 to about 60.]

As you can see, the function has a long nearly flat section (sometimes known as a plateau) just before the minimum. (The ugly constants in f were chosen to give rise to these characteristics while ensuring that f remains differentiable.) Plateaux can cause big problems during optimization. To show this:
i. Derive by hand the (piece-wise) function ∇f and implement it in Python/numpy.
ii. Use your implementation of ∇f to conduct gradient descent for T = 100 iterations. Always start from an initial x = −3. Try using various learning rates: 1e-3, 1e-2, 1e-1, 1e0, 1e1. Plot f, ∇f, as well as superimposed dots that show the sequence ((x^(1), y^(1)), . . . , (x^(T), y^(T))) of gradient descent. Use plt.legend to indicate which scatter plot corresponds to which learning rate.
iii. Describe in 1-2 sentences what you observe during gradient descent for the set of learning
rates listed above.
iv. Find a learning rate ϵ for which gradient descent successfully converges to min f(x), and
report ϵ in the PDF file.
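One possible sketch for part (a) is below; grad_f simply differentiates each piece of f given above, and float64 is used so that a diverging run overflows to inf rather than raising an error.

    import numpy as np
    import matplotlib.pyplot as plt

    def f(x):
        if x < -0.1:
            return -x**3
        elif x < 3:
            return -3*x/100 - 1/500
        elif x < 5:
            return -(x - 31/10)**3 - 23/250
        else:
            return (1083/200)*(x - 6)**2 - 6183/500

    def grad_f(x):
        # Piece-wise derivative of f
        if x < -0.1:
            return -3*x**2
        elif x < 3:
            return -3/100
        elif x < 5:
            return -3*(x - 31/10)**2
        else:
            return (1083/100)*(x - 6)

    T = 100
    for lr in [1e-3, 1e-2, 1e-1, 1e0, 1e1]:
        x = np.float64(-3.0)
        xs, ys = [], []
        for _ in range(T):
            xs.append(x)
            ys.append(f(x))
            x = x - lr * grad_f(x)
        plt.scatter(xs, ys, s=10, label=f"lr={lr}")

    xgrid = np.linspace(-4, 10, 500)
    plt.plot(xgrid, [f(v) for v in xgrid], label="f")
    plt.plot(xgrid, [grad_f(v) for v in xgrid], label="grad f")
    plt.legend()
    plt.show()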
(b) [8 pts]: Even a convex paraboloid – i.e., a parabola in multiple dimensions that has only one local minimum and no plateaux – can cause problems for “vanilla” SGD (i.e., the kind we’ve learned in class so far). Examine the scatter-plot below, which shows the sequence ((x^(1), y^(1)), . . . , (x^(T), y^(T))) of gradient descent on a convex paraboloid f, starting at x^(1) = [1, −3]⊤, where each x ∈ R². The descent produces a zig-zag pattern that takes a long time to converge to the local minimum.
[Figure: scatter-plot of the gradient descent sequence in the (x1, x2) plane; x1 ranges roughly from −4 to 4 and x2 from −3 to 4.]
i. Speculate how the SGD trajectory would look if the learning rate were made to be very small
(e.g., 100x smaller than in the figure above).
ii. Let f(x1, x2) = a1(x1 − c1)² + a2(x2 − c2)². Pick values for a1, a2, c1, c2 so that – when the scatter-plot shown above is superimposed onto it – the gradient descent is realistic for f. Rather than just guess randomly, consider: why would the zig-zag be stronger in one dimension than another, and how would this be reflected in the function’s constants? Plot a contour graph using plt.contour and superimpose the scatter-plot using plt.scatter. You can find the gradient descent sequence in gradient_descent_sequence.txt. Note that you are not required to find the exactly correct constants or even to estimate them algorithmically. Rather, you can combine mathematical intuition with some trial-and-error. Your solution should just look visually “pretty close” to get full credit. Note: to ensure proper rendering, use plt.axis('equal') right before calling plt.show().
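A plotting sketch for part (b)(ii) follows. The constants a1, a2, c1, c2 are placeholders to be tuned by eye, and gradient_descent_sequence.txt is assumed here to contain whitespace-separated (x1, x2) rows (check the actual file format).

    import numpy as np
    import matplotlib.pyplot as plt

    a1, a2, c1, c2 = 1.0, 10.0, 0.0, 0.0                # placeholder guesses, not the required answer

    seq = np.loadtxt("gradient_descent_sequence.txt")   # assumed: one (x1, x2) pair per row
    x1 = np.linspace(-5, 5, 200)
    x2 = np.linspace(-4, 4, 200)
    X1, X2 = np.meshgrid(x1, x2)
    F = a1 * (X1 - c1) ** 2 + a2 * (X2 - c2) ** 2

    plt.contour(X1, X2, F, levels=30)
    plt.scatter(seq[:, 0], seq[:, 1], c="red", s=10)
    plt.axis('equal')
    plt.show()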
(c) [6 pts]: This problem is inspired by this paper. Consider the function f(x) = (2/3)|x|^(3/2). Derive ∇f, implement gradient descent, and plot the descent trajectories ((x^(1), y^(1)), . . . , (x^(T), y^(T))) for a variety of learning rates 1e-3, 1e-2, 1e-1 and a variety of starting points. See what trend emerges, and report it in the PDF.
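A small sketch for part (c): since f(x) = (2/3)|x|^(3/2), its derivative is ∇f(x) = sign(x)·√|x|, which the loop below uses.

    import numpy as np

    def grad_f(x):
        return np.sign(x) * np.sqrt(np.abs(x))   # derivative of (2/3)|x|^(3/2)

    def descend(x0, lr, T=100):
        xs = [x0]
        for _ in range(T):
            xs.append(xs[-1] - lr * grad_f(xs[-1]))
        return np.array(xs)

    # e.g., compare descend(1.0, 1e-1), descend(1e-3, 1e-1), descend(-2.0, 1e-2), ...
    # and plot the (x^(t), (2/3)*abs(x^(t))**1.5) pairs for each run.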
(d) [6 pts]: While very (!) unlikely, it is theoretically possible for gradient descent to converge to a
local maximum. Give the formula of a function (in my own solution, I used a degree-4 polynomial),
a starting point, and a learning rate such that gradient descent converges to a local maximum
after 1 descent iteration (i.e., after 1 iteration, it reaches the local maximum, and the gradient
is exactly 0). Prove (by deriving the exact values for the descent trajectory) that this is true.
You do not need to implement this in code (and, in fact, due to finite-precision floating-point
arithmetic, it might not actually converge as intended).
Submission: Create a Zip file containing both your Python and PDF files, and then submit on Canvas. If
you are working as part of a group, then only one member of your group should submit (but make sure you
have already signed up in a pre-allocated team for the homework on Canvas).

Homework 2 – Deep Learning CS/DS541

1. A linear NN will never solve the XOR problem [10 points, on paper]: Read the description of the XOR problem from Section 6.1 of the Deep Learning textbook: https://www.deeplearningbook.org/contents/mlp.html. Assume that you wish to classify the 4 points in this dataset using a 2-layer
linear NN (i.e., the same model as in Homework 1, Problem 2). Then show (by deriving the gradient, setting it to 0, and solving mathematically, not in Python) that the values for w = [w1, w2]⊤ and b that minimize the function f_MSE(w, b) in Equation 6.1 are: w1 = 0, w2 = 0, and b = 0.5 – in other words, the best prediction line is simply flat and always guesses ŷ = 0.5.
2. Derivation of softmax regression gradient updates [20 points, on paper]: As explained in
class, let
W = [w^(1) . . . w^(c)]

be an m × c matrix containing the weight vectors from the c different classes. The output of the softmax regression neural network is a vector with c dimensions such that:

ŷ_k = exp(z_k) / Σ_{k′=1}^c exp(z_k′)    (1)
z_k = x⊤w^(k) + b_k

for each k = 1, . . . , c. Correspondingly, our cost function will sum over all c classes:

f_CE(W, b) = −(1/n) Σ_{i=1}^n Σ_{k=1}^c y_k^(i) log ŷ_k^(i)

Important note: When deriving the gradient expression for each weight vector w^(l), it is crucial to keep in mind that the weight vector for each class l ∈ {1, . . . , c} affects the outputs of the network for every class, not just for class l. This is due to the normalization in Equation 1 – if changing the weight vector increases the value of ŷ_l, then it necessarily must decrease the values of the other ŷ_{l′} for l′ ≠ l.
In this homework problem, please complete the derivation outlined below.
Derivation: For each weight vector w^(l), we can derive the gradient expression as:

∇_{w^(l)} f_CE(W, b) = −(1/n) Σ_{i=1}^n Σ_{k=1}^c y_k^(i) ∇_{w^(l)} log ŷ_k^(i)
                     = −(1/n) Σ_{i=1}^n Σ_{k=1}^c (y_k^(i) / ŷ_k^(i)) ∇_{w^(l)} ŷ_k^(i)

We handle the two cases l = k and l ≠ k separately. For l = k:

∇_{w^(l)} ŷ_k^(i) = complete me…
                  = x^(i) ŷ_l^(i) (1 − ŷ_l^(i))

For l ≠ k:

∇_{w^(l)} ŷ_k^(i) = complete me…
                  = −x^(i) ŷ_k^(i) ŷ_l^(i)

To compute the total gradient of f_CE w.r.t. each w^(l), we have to sum over all examples and over k = 1, . . . , c. (Hint: Σ_k a_k = a_l + Σ_{k≠l} a_k. Also, Σ_k y_k = 1.)

∇_{w^(l)} f_CE(W, b) = −(1/n) Σ_{i=1}^n Σ_{k=1}^c y_k^(i) ∇_{w^(l)} log ŷ_k^(i)
                     = complete me…
                     = −(1/n) Σ_{i=1}^n x^(i) (y_l^(i) − ŷ_l^(i))

Finally, show that

∇_b f_CE(W, b) = −(1/n) Σ_{i=1}^n (y^(i) − ŷ^(i))
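Although this problem is to be done on paper, a quick numerical sanity check of the final expression – in matrix form, ∇_W f_CE = (1/n) X⊤(Ŷ − Y) – can catch sign or indexing mistakes. The sketch below compares that analytical gradient against a finite-difference estimate on tiny random data; the shapes assumed are X (n × m), one-hot Y (n × c), W (m × c), b (c,).

    import numpy as np

    def softmax(Z):
        Z = Z - Z.max(axis=1, keepdims=True)      # subtract row-wise max for numerical stability
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    def f_ce(X, Y, W, b):
        Yhat = softmax(X.dot(W) + b)
        return -np.mean(np.sum(Y * np.log(Yhat), axis=1))

    def grad_W(X, Y, W, b):
        Yhat = softmax(X.dot(W) + b)
        return X.T.dot(Yhat - Y) / X.shape[0]     # (1/n) X^T (Yhat - Y)

    rng = np.random.default_rng(0)
    n, m, c = 5, 4, 3
    X = rng.normal(size=(n, m))
    Y = np.eye(c)[rng.integers(0, c, size=n)]     # random one-hot labels
    W, b = rng.normal(size=(m, c)), np.zeros(c)

    eps = 1e-6
    num = np.zeros_like(W)
    for i in range(m):
        for k in range(c):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, k] += eps
            Wm[i, k] -= eps
            num[i, k] = (f_ce(X, Y, Wp, b) - f_ce(X, Y, Wm, b)) / (2 * eps)

    print(np.max(np.abs(num - grad_W(X, Y, W, b))))   # should be very small (~1e-9)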
3. Implementation of softmax regression [25 points, in Python code]:
Train a 2-layer softmax neural network to classify images of fashion items (10 different classes, such as shoes, t-shirts, dresses, etc.) from the Fashion MNIST dataset. The input to the network will be a 28 × 28-pixel image (converted into a 784-dimensional vector); the output will be a vector of 10 probabilities (one for each class). The cross-entropy loss function¹ that you minimize should be

f_CE(w^(1), . . . , w^(10), b^(1), . . . , b^(10)) = −(1/n) Σ_{i=1}^n Σ_{k=1}^{10} y_k^(i) log ŷ_k^(i) + (α/2) Σ_{k=1}^c w^(k)⊤ w^(k)

where n is the number of examples and α is a regularization constant. Note that each ŷ_k implicitly depends on all the weights W = [w^(1), . . . , w^(10)] and biases b = [b^(1), . . . , b^(10)].
¹ In this equation, the regularization term is not divided by n like in the lecture notes. Either equation is valid since the 1/n can be subsumed into α. Here, for simplicity, the 1/n is omitted.
To get started, first download the Fashion MNIST dataset from the following web links:
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_train_images.npy
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_train_labels.npy
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_test_images.npy
• https://s3.amazonaws.com/jrwprojects/fashion_mnist_test_labels.npy
These files can be loaded into numpy using np.load. Each “labels” file consists of a 1-d array containing
n labels (valued 0-9), and each “images” file contains a 2-d array of size n×784, where n is the number
of images.
Next, implement stochastic gradient descent (SGD) to minimize the cross-entropy loss function on this dataset. Regularize the weights but not the biases. Optimize the same hyperparameters as in homework 1 problem 2 (age regression). You should also use the same methodology as for the previous homework, including the splitting of the training files into validation and training portions.
Performance evaluation: Once you have tuned the hyperparameters and optimized the weights so as to maximize performance on the validation set, then: (1) stop training the network and (2) evaluate the network on the test set. Record the performance both in terms of (unregularized) cross-entropy loss (smaller is better) and percent correctly classified examples (larger is better); put this information into the PDF you submit.
Hint 1: it accelerates training if you first normalize all the pixel values of both the training and testing data by dividing each pixel by 255. Hint 2: when using functions like np.sum and np.mean, make sure you know what the axis and keepdims parameters mean and that you use them in a way that is consistent with the math!
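A sketch of one regularized SGD step for this problem is shown below (weights regularized, biases not), assuming X_batch is (ñ × 784), Y_batch is one-hot (ñ × 10), W is (784 × 10), and b is (10,). Per Hint 1, X_batch would already be divided by 255.

    import numpy as np

    def softmax(Z):
        Z = Z - Z.max(axis=1, keepdims=True)
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    def sgd_step(W, b, X_batch, Y_batch, lr, alpha):
        Yhat = softmax(X_batch.dot(W) + b)
        n_b = X_batch.shape[0]
        grad_W = X_batch.T.dot(Yhat - Y_batch) / n_b + alpha * W   # L2 term applies to W only
        grad_b = np.mean(Yhat - Y_batch, axis=0)                   # biases are not regularized
        return W - lr * grad_W, b - lr * grad_b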
4. Logistic Regression [15 points, on paper]: Consider a 2-layer neural network that computes the function

ŷ = σ(x⊤w + b)

where x is an example, w is a vector of weights, b is a bias term, and σ is the logistic sigmoid function. Assume we train this network using the log loss, as described in class. Moreover, suppose all the
training examples are positive. Answer the following questions about convergence. (Informally,
a sequence of numbers converges if it gets closer and closer to a specific number as the sequence
progresses. A sequence that does not converge can do different things, e.g., change erratically, or grow
towards +/−∞.) While you are not required to give formal proofs, you should explain your reasoning,
which could either be a mathematical argument or a simulation result. Put your answers into your
PDF file.
(a) Given a well-chosen learning rate: what value will the training loss converge to during gradient
descent?
(b) Given a well-chosen learning rate: will b always converge; does convergence depend on the exact
training examples; or does it never converge?
(c) Suppose the training set contains exactly 2 examples, x^(1), x^(2) ∈ R². Give specific non-zero values for these training data such that:
i. w will converge during gradient descent (given a well-chosen learning rate).
ii. w will not converge during gradient descent (no matter what the learning rate).
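If you choose to support your reasoning with a simulation (as the problem allows), a small sketch is below; the two example vectors are made-up illustrative values, not an answer to part (c).

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 1.0]])   # two training examples (illustrative values only)
    y = np.array([1.0, 1.0])                 # all labels positive, as in the problem
    w, b, lr = np.zeros(2), 0.0, 0.1

    for t in range(20000):
        yhat = 1.0 / (1.0 + np.exp(-(X.dot(w) + b)))
        w -= lr * X.T.dot(yhat - y) / len(y)  # gradient of the mean log loss w.r.t. w
        b -= lr * np.mean(yhat - y)           # gradient of the mean log loss w.r.t. b

    print(w, b)   # inspect whether w and b settle down or keep growing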
Create a Zip file containing both your Python and PDF files, and then submit on Canvas. If you are
working as part of a group, then only one member of your group should submit (but make sure you have
already signed up in a pre-allocated team for the homework on Canvas). Please name your submission files
using your names in the following format: (<student1’s first name student1’s last name> <student2’s first
name student2’s last name>). For example: jacob_whitehill_yao_su.zip

Homework 3 – Deep Neural Networks CSDS/541

1. Feed-forward neural network [60 points]: In this problem you will train a multi-layer neural network to classify images of fashion items (10 different classes) from the Fashion MNIST dataset. Similarly to Homework 2, the input to the network will be a 28 × 28-pixel image; the output will be a vector of 10 class probabilities. Specifically, the network you create should implement a function f : R^784 → R^10, where:
z^(1) = W^(1) x + b^(1)
h^(1) = relu(z^(1))
z^(2) = W^(2) h^(1) + b^(2)
. . .
z^(l) = W^(l) h^(l−1) + b^(l)
ŷ = softmax(z^(l))
The network specified above is shown in the figure below:

[Figure: a fully-connected network with input x, hidden layers (z^(1), h^(1)), (z^(2), h^(2)), . . . , output layer z^(l) producing ŷ, with weight matrices W^(1), . . . , W^(l) and bias vectors b^(1), . . . , b^(l).]

As usual, the (unregularized) cross-entropy cost function should be

f_CE(W^(1), b^(1), . . . , W^(l), b^(l)) = −(1/n) Σ_{i=1}^n Σ_{k=1}^{10} y_k^(i) log ŷ_k^(i)

where n is the number of examples.
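A forward-propagation sketch for this architecture with an arbitrary number of hidden layers is shown below. It assumes one convention among several possible ones: Ws and bs are Python lists [W^(1), . . . , W^(l)] and [b^(1), . . . , b^(l)], and X stores one example per column (shape 784 × n).

    import numpy as np

    def relu(Z):
        return np.maximum(0.0, Z)

    def softmax(Z):
        Z = Z - Z.max(axis=0, keepdims=True)      # column-wise stabilization
        E = np.exp(Z)
        return E / E.sum(axis=0, keepdims=True)

    def forward(X, Ws, bs):
        H = X
        caches = []                               # keep (Z, H) pairs for back-propagation
        for W, b in zip(Ws[:-1], bs[:-1]):
            Z = W.dot(H) + b[:, np.newaxis]
            H = relu(Z)
            caches.append((Z, H))
        Z_last = Ws[-1].dot(H) + bs[-1][:, np.newaxis]
        Yhat = softmax(Z_last)
        return Yhat, caches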
Hyperparameter tuning: In this problem, there are several different hyperparameters and architectural design decisions that will impact the network’s performance:
• Number of hidden layers (suggestions: {3, 4, 5})
• Number of units in each hidden layer (suggestions: {30, 40, 50})
• Learning rate (suggestions: {0.001, 0.005, 0.01, 0.05, 0.1, 0.5})
• Minibatch size (suggestions: 16, 32, 64, 128, 256)
• Number of epochs
• L2 Regularization strength applied to the weight matrices (but not bias terms)
• Frequency & rate of learning rate decay
• Variance & type of random noise added to training examples for data augmentation
• . . .
These can all have a big impact on the test accuracy. In contrast to previous assignments, there is no
specific requirement for how to optimize them. However, in practice it will be necessary to do so in
order to get good results.
Numerical gradient check: To make sure that you are implementing your gradient expressions correctly, you should use the check_grad method (and possibly its sister function, approx_fprime) from scipy.optimize. These methods take a function f (i.e., a Python method that computes a function; in practice, this will be the regularized cross-entropy function you code) as well as a set of points on which to compute f’s derivative (some particular values for the weights and biases). approx_fprime will return the numerical estimate of the gradient of f, evaluated at the points you provided. check_grad also takes another parameter, ∇f, which is what you claim is the Python function that returns the gradient of the function f you passed in. check_grad computes the discrepancy (averaged over a set of points that you specified) between the numerical and analytical derivatives. Both of these methods require that all the parameters of the function (in practice: the weights and biases of your neural network) are “packed” into a single vector (even though the parameters actually constitute both matrices and vectors). For this reason, the starter code we provide includes a method called unpack that takes a vector of numbers and extracts W^(1), W^(2), . . . , as well as b^(1), b^(2), . . . . Note that the training data and training labels are not parameters of the function f whose gradient you are computing/estimating, even though they are obviously needed by the cross-entropy function to do its job. For this reason, we “wrap” the call to f_CE with a Python lambda expression in the starter code.
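A toy usage sketch of check_grad and approx_fprime is below; fCE_packed and gradCE_packed are placeholder stand-ins for your own wrapped cost and gradient functions (e.g., lambdas that close over the training batch and call unpack internally).

    import numpy as np
    from scipy.optimize import check_grad, approx_fprime

    def fCE_packed(p):
        return 0.5 * np.sum(p ** 2)      # placeholder cost of a packed parameter vector

    def gradCE_packed(p):
        return p                          # its analytical gradient

    p0 = np.random.randn(10)              # some particular packed weights/biases
    print(check_grad(fCE_packed, gradCE_packed, p0))   # discrepancy; should be tiny
    print(approx_fprime(p0, fCE_packed, 1e-6))         # numerical gradient estimate at p0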
Your tasks:
(a) Implement stochastic gradient descent (SGD; see Section 5.9 and Algorithm 6.4 in the Deep
Learning textbook, https://www.deeplearningbook.org/) for the multi-layer neural network
shown above. Important: your backprop algorithm must work for any number of hidden layers.
(b) Verify that your gradient function is correct using the check_grad method. In particular, include in your PDF the real-valued output of the call to check_grad for a neural network with 3 hidden layers, each with 64 neurons (the discrepancy between the numerical and analytical derivative for this case should be less than 1e-4).
(c) Include a screenshot in your submitted PDF file showing multiple iterations of SGD (just to show
that you actually ran your code successfully). For each iteration, report both the test accuracy
and test unregularized cross-entropy. For full credit, the accuracy (percentage correctly classified
test images) should be at least 88%.
(d) Visualize the first layer of weights W^(1) that are learned after training your best neural network. In particular, reshape each row of the weights matrix into a 28 × 28 matrix, and then create a “grid” of such images. Include this figure in your PDF. Here is an example (for W^(1) ∈ R^(64×784)).
Recommended strategy: First, (1) implement the forward- and back-propagation phases so that you can perfectly (as verified with check_grad) compute the gradient for just a single training example. You will likely (unless you do everything perfectly from the get-go) need to “break” the gradient vector, as returned by your gradCE and approx_fprime methods, into its individual components (for the individual weight matrices and bias terms) so that you can compare each of them one-by-one and see where any problems lie. Next, (2) you can implement minibatches of size ñ > 1 by simply iterating over the minibatch in a for-loop. Finally – and only after you have correctly implemented step (2) – replace the for-loop (which is relatively slow) with matrix operations that compute the same result.
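A plotting sketch for task (d) is below, assuming W1 is the first-layer weight matrix with shape (num_hidden, 784).

    import numpy as np
    import matplotlib.pyplot as plt

    def show_weight_grid(W1, grid_cols=8):
        num_hidden = W1.shape[0]
        grid_rows = int(np.ceil(num_hidden / grid_cols))
        fig, axes = plt.subplots(grid_rows, grid_cols, figsize=(grid_cols, grid_rows))
        for ax in np.ravel(axes):
            ax.axis("off")                                      # hide all tick marks
        for i in range(num_hidden):
            np.ravel(axes)[i].imshow(W1[i].reshape(28, 28), cmap="gray")
        plt.show()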
In addition to your Python code (homework3_WPIUSERNAME1.py or homework3_WPIUSERNAME1_WPIUSERNAME2.py for teams), create a PDF file (homework3_WPIUSERNAME1.pdf or homework3_WPIUSERNAME1_WPIUSERNAME2.pdf for teams) containing the screenshots described above.

Homework 4 – Deep Neural Networks CSDS/541

1. Visualizing SGD Trajectories of Fully-Connected Neural Networks (FCNNs) [25 pts]:
First, read all the “Introduction to PyTorch” tutorials (see https://pytorch.org/tutorials/beginner/
basics/intro.html), including “Training a classifier”, “Learn the Basics”, “Quickstart”, “Tensors”,
“Datasets & Dataloaders”, “Transforms”, “Build Model”, “Autograd”, “Optimization”, and “Save &
Load Model”. Then complete the following tasks:
(a) [10 pts]: Use PyTorch to build and train a simple FCNN with at least 2 hidden layers to classify the Fashion MNIST dataset (similar to Homework 3). You can use label-preserving transformations such as rotation (make sure not to rotate the images by more than, say, ±10°) to improve generalization. Your network must be fully-connected – it cannot use convolution. Report the test accuracy you get in the PDF.
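A minimal FCNN sketch for part (a) follows; the layer sizes, optimizer, and learning rate are placeholders rather than tuned values, and the training loop itself would follow the linked tutorials.

    import torch
    import torch.nn as nn

    class FCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Flatten(),
                nn.Linear(28 * 28, 256), nn.ReLU(),
                nn.Linear(256, 128), nn.ReLU(),
                nn.Linear(128, 10),          # logits; CrossEntropyLoss applies softmax internally
            )

        def forward(self, x):
            return self.net(x)

    model = FCNN()
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    # Train by iterating over a DataLoader built from torchvision.datasets.FashionMNIST.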
(b) [10 pts]: For any fixed FCNN architecture with at least 2 hidden layers, visualize the gradient descent trajectory in 3-D from two different random parameter initializations. In your plot, you should use the first two axes to represent different values for the NN’s parameters (which we will denote here collectively simply as p) and the third (vertical) axis to represent the cross-entropy f_CE(p). Of course, there are far (!) more than just 2 parameters in the NN, and thus it will be necessary to perform dimensionality reduction. You should use principal component analysis (PCA) to reduce the parameter space down to just 2 dimensions. The two PCs will represent different “directions” along which the NN parameters can vary, where these directions are chosen to minimize the reconstruction error of the data. In general, it will not be the case that a PC corresponds to just a single weight/bias; rather, moving along each PC axis will correspond to changing all of the NN’s parameters at once.
Concrete steps:
i. Run SGD at least two times to collect multiple trajectories of p. (Ask ChatGPT for help on how to extract all the parameters from the entire NN as a single vector.) For each value, save the corresponding training cost f_CE(p) – you will need these later. To keep things tractable, train the network on just 1000 examples of Fashion MNIST.
ii. Use the collected p vectors to estimate the first 2 principal components that map from the full parameter space down to just 2 dimensions (see sklearn.decomposition.PCA).
iii. For each p that was encountered during training, project it into the 2-d space, and then plot it as part of a 3-d scatter plot with its associated f_CE(p) value that was computed during SGD.
iv. To give a sense of the “landscape” through which SGD is “hiking”, compute a dense grid of at least 25 × 25 points in the 2-d PC space. For each point p̃ in this 2-d space, project it back into the full parameter space (see PCA.inverse_transform) to obtain a value for p. Load these parameters p into the NN (ask ChatGPT for help on how to do this). Finally, make a surface plot (use the plot_surface function in matplotlib) showing the corresponding cost values over all points in this grid. Render this surface plot in the same figure as the 3-d scatter plot. Make sure to compute the f_CE values (for both the 3-d scatter plot and the surface plot) on the set of 1000 training data, since that is what SGD directly optimizes. Include your figure in the PDF you submit.
An example figure is shown below:
As you can see, the two SGD trajectories started on different sides of a ridge and ended up
descending into different valleys.
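A sketch of steps ii–iv is below. It assumes step i already produced param_vectors (an array of shape (T, P) with one flattened parameter vector per recorded SGD step), costs (the matching f_CE values), model (the trained FCNN), and X_small, y_small (the 1000 training examples as tensors); those names are placeholders for whatever your own step-i code defines.

    import numpy as np
    import torch
    import torch.nn.functional as F
    from torch.nn.utils import vector_to_parameters
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    def compute_cost(model, X, y):
        with torch.no_grad():
            return F.cross_entropy(model(X), y).item()

    pca = PCA(n_components=2)
    proj = pca.fit_transform(param_vectors)          # (T, 2): the trajectories in PC space

    # Dense grid in the 2-d PC space covering the trajectories
    g1 = np.linspace(proj[:, 0].min(), proj[:, 0].max(), 25)
    g2 = np.linspace(proj[:, 1].min(), proj[:, 1].max(), 25)
    G1, G2 = np.meshgrid(g1, g2)
    Z = np.zeros_like(G1)
    for i in range(G1.shape[0]):
        for j in range(G1.shape[1]):
            p = pca.inverse_transform([[G1[i, j], G2[i, j]]])[0]     # back to full parameter space
            vector_to_parameters(torch.tensor(p, dtype=torch.float32), model.parameters())
            Z[i, j] = compute_cost(model, X_small, y_small)          # f_CE on the same 1000 examples

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.plot_surface(G1, G2, Z, alpha=0.5)
    ax.scatter(proj[:, 0], proj[:, 1], costs, c="red", s=5)          # the recorded SGD points
    plt.show()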
(c) [3 pts]: In the figure you created in part (b), the fCE values of the points in the 3-d scatter plot
(computed during SGD) do not always exactly equal the corresponding values from the surface
plot generated on the dense grid of points. Why is that, and why would it be impractical to
create a surface plot over the grid that exactly matches the “real” fCE values obtained using
SGD? (Think about how PCA works and how it is used here.) Answer in a few sentences in the
PDF.
(d) [2 pts]: Assume that the set of all the p vectors you collected has zero mean (i.e., the sum of all the p vectors equals the zero vector). Let p, p′ represent two different configurations of the NN’s parameters, and let p̃, p̃′ ∈ R² represent their respective projections in the 2-d PC space. Furthermore, let p̂, p̂′ represent the reconstructions (using PCA.inverse_transform) of the NN’s parameters from p̃, p̃′.
Which of the following statements are always true? Report the correct statements in your PDF.
i. If p = 2p′ (i.e., the NN parameters in the first configuration are all twice the magnitude of the corresponding parameters in the second configuration), then p̃ = 2p̃′.
ii. If p̃ = 2p̃′, then p = 2p′.
iii. If p̃ = 2p̃′, then p̂ = 2p̂′.
iv. f_CE(p̂) ≤ f_CE(p).
2. Simple CNN for Fashion MNIST [15 pts]: Read the following PyTorch tutorial: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html.
Then, apply the same methodology to the Fashion MNIST dataset (see the torchvision.datasets.FashionMNIST
class). Note that, since Fashion MNIST images are grayscale, they have just a single color channel
rather than 3. Hence, you will need to adapt the CNN slightly so that the number of input channels
in the first convolutional layer is 1 instead of 3. Also, the normalization step transforms.Normalize
will use just 1 channel instead of 3. Finally, as the image size is different compared to CIFAR10, the
size of the feature maps will also be different. You will thus need to update the number of incoming
neurons to the first fully-connected (nn.Linear) layer accordingly. After making these modifications,
train and evaluate a classifier on this dataset. Report the test accuracy in the PDF you submit.
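One possible adaptation of the tutorial's CNN to 28 × 28 grayscale inputs is sketched below: the first convolution takes 1 input channel instead of 3, and the first fully-connected layer receives 16·4·4 features (28 → 24 after a 5×5 conv, → 12 after pooling, → 8 after the second conv, → 4 after pooling).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 6, 5)        # 1 input channel (was 3 for CIFAR-10)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 4 * 4, 120)  # was 16 * 5 * 5 for 32x32 CIFAR-10 images
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = torch.flatten(x, 1)                # flatten everything except the batch dimension
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            return self.fc3(x)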
3. Supervised Pre-Training and Fine-Tuning of CNNs [15 pts]: Read the following PyTorch
tutorial: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html.
Then, apply the same methodology to the Fashion MNIST dataset. Report the test accuracies in
the PDF you submit, using either (a) fine-tuning of the whole model or (b) training just the final
(classification) layer.
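A sketch of option (b) – training only the final layer of a pretrained ResNet-18 – is below. The grayscale-to-3-channel conversion, input resizing, and ImageNet normalization constants are choices made here so that Fashion MNIST images fit the pretrained model; for option (a), skip the freezing loop and optimize all parameters. (The weights argument assumes torchvision ≥ 0.13.)

    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.Grayscale(num_output_channels=3),   # replicate the single channel to 3
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    train_set = datasets.FashionMNIST("data", train=True, download=True, transform=transform)

    model = models.resnet18(weights="IMAGENET1K_V1")
    for p in model.parameters():
        p.requires_grad = False                        # freeze the pretrained backbone
    model.fc = nn.Linear(model.fc.in_features, 10)     # new, trainable 10-class output layer

    optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    # Training and evaluation then proceed as in the linked transfer-learning tutorial.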
4. CNNs for Behavioral Cloning in Pong [20 pts]: Train an AI agent to play Pong (see https://www.gymlibrary.dev/environments/atari/pong/). In this game, each player can execute one of 6 possible actions at each timestep (NOOP, FIRE, RIGHT, LEFT, RIGHTFIRE, and LEFTFIRE). The goal is to execute the best action at each timestep based on the current state of the game.
To get started, first download the following files:
• https://s3.amazonaws.com/jrwprojects/pong_actions.pt
• https://s3.amazonaws.com/jrwprojects/pong_observations.pt
Together, these files define (image, action) pairs generated by an expert player from the Atari Pong video game. Using PyTorch, implement and train any NN architecture you choose (I recommend a simple CNN) to map the images to their corresponding expert actions. Your NN will implement the control policy that dictates how the agent behaves in different situations, and the approach of training this NN in a supervised manner from expert trajectories is called behavioral cloning. After training your NN, save it to a file (use torch.save). Then, load the model in play_atari.py and see how well your AI “player” does against the computer. To receive full credit, your trained agent should be able to beat the computer (i.e., reach 21 points first). (Note: for this part of the homework, you need to use a standard Python interpreter – Google Colab cannot render the frames of the video game.)
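A behavioral-cloning sketch is below. The exact contents of the two .pt files are assumptions here (observations as an (N, …) image tensor, actions as an (N,) tensor of integer action indices in {0, …, 5}), and a simple fully-connected network is used instead of the recommended CNN so that the code stays agnostic about the observation shape.

    import torch
    import torch.nn as nn
    from torch.utils.data import TensorDataset, DataLoader

    observations = torch.load("pong_observations.pt").float()
    actions = torch.load("pong_actions.pt").long()
    loader = DataLoader(TensorDataset(observations, actions), batch_size=64, shuffle=True)

    in_dim = observations[0].numel()        # flattened size of one observation
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_dim, 256), nn.ReLU(),
        nn.Linear(256, 6),                  # one logit per possible action
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(10):
        for obs, act in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(obs), act)
            loss.backward()
            optimizer.step()

    torch.save(model, "pong_policy.pt")     # then load this saved model inside play_atari.py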
In addition to your Python code (homework4_WPIUSERNAME1.py or homework4_WPIUSERNAME1_WPIUSERNAME2.py for teams), create a PDF file (homework4_WPIUSERNAME1.pdf or homework4_WPIUSERNAME1_WPIUSERNAME2.pdf for teams) containing the screenshots described above.