Description
Policies
Coding: You must write your own code. You may use any numerical linear algebra package, but you may
not use machine learning libraries (e.g. sklearn, tensorflow) unless otherwise specified. In addition, each
student must write and submit their own code in the programming part of the assignment (we may run your
code).
Acknowledgments: We expect you to make an honest effort to solve the problems individually. As we
sometimes reuse problem set questions from previous years, covered by papers and webpages, we expect the
students not to copy, refer to, or look at the solutions in preparing their answers (referring to unauthorized
material is considered a violation of the honor code). Similarly, we expect you not to google directly for
answers (though you are free to google for knowledge about the topic). If you do happen to use other
material, it must be acknowledged here, with a citation on the submitted solution.
Readings
Read the required material.
Required HW submission format:
The required format is as follows:
• Submit all the answers to the HW as one single typed pdf document (not handwritten). This document
must contain all plots. You are encouraged to typeset your work in LaTeX, though you may use another comparable typesetting method.
• Additionally, submit your code as a separate archive file (a .tar or .zip file). All code must be submitted.
The code should be in a runnable file format like .py files or .txt files. Jupyter notebooks are not
acceptable.
• List the names of all people you collaborated with and for which question(s). Please mark this as
Question 0. Each student must understand, write, and hand in their own answers. Also, each student
must understand, write, and hand in their own code.
0 Collaboration and Acknowledgements
List the names of all people you collaborated with and for which question(s). Each student must understand,
write, and hand in their own answers. Also, each student must understand, write, and hand in their own
code.
1 PCA and reconstruction [25 points]
Let’s do PCA and reconstruct the digits in the PCA basis.
As you will be making many visualizations of images, plot these images in a reasonable way (e.g. make the images smaller).
1.1 Matrix Algebra Review [5 points]
The trace of a square matrix $M$, denoted by $\mathrm{Tr}(M)$, is defined as the sum of the diagonal entries of $M$.
1. (2 points) Show that $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$ for an $n \times d$ matrix $A$ and a $d \times n$ matrix $B$.
2. (3 points) Now we prove a few claims that are helpful in the next problem. Define $\Sigma := \frac{1}{n} X^\top X$, where $X$ is the $n \times d$ data matrix (and $d \leq n$). Let $X_i$ be the $i$-th row (so $X_i$ is a $d$-vector). Let $\lambda_1, \lambda_2, \ldots, \lambda_d$ be the eigenvalues of $\Sigma$. Show that:
$$\mathrm{Tr}(\Sigma) = \sum_{i=1}^{d} \lambda_i = \frac{1}{n} \sum_{i=1}^{n} \|X_i\|^2$$
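Before proving the identity, it can help to sanity-check it numerically. A minimal sketch, assuming numpy; the sizes n and d and the random data are placeholders:

```python
import numpy as np

n, d = 100, 20
X = np.random.randn(n, d)                      # placeholder n x d data matrix
Sigma = X.T @ X / n                            # d x d, matching the definition above

eigvals = np.linalg.eigvalsh(Sigma)            # eigenvalues of the symmetric matrix Sigma
lhs = np.trace(Sigma)
rhs = (np.linalg.norm(X, axis=1) ** 2).sum() / n

print(lhs, eigvals.sum(), rhs)                 # all three agree up to floating-point error
```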
1.2 PCA [8 points]
Define Σ, a 784×784 matrix, as follows:
$$\Sigma = \frac{1}{N} \sum_i x_i x_i^\top$$
where the $x$'s are points in our training set.
Now compute the top 50 PCA dimensions; these are the 50 dimensions which best reconstruct the data.
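At this size a full eigendecomposition of the $784 \times 784$ matrix $\Sigma$ is cheap. A minimal sketch of one way to do this, assuming numpy; `X` is a placeholder for the flattened training images and, following the definition above, no mean is subtracted:

```python
import numpy as np

# Placeholder for the (N, 784) matrix of flattened training digits.
X = np.random.rand(1000, 784)

Sigma = X.T @ X / X.shape[0]                 # 784 x 784, as defined above

eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigh returns eigenvalues in ascending order
eigvals = eigvals[::-1]                      # reorder so eigvals[0] corresponds to lambda_1
eigvecs = eigvecs[:, ::-1]

k = 50
U = eigvecs[:, :k]                           # 784 x 50 matrix of top-k eigenvectors
X_proj = X @ U                               # N x 50 projection (used again in Problem 2)
```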
1. (2 points) What are the eigenvalues $\lambda_1$, $\lambda_2$, $\lambda_{10}$, $\lambda_{30}$, and $\lambda_{50}$? Also, what is the sum of eigenvalues $\sum_{i=1}^{d} \lambda_i$? (Hint: use the answer from the previous question.)
2. (3 points) It is straightforward to see that the fractional reconstruction error of using the top $k$ out of $d$ directions is
$$1 - \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}.$$
Plot this fractional reconstruction error for $k$ from 1 to 50 (so the X-axis is $k$ and the Y-axis is the fractional reconstruction error).
3. (1 point) What does the first eigenvalue capture, and why do you think $\lambda_1$ is much larger than the other eigenvalues?
4. (2 points) To view things more easily, plot the fractional reconstruction error for dimensions 2 to 50 (so the X-axis runs from 2 to 50).
1.3 Visualization of the Eigen-Directions [4 points]
Now let us get a sense of what the top PCA directions are capturing (recall these are the directions which capture the most variance).
1. (3 points) Display the first 10 eigenvectors as images (a display sketch follows this list).
2. (1 point) Provide a brief interpretation of what you think they capture.
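For item 1 above, a minimal display sketch, assuming matplotlib and the `eigvecs` array from the earlier PCA sketch (a random placeholder is used here so the snippet runs standalone):

```python
import numpy as np
import matplotlib.pyplot as plt

# eigvecs: 784 x d matrix of eigenvectors, columns sorted by eigenvalue.
eigvecs = np.random.randn(784, 50)           # placeholder; use the matrix from Section 1.2

fig, axes = plt.subplots(2, 5, figsize=(8, 3.5))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(eigvecs[:, i].reshape(28, 28), cmap="gray")
    ax.set_title(f"eigenvector {i + 1}", fontsize=8)
    ax.axis("off")
plt.tight_layout()
plt.show()
```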
1.4 Visualization and Reconstruction [8 points]
1. (1 point) Choose 5 digits (make sure they are from different classes). Visualize these digits. In
particular, include pictures of these 5 digits.
2. (6 points) Now let us visualize these digits after we project them onto the top $k$ eigenvectors of $\Sigma$. In particular, if $U$ is the $d \times k$ matrix of top-$k$ eigenvectors, the reconstruction matrix will be $UU^\top$ (see the sketch after this list).
(a) Do this for k = 2.
(b) Do this for k = 5.
(c) Do this for k = 10.
(d) Do this for k = 20.
(e) Do this for k = 50.
3. (1 point) Provide a brief interpretation, in terms of your perceptions of the quality of these reconstructions.
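As referenced in item 2, a minimal reconstruction sketch, assuming numpy; `U` stands in for the top-50 eigenvector matrix from Section 1.2 and `x` for one flattened digit, both placeholders here:

```python
import numpy as np

# Placeholders; in your code take U from Section 1.2 and x from the training set.
U = np.linalg.qr(np.random.randn(784, 50))[0]   # 784 x 50 with orthonormal columns
x = np.random.rand(784)

for k in (2, 5, 10, 20, 50):
    Uk = U[:, :k]
    x_hat = Uk @ (Uk.T @ x)      # reconstruction U U^T x using the top k directions
    # reshape x_hat to 28 x 28 and display it as in Section 1.3
```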
2 Let’s get to state of the art on MNIST! [45 points]
We will now shoot to get to “state of the art” on MNIST, without “distortions” or “pre-processing”. The table at http://yann.lecun.com/exdb/mnist/ shows which strategies use this pre-processing.
If we wanted even higher accuracy, we should really move to convolutional methods, which we will discuss
later in the class.
SK: I am altering this problem to make it more stable and easier to get to state of the art. If you used the previous kernel method, then I found it harder to get near state of the art (in a robust manner).
I’ll still give credit if you go with the previous version (and with a lower test error). However, here I’ll lay
out a simpler and more robust scheme, which uses Fourier features (these actually work extremely well in
practice).
Dimension Reduction with PCA
Project each image onto the top 50 PCA directions. This reduces the dimension of each image from 784 to 50. The reason for doing this is to speed up the computation. (In this particular case, PCA actually seems to improve classification performance non-trivially.)
Random Fourier Features to approximate the RBF Kernel
Now we will map each $x$ to a $k$-dimensional feature vector as follows:
$$x \to (h_1(x), h_2(x), \ldots, h_k(x))$$
We will use $k = N = 60{,}000$. Each $h_j$ will be of the form:
$$h_j(x) = \sin\left(\frac{v_j \cdot x}{\sigma}\right)$$
To construct all these vectors $v_1, v_2, \ldots, v_k$, you will independently sample every coordinate for every vector from a standard normal distribution (with unit variance). Here, you can think of $\sigma$ as a bandwidth parameter. In practice, setting $\sigma$ is often done with the 'median trick', which uses the median of the pairwise distances (between the $x$'s) in your dataset. As this is a scalar parameter, I often set $\sigma$ as follows: First, I randomly grab a few pairs of points and estimate the mean distance between a random pair of points (I often use the mean rather than the median). Second, I usually multiplicatively cut it down by some factor (maybe 2, 4, 8, … depending on the problem).
SK: You can read more about this at: https://people.eecs.berkeley.edu/~brecht/kitchensinks.html. Technically, you should really use sines and cosines (or a random phase). For this problem, just sines work fine.
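A minimal sketch of the feature construction, assuming numpy; `X_proj` is a placeholder for the PCA-projected training data, and the bandwidth uses the mean of a few random pairwise distances, cut down by a factor of 2 as suggested above. Since the full $N \times k$ feature matrix is far too large to store, the map is written so it can be applied to one mini-batch at a time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the (N, 50) PCA-projected training data.
X_proj = rng.random((1000, 50))
N, d = X_proj.shape
k = 60000

# Bandwidth: mean distance between a few random pairs, then cut down by a factor of 2.
idx_a = rng.integers(0, N, size=500)
idx_b = rng.integers(0, N, size=500)
sigma = np.linalg.norm(X_proj[idx_a] - X_proj[idx_b], axis=1).mean() / 2.0

# Random directions: every coordinate sampled i.i.d. from a standard normal.
V = rng.standard_normal((d, k))

def fourier_features(x_batch):
    """Map an (m, d) batch to (m, k) features h_j(x) = sin(v_j . x / sigma)."""
    return np.sin(x_batch @ V / sigma)

H = fourier_features(X_proj[:10])   # computed on the fly, one mini-batch at a time
```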
Tips (and some of my own practices)
SK: For this problem, I found cutting the mean down by a factor of 2 works well (for the bandwidth). I used
a batch size of 10, and I get under 2% in one pass through the dataset.
Let’s optimize with SGD (there are not many other options out there for problems of this size).
Stochastic optimization is still a bit of an art on large problems. Practitioners tend to have their own best
practices, and the exercises are designed to give you experience developing your own. Here are some tips,
broadly speaking (not necessarily in the context of this problem):
• (memory) As the dimension is large in this problem, we seek to avoid explicitly computing and storing
the full data matrix, which is of size k × N = N × N. Instead, you can compute the feature vector
h(x) ’on the fly’, i.e. you recompute the vector h(x) whenever you access image x.
• (regularization) As the dimension is N, note that without regularization and with exact optimization, you would have 0 square loss (and horribly overfit). However (at least for the square loss case), as you are using SGD, I would consider starting without regularization, as SGD implicitly regularizes (though this is your decision, and if you keep optimizing for a long time, then you should eventually overfit). You can decide if you need to add any regularization (this could be more of an issue for the logistic case due to the possibility of large weights; be aware of this).
• (mini-batching) Mini-batching helps. I’d consider doing it. Typically, there are diminishing returns in
terms of how much it helps.
• (learning rates, square loss case) With the square loss, I like to set my learning rates pretty large. I actually start by finding where things diverge (for the square loss, it really does diverge; for logistic, weird things happen). Then I usually just cut this down roughly by a factor of 2 and use that for the remainder of the time. In fact, for the convex case, I then average my parameters (in a way I discuss later in the problem). With averaging, I rarely ever turn down the learning rates.
• (learning rates, logistic loss case) I would consider something similar to the previous tips for the square
loss scheme. Some concerns: 1) the weights for logistic tend to become larger than for the square
loss case; it is worth deciding if you need to add regularization. 2) If you are over-parameterizing by
using 10 weight vectors (rather than 9) — this was a discussion board question, where I commented
that this is usually ’ok’ — you might want to check this is not causing you numerical problems (think
about why numerical problems might arise due to the over-parameterization). 3) mini-batching helps
to make logistic more stable.
Part of the difficulty is that the schemes specified by the theory guarantee convergence, though they are somewhat pessimistic in practice. I'll make a discussion board post with more on this point.
2.1 Least Squares [45 points]
Our goal is good classification accuracy, i.e. a low 0/1 loss, on the test set; we will optimize the square loss as a surrogate.
In practice, instead of complete randomization, we often run in epochs, where we randomly permute our dataset and then pass through our dataset in this permuted order. Each time we start another epoch, we permute our dataset again. Note that if you mini-batch with m samples, then one epoch will consist of N/m updates.
Let us plot the losses of two quantities. Keep around your parameter vector $w_t$ (your weight vector at iteration $t$). Also track the average weight vector $\bar{w}_\tau$ over the last epoch, i.e. $\bar{w}_\tau$ is the average parameter vector over the $\tau$-th epoch.
Define the total square loss as the sum of the square losses over the 10 classes (so you do not divide by 10 here).
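A minimal sketch of one way to implement this loop, assuming numpy; all sizes are shrunk placeholders so the snippet runs standalone, and the learning rate, batch size, and feature map are stand-ins for your own choices. It permutes the data each epoch, recomputes the Fourier features per mini-batch, takes SGD steps on the square loss, and tracks both the current weights and their average over the epoch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shrunk placeholders; use the real PCA-projected data, labels, and k = 60,000.
N, d, k, C = 1000, 50, 2000, 10
X_proj = rng.random((N, d))
Y = np.eye(C)[rng.integers(0, C, size=N)]      # one-hot targets, shape (N, 10)
V = rng.standard_normal((d, k))
sigma = 1.0
feats = lambda xb: np.sin(xb @ V / sigma)      # random Fourier features, per batch

W = np.zeros((k, C))                           # one weight vector per class
lr, m, n_epochs = 0.05, 10, 2

for epoch in range(n_epochs):
    perm = rng.permutation(N)                  # fresh random order every epoch
    W_sum, n_updates = np.zeros_like(W), 0
    for start in range(0, N, m):
        batch = perm[start:start + m]
        H = feats(X_proj[batch])               # (m, k), computed on the fly
        grad = H.T @ (H @ W - Y[batch]) / m    # gradient of the mini-batch square loss
        W -= lr * grad
        W_sum += W
        n_updates += 1
    W_avg = W_sum / n_updates                  # averaged iterate for this epoch
    # evaluate and record train/test square loss and 0/1 loss for both W and W_avg here
```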
1. (5 points) Specify your parameter choices: your learning rate (or learning rate scheme), mini-batch
size, and the kernel bandwidth.
2. (10 points) You should have one plot showing the squared loss after every epoch (starting with your initial squared error). Please label your axes in a readable way. Plot the loss of both $w_t$ and the average weight vector $\bar{w}_\tau$. You should have both the training and test losses on the same plot (so there should be four curves).
3. (10 points) For the 0/1 loss, do the same, except start your plots when the 0/1 loss is below 5% so
they are more readable. Again, there should be four curves.
4. (5 points) What is your final training squared loss and 0/1 loss? What is the total number of mistakes made on the training set by your final point?
5. (15 points) What is your final test squared loss and 0/1 loss? What is the total number of mistakes made on the test set by your final point?
2.2 EXTRA CREDIT: Softmax Classification [20 points]
Now let us address the same problem with the softmax classifier, with the log loss. Partial credit will be
given only if you attempt all parts of this problem. Also, use the random Fourier features for this problem.
Use your same choice of the kernel bandwidth as before.
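If you attempt this, the only change from the square-loss loop sketched in 2.1 is the gradient. A minimal sketch of a numerically stable softmax and the mini-batch log-loss gradient, assuming numpy and the same one-hot labels and feature map as before (shapes below are placeholders):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def logloss_grad(H, W, Y_batch):
    """Gradient of the average log loss on a mini-batch.

    H: (m, k) features, W: (k, 10) weights, Y_batch: (m, 10) one-hot labels.
    """
    P = softmax(H @ W)                          # predicted class probabilities
    return H.T @ (P - Y_batch) / H.shape[0]

# One SGD step with placeholder shapes:
rng = np.random.default_rng(0)
H = rng.random((10, 200))
W = np.zeros((200, 10))
Y_batch = np.eye(10)[rng.integers(0, 10, size=10)]
W -= 0.05 * logloss_grad(H, W, Y_batch)
```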
1. (1 point) Specify your parameter choices: your learning rate (or learning rate scheme) and mini-batch
size.
2. (4 points) You should have one plot showing the log loss after every epoch (starting with your initial log loss). Please label your axes in a readable way. Plot the loss of both $w_t$ and the average weight vector $\bar{w}_\tau$. You should have both the training and test losses on the same plot (so there should be four curves).
3. (4 points) For the 0/1 loss, do the same, except start your plots when the 0/1 loss is below 5% so they
are more readable. Again, there should be four curves.
4. (3 points) What is your final training log loss and 0/1 loss? What is the total number of mistakes made on the training set by your final point?
5. (8 points) What is your final test log loss and 0/1 loss? What is the total number of mistakes made on the test set by your final point?
If you find it helpful in terms of speed (for plotting purposes), you are free to evaluate the training loss on a
random 10K sized subset of the training set (keep this random set fixed). You must train on the full training
set though. And, for the final reported losses, please report this on the full training set.
2.3 EXTRA CREDIT: Back to those random Neural Net Features [10 points]
Now let us use the same random features as in HW2, except now we will use k = 60000. Let us compare how well we do with these features versus the Fourier features (we will just use the square loss in this case).
Partial credit will be given only if you attempt all parts of this problem.
1. (0 points) Specify your parameter choices: your learning rate (or learning rate scheme), mini-batch
size.
2. (3 points) You should have one plot showing the squared loss after every epoch (starting with your initial squared error). Please label your axes in a readable way. Plot the loss of both $w_t$ and the average weight vector $\bar{w}_\tau$. You should have both the training and test losses on the same plot (so there should be four curves).
3. (3 points) For the 0/1 loss, do the same, except start your plots when the 0/1 loss is below 5% so they
are more readable. Again, there should be four curves.
4. (1 point) What is your final training squared loss and 0/1 loss? What is the total number of mistakes made on the training set by your final point?
5. (3 points) What is your final test squared loss and 0/1 loss? What is the total number of mistakes made on the test set by your final point?
3 SVMs: Hinge loss and mistake bounds [6 Points]
This question is postponed to HW3.
Suppose we classify with the hypothesis $h_w(x) = \mathrm{sgn}(w^\top x)$, where $\mathrm{sgn}(z) = 1$ if $z$ is positive and $-1$ otherwise. Here we are dealing with binary prediction where each label is in $\{-1, 1\}$. Let us say we make a mistake on $(x, y)$ with $w$ if $y \neq \mathrm{sgn}(w^\top x)$. The number of mistakes with $w$ on a dataset is the classification loss.
The SVM loss function can be viewed as a relaxation of the classification loss. The hinge loss on a pair $(x, y)$ is defined as:
$$\ell((x, y), w) = \max\{0, 1 - y\, w^\top x\}$$
where $x \in \mathbb{R}^d$ and $y \in \{-1, 1\}$. The SVM attempts to minimize:
$$\frac{1}{n} \sum_{i=1}^{n} \max\{0, 1 - y_i w^\top x_i\} + \lambda \|w\|^2$$
where $\lambda$ is a regularizer (there are different conventions as to how we write the regularizer, e.g. sometimes we write $\frac{\lambda}{n}$ instead of just $\lambda$).
The hinge loss provides a way of dealing with datasets that are not separable. Let us go through the argument
as to why we view this as a relaxation of finding the max margin classifier.
1. (2 Points) Argue that the function $\ell((x, y), w) = \max\{0, 1 - y\, w^\top x\}$ is convex (as a function of $w$).
2. (2 Points) (Margin mistakes) Suppose that for some $w$ we have a correct prediction of $y_i$ with $x_i$, i.e. $y_i = \mathrm{sgn}(w^\top x_i)$. What range of values can the hinge loss, $\ell((x_i, y_i), w)$, take on this correctly classified example? Points which are classified correctly and which have non-zero hinge loss are referred to as margin mistakes.
3. (2 Points) (Mistake bound) Let $M(w)$ be the number of mistakes made by $w$ on our dataset (in terms of the classification loss). Show that:
$$\frac{1}{n} M(w) \leq \frac{1}{n} \sum_{i=1}^{n} \max\{0, 1 - y_i w^\top x_i\}$$
In other words, the average hinge loss on our dataset is an upper bound on the average number of
mistakes we make on our dataset.
4 Fitting an SVM classifier by hand [9 Points]
(Source: Murphy text, Exercise 14.1) Consider a dataset with 2 points in 1d:
$(x_1 = 0, y_1 = -1)$ and $(x_2 = \sqrt{2}, y_2 = 1)$.
Consider mapping each point to 3d using the feature vector $\phi(x) = [1, \sqrt{2}\,x, x^2]^\top$ (this is equivalent to using a second order polynomial kernel).
When we examined the perceptron algorithm in class, we related the number of mistakes to the margin of the
data. If our data are linearly separable, then a natural method is to try to find a classifier which maximizes
the margin. Here, the max margin classifier has the form
$$\hat{w}, \hat{w}_0 = \arg\min \|w\|^2 \quad \text{s.t.} \quad (1)$$
$$y_1(w^\top \phi(x_1) + w_0) \geq 1 \quad (2)$$
$$y_2(w^\top \phi(x_2) + w_0) \geq 1 \quad (3)$$
1. (1 Point) Write down a vector that is parallel to the optimal vector $\hat{w}$. Hint: Recall from Figure 14.12 (page 500 in the Murphy text) that $\hat{w}$ is perpendicular to the decision boundary between the two points in the 3d feature space.
2. (2 Points) What is the value of the margin that is achieved by this $\hat{w}$? Hint: Recall that the margin is the distance from each support vector to the decision boundary. Hint 2: Think about the geometry of the points in feature space, and the vector between them.
3. (2 Points) Solve for $\hat{w}$, using the fact that the margin is equal to $1/\|\hat{w}\|$.
4. (2 Points) Solve for $\hat{w}_0$ using your value for $\hat{w}$ and Equations 1 to 3. Hint: The points will be on the decision boundary, so the inequalities will be tight. A “tight inequality” is an inequality that is as strict as possible. For this problem, this means that plugging in these points will push the left-hand side of Equations 2 and 3 as close to 1 as possible.
5. (2 Points) Write down the form of the discriminant function $f(x) = \hat{w}_0 + \hat{w}^\top \phi(x)$ as an explicit function of $x$. Plot the 2 points in the dataset, along with $f(x)$, in a 2d plot. You may generate this plot by hand, or using a computational tool like Python.
5 K-Means [20 points]
Let us now try out clustering on MNIST. You may cluster with either the projected 50-dimensional dataset or with the full dataset. (Please specify which you used, and make sure to project back to “image” space for visualization if you worked in the lower dimensional space.)
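Both parts below ask you to run K-means, so here is a minimal sketch of the Lloyd iteration, assuming numpy; the data matrix, the random-data-point initialization, and the fixed iteration count are placeholders, and the initialization choice is yours to describe:

```python
import numpy as np

def kmeans(X, K, n_iters=50, seed=0):
    """Plain Lloyd's algorithm: returns centers, assignments, and the squared
    reconstruction error recorded at each iteration."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(X.shape[0], size=K, replace=False)].copy()
    errors = []
    for _ in range(n_iters):
        # Squared distance from every point to every center, shape (n, K).
        d2 = ((X ** 2).sum(axis=1, keepdims=True)
              - 2.0 * X @ centers.T
              + (centers ** 2).sum(axis=1))
        assign = d2.argmin(axis=1)
        errors.append(d2[np.arange(X.shape[0]), assign].sum())
        for j in range(K):
            pts = X[assign == j]
            if len(pts) > 0:                   # leave an empty cluster's center in place
                centers[j] = pts.mean(axis=0)
    return centers, assign, errors

# Placeholder data; substitute the MNIST images or their 50-dimensional PCA projection.
X = np.random.default_rng(1).random((500, 50))
centers, assign, errors = kmeans(X, K=16)
```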
5.1 Run the algorithm [12 points]
1. (5 points) Run the K-means algorithm with K = 16. Describe your initialization procedure. Plot the following:
(a) The squared reconstruction error vs iteration number.
(b) Let us say that the number of assignments for a mean is the number of points assigned to that
mean. Plot the number of assignments for each center in descending order.
(c) Visualize the 16 centers that you learned, and display them in an order that corresponds to how frequently they were assigned (if you use a grid, just describe the ordering).
2. (1 point) Provide a brief interpretation for the K = 16 case.
3. (6 points) Now let's run the K-means algorithm with K = 250 (also keep track of how often each center is assigned). Again, plot the following:
(a) The squared reconstruction error vs iteration number.
(b) Let us say that the number of assignments for a mean is the number of points assigned to that
mean. Plot the number of assignments for each center in descending order.
(c) Visualize 16 of these centers, chosen randomly. Display them in an order that corresponds to how frequently they were assigned.
5.2 Classification with K-means [8 points]
We often desire that unsupervised learning helps to discover representations of our data that will be helpful
for downstream tasks of interest, such as classification. The extent to which this works in practice varies
widely. Let’s see how well it works in the context of this problem.
The training algorithm is simple. You will simply find a corresponding label for each center based on the
most frequent digit assigned to it. For classification, you will just find the closest center and then use the
corresponding label of this center.
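A minimal sketch of the labeling and prediction steps described above, assuming numpy; the centers, training assignments, training labels, and test points are placeholders standing in for the outputs of your K-means run:

```python
import numpy as np

# Placeholders; use your real K-means outputs and the MNIST labels.
rng = np.random.default_rng(0)
K, d = 16, 50
centers = rng.random((K, d))
train_assign = rng.integers(0, K, size=500)
train_labels = rng.integers(0, 10, size=500)
X_test = rng.random((200, d))

# Label each center by the most frequent digit among the points assigned to it.
center_labels = np.zeros(K, dtype=int)
for j in range(K):
    digits = train_labels[train_assign == j]
    if len(digits) > 0:
        center_labels[j] = np.bincount(digits, minlength=10).argmax()

# Classify each point by the label of its closest center.
d2 = ((X_test ** 2).sum(axis=1, keepdims=True)
      - 2.0 * X_test @ centers.T
      + (centers ** 2).sum(axis=1))
pred = center_labels[d2.argmin(axis=1)]        # compare with true labels for the 0/1 loss
```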
1. (4 points) For K = 16, what are your training and test 0/1 losses?
2. (4 points) For K = 250, what are your training and test 0/1 losses?
5.3 Your thoughts? [0 points]
(no answer needed) Do you have thoughts on how “unsupervised methods” may bring us to state of the art? Broadly speaking, this is one of the major challenges in ML: how we can use unlabeled data to reduce the burden of labeled data collection (which is often time consuming and costly).