## Description

Policies

Coding: You must write your own code. You may use any numerical linear algebra package, but you may not use machine learning libraries (e.g. sklearn, tensorflow) unless otherwise specified. In addition, each student must write and submit their own code for the programming part of the assignment (we may run your code).

Acknowledgments: We expect you to make an honest effort to solve the problems individually. As we sometimes reuse problem set questions from previous years, which may be covered by papers and webpages, we expect students not to copy, refer to, or look at the solutions in preparing their answers (referring to unauthorized material is considered a violation of the honor code). Similarly, we expect you not to google directly for answers (though you are free to google for knowledge about the topic). If you do happen to use other material, it must be acknowledged, with a citation in the submitted solution.

Readings

Read the required material.

Required HW submission format:

• Submit all the answers to the HW as one single typed PDF document (not handwritten). This document must contain all plots. You are encouraged to typeset all your work in LaTeX, though you may use another comparable typesetting method.

• Additionally, submit your code as a separate archive file (a .tar or .zip file). All code must be submitted. The code should be in a runnable file format like .py or .txt files. Jupyter notebooks are not acceptable.

• List the names of all people you collaborated with and for which question(s). Please mark this as Question 0. Each student must understand, write, and hand in their own answers. Also, each student must understand, write, and hand in their own code.


0 Collaboration and Acknowledgements

List the names of all people you collaborated with and for which question(s). Each student must understand, write, and hand in their own answers. Also, each student must understand, write, and hand in their own code.

1 Programming Question: Multi-Class Classification using Least Squares [20 Points]

Obtain the MNIST dataset from http://yann.lecun.com/exdb/mnist/.

In HW1, we built a binary classifier. Let us now build a multi-class classifier using regression.

1.1 One vs all classification

Let us do multi-class classification by solving 10 binary classification problems, which is sometimes referred to as “one against all”. For the k-th binary classification problem, you will do a linear regression where you label a digit as Y = 1 if and only if the label for this digit is k (for k = 0, 1, 2, . . . , 9). For classification, you will then take the largest predicted score among your 10 predictors.
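As a concrete illustration, the labeling and prediction scheme above can be sketched in NumPy (permitted, since any numerical linear algebra package may be used). The function names and the ridge term `reg` are illustrative choices, not prescribed by the assignment:

```python
import numpy as np

def fit_one_class(X, y, k, reg=1e-4):
    """Least-squares predictor for the k-th binary problem: the target is 1
    if and only if the label equals k. `reg` is an illustrative ridge term."""
    t = (y == k).astype(float)                  # Y = 1 iff the digit's label is k
    d = X.shape[1]
    A = X.T @ X + reg * np.eye(d)               # regularized normal equations
    return np.linalg.solve(A, X.T @ t)          # weight vector w_k, shape (d,)

def predict_one_vs_all(X, weights):
    """Classify by the largest predicted score among the 10 predictors."""
    scores = X @ np.column_stack(weights)       # (n, 10) matrix of scores
    return np.argmax(scores, axis=1)
```

Here `weights` is the list of the 10 fitted vectors, one per class.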

1. (2 points) Naively, suppose you solve each classification problem separately, by doing k linear regressions (in d dimensions with n points). What is the total computational complexity of computing the solution for one of these linear regression problems, in terms of n and d? Take into account the construction time of computing all matrices and vectors used in computing the (exact) least squares solution. (Assume that matrix inversion of a d × d matrix takes time d^3, even though theoretically faster runtimes are known.) Note that if you solve all k problems in this manner, your runtime will be a factor of k larger than the time it takes to solve one of these regression problems.

2. (3 points) Provide an improved runtime for solving all k problems jointly. You should improve upon the naive algorithm of solving the k problems separately (i.e. you should be better than k times the complexity you found in the previous problem). Again, what is the total computational complexity of computing all the solutions, in terms of k, n, and d? Again, take into account the construction time of computing all matrices and vectors used in computing the (exact) least squares solution.
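One way to see the potential speedup is that the Gram matrix and its factorization can be shared across the k problems; only the right-hand sides differ. A minimal NumPy sketch (the ridge term and function name are illustrative):

```python
import numpy as np

def fit_all_classes_jointly(X, y, num_classes=10, reg=1e-4):
    """Solve all k one-vs-all least-squares problems with one shared factorization.

    X^T X is computed once (rather than k times), and one call to solve handles
    all k right-hand sides X^T Y. `reg` is an illustrative ridge term.
    """
    n, d = X.shape
    Y = np.zeros((n, num_classes))
    Y[np.arange(n), y] = 1.0                 # indicator targets for each class
    A = X.T @ X + reg * np.eye(d)            # shared d x d Gram matrix
    return np.linalg.solve(A, X.T @ Y)       # W: (d, num_classes), one column per class
```

Column k of the returned matrix coincides with the separate regression against the class-k indicator targets.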

3. (7 points) Run this multiclass algorithm on all 10 classes (with appropriate regularization if needed). On the training set, what is your 0/1 loss and what is your square loss? On the test set, what is your 0/1 loss and what is your square loss?

1.2 Neural Nets with a random first layer: Using more features [5 points]

Now let us try to make “better” features; we are not going to be particularly clever in the way we make these features, though they do provide non-trivial improvements. Let x be an image (as a vector in R^d). Now we will map each x to a k-dimensional feature vector as follows:

x \mapsto (h_1(x), h_2(x), \ldots, h_k(x))

We will use k = 10000 here (k is ten thousand). Each h_j will be of the form:

h_j(x) = \max\{v_j \cdot x, 0\}

where v_j ∈ R^d. To construct all these vectors v_1, v_2, . . . , v_k, you will independently sample every coordinate of every vector from a standard normal distribution. (We can think of this like a one-layer neural network where our first layer has random weights, and we are not changing them.)

In particular, the square loss in this case is:

L(w) = \frac{1}{N} \sum_{i=1}^{N} \Big( y^{(i)} - \sum_{j=1}^{k} w_j h_j(x^{(i)}) \Big)^2

where the sum is over the N points in our training set, {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), . . . , (x^{(N)}, y^{(N)})}, and where h_j(x) is the j-th (random) feature function we have made. Use appropriate regularization.

Keep these vectors v_1, . . . , v_k around, as you will use them for the next HW problem.
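Generating the random feature map can be sketched as below. The fixed `seed` is only there so that the same vectors v_1, . . . , v_k can be regenerated for the next HW problem; the function name and defaults are illustrative:

```python
import numpy as np

def random_relu_features(X, k=10000, seed=0):
    """Map each row x of X to (h_1(x), ..., h_k(x)) with h_j(x) = max(v_j . x, 0),
    where every coordinate of every v_j is i.i.d. standard normal."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    V = rng.standard_normal((d, k))          # columns are v_1, ..., v_k
    return np.maximum(X @ V, 0.0)            # (n, k) feature matrix
```

With the seed held fixed, calling this again on new data reuses exactly the same random vectors.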

1. (8 points) Once you have these k-dimensional features, run our multiclass algorithm on all 10 classes (with appropriate regularization if needed). On the training set, what is your 0/1 loss and what is your square loss? On the test set, what is your 0/1 loss and what is your square loss?

2 Programming Question: Multi-Class Classification using Logistic Regression and Softmax [50 Points]

We now turn to logistic regression and the softmax classifier for MNIST.

Throughout this question use appropriate regularization as needed.

2.1 Binary Logistic Regression [10 Points]

Let us again consider the classification problem of recognizing whether a digit is a 2 or not, except now let us use logistic regression. Here, let Y = 1 for all the 2’s in the training set, and use Y = 0 for all other digits. Build a logistic classifier by minimizing the log loss. Use appropriate regularization as needed. Also please use an (unregularized) offset weight w_0. For optimization, use batch gradient descent. Run until you feel you are near convergence.
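A minimal sketch of such a classifier in NumPy is below, assuming an (n, d) data matrix `X` and 0/1 labels `y`. The learning rate, ridge strength, and iteration count shown are illustrative placeholders, not recommended values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))  # clip guards against overflow

def logistic_gd(X, y, lr=0.5, reg=1e-4, iters=500):
    """Batch gradient descent on the (ridge-)regularized log loss, with an
    unregularized offset w0. lr, reg, and iters are illustrative choices."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(iters):
        p = sigmoid(X @ w + w0)              # predicted Pr(Y = 1 | x)
        err = p - y                          # residual; drives the gradient
        w -= lr * (X.T @ err / n + reg * w)  # the weights are regularized ...
        w0 -= lr * err.mean()                # ... but the offset is not
    return w, w0
```

Thresholding `sigmoid(X @ w + w0)` at 0.5 then gives the 0/1 predictions asked for below.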

1. (2 points) What learning rate did you use?

2. (4 points) For both the training set and the test set, plot the log loss as a function of the iteration number (and show both curves on the same plot). Note that you are only optimizing on the training set.

3. (4 points) For the 0/1 loss, use a threshold of 0.5 for classifying whether the digit is a 2 (based on the training set). On the training set and test set, what are your final log losses and your 0/1 losses?

2.2 Softmax classification: gradient descent [15 Points]

Let us do multi-class classification. Rather than doing “one against all”, you will use the softmax classifier. Here, Y takes values in the set {0, . . . , k − 1}. The model is as follows: we have k − 1 weight vectors w^{(1)}, w^{(2)}, . . . , w^{(k−1)}. For Y = \ell with \ell \neq 0,

\Pr(Y = \ell \mid x, w) = \frac{\exp(w^{(\ell)} \cdot x)}{1 + \sum_{i=1}^{k-1} \exp(w^{(i)} \cdot x)}

and for Y = 0,

\Pr(Y = 0 \mid x, w) = \frac{1}{1 + \sum_{i=1}^{k-1} \exp(w^{(i)} \cdot x)}.

Note that the probabilities sum to 1.
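These two cases can be computed together by treating class 0 as having a fixed score of 0, since exp(0) = 1 supplies the 1 in the denominator. A NumPy sketch (names are illustrative):

```python
import numpy as np

def softmax_probs(W, X):
    """Class probabilities for the k-class model above, with w^(0) fixed to 0.

    W: (k-1, d) stacked weight vectors w^(1), ..., w^(k-1); X: (n, d).
    Returns an (n, k) matrix whose rows sum to 1.
    """
    scores = X @ W.T                                         # (n, k-1): w^(l) . x
    scores = np.hstack([np.zeros((X.shape[0], 1)), scores])  # class 0 scores 0
    scores -= scores.max(axis=1, keepdims=True)              # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)
```

Subtracting the row-wise maximum before exponentiating leaves the probabilities unchanged but avoids overflow.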

For classification, you will then take the largest predicted score among your 10 predictors. Let us optimize the log loss with batch gradient descent.

The (negative) log-likelihood function on an N-sample training set is:

L(w) = -\frac{1}{N} \sum_{i=1}^{N} \log \Pr(Y = y^{(i)} \mid x^{(i)}, w)

where the sum is over the N points in our training set.

For classification, one can choose the class with the largest predicted probability.
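The two quantities you will track, the log loss above and the 0/1 loss, can be evaluated as follows (a sketch under the same convention that w^{(0)} is fixed to the zero vector; names are illustrative):

```python
import numpy as np

def softmax_log_loss(W, X, y):
    """Average negative log-likelihood on a labeled set, under the model above.
    W: (k-1, d) weight vectors; X: (n, d); y: (n,) labels in {0, ..., k-1}."""
    scores = np.hstack([np.zeros((X.shape[0], 1)), X @ W.T])  # class 0 scores 0
    scores -= scores.max(axis=1, keepdims=True)               # numerical stability
    logp = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

def zero_one_loss(W, X, y):
    """Fraction of points where the most-probable class disagrees with the label."""
    scores = np.hstack([np.zeros((X.shape[0], 1)), X @ W.T])
    return float((scores.argmax(axis=1) != y).mean())
```

With all weights zero and k = 10, the log loss is log 10, the entropy of a uniform guess, which is a useful sanity check at iteration 0.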

1. (4 points) Write out the derivative of the log-likelihood of the softmax function. Write this in a natural manner, only in terms of conditional probabilities, the vectors X, and indicator functions. (Do not write this expression in terms of exponentials.) You will use this gradient in subsequent parts of this problem.

2. (1 point) Implement batch gradient descent. What learning rate did you use?

3. (4 points) For both the training set and the test set, plot the log loss as a function of the iteration number (and show both curves on the same plot). Do this for the 0/1 loss as well.

4. (1 point) On the training set and test set, what are your final log losses? Your 0/1 losses?

2.3 Softmax classification: stochastic gradient descent [15 Points]

Let us now optimize with stochastic gradient descent (SGD). Use SGD both with one point at a time and with a mini-batch size of 100. You may turn down the learning rate if you desire (either by hand or according to some decay scheme), or you may choose a single learning rate. You should get a final predictor comparable to that in the previous question.

1. (4 points) Implement SGD with a mini-batch size of 1. After every 15,000 points you have passed through, for both the training set and the test set, plot the log loss as a function of the iteration number (and show both curves on the same plot). Do this for the 0/1 loss on the training and test set as well (showing both curves on the same plot). These plots should start at iteration 0. Your curves should eventually drop to performance comparable to that of batch gradient descent. How many iterations did this take you? Specify your learning rate (or specify your learning rate decay scheme if you alter the learning rate).

2. (1 point) Compare (to batch gradient descent) the total computational complexity it took you to reach a comparable 0/1 loss on your training set. Note that each iteration of batch gradient descent costs a factor of n more (where n is the number of data points).

3. (4 points) Implement SGD with a mini-batch size of 100. In other words, rather than estimating the gradient with one random sample, use 100 samples. Make identical plots of the log loss and the 0/1 loss as before. Your curves should eventually drop to performance comparable to that of batch gradient descent. How many iterations did this take you? Specify your learning rate (or your learning rate scheme if you adapted it).

4. (1 point) Compare the mini-batch algorithm to that of just using one point at a time in terms of total computational time.


5. (0 points) Aside: in many practical cases, computing a mini-batch gradient estimate with 100 points can be done efficiently with matrix operations (i.e. it is not 100 times slower than computing just one stochastic gradient with batch size 1).
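The sampling loop shared by both mini-batch sizes can be sketched generically; the gradient function is left as a parameter, since deriving it is part of Section 2.2. All hyperparameter values shown are illustrative:

```python
import numpy as np

def sgd(grad_fn, w0, X, y, lr=0.01, batch_size=100, epochs=5, seed=0):
    """Generic mini-batch SGD loop. grad_fn(w, Xb, yb) is whatever gradient
    estimate you derived; batch_size=1 gives one-point SGD."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                # pass through the data in random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * grad_fn(w, X[idx], y[idx])  # one stochastic step
    return w
```

Because `X[idx]` is a matrix, the 100-point gradient estimate is computed with one matrix operation, which is the efficiency point made in the aside above.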

2.4 Neural Nets with a random first layer: Using more features [10 points]

Let us now use the features we made in Section 1.2.

1. (5 points) Run your favorite algorithm from the previous experiments (for the softmax classifier) on our k-dimensional feature vectors (with appropriate regularization if needed). Show reasonable plots of how your log loss and 0/1 loss evolve on both your training and test sets (as you did above). Specify what algorithm you used and your learning rate scheme.

2. (5 points) On the training set, what is your 0/1 loss and what is your log loss? On the test set, what is your 0/1 loss and what is your log loss?

3 (Baby) Learning Theory: Why is statistical learning possible? [22 points]

Suppose now that you have a finite set of K classifiers. This set of classifiers is sometimes referred to as the hypothesis space. We will now ask how many samples you need so that your classification error rate is close to that of the best classifier in this set.

More broadly, as we get more data, we would expect that we can utilize a richer hypothesis class. Specifically, we expect that with more data we can fit a more complex model. In this question, we seek to quantify this relationship.

3.1 A confidence interval for a coin [2 points]

A simplified version of the Chernoff-Hoeffding bound is as follows: suppose that Z_1, Z_2, . . . , Z_N are i.i.d. Bernoulli(θ) binary random variables. In other words, \Pr(Z = 1) = θ and \Pr(Z = 0) = 1 − θ. In this setting, it is common to use the terminology that θ is the bias of the coin. Let \bar{Z} be the empirical average, i.e. \bar{Z} = \frac{1}{N} \sum_{i=1}^{N} Z_i. Then for any \epsilon > 0,

\Pr(|\bar{Z} - \theta| \geq \epsilon) \leq 2 \exp(-2 N \epsilon^2)

1. (2 points) Provide a confidence interval for θ that holds with probability greater than 1 − δ.
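You can sanity-check the stated bound numerically before using it. The sketch below simulates many repetitions of N coin flips and compares the observed deviation frequency to 2 exp(−2Nε²); all parameter values are illustrative:

```python
import numpy as np

def hoeffding_violation_rate(theta=0.3, N=100, eps=0.1, trials=20000, seed=0):
    """Empirically check that Pr(|Zbar - theta| >= eps) <= 2*exp(-2*N*eps^2).
    Returns (observed frequency, Hoeffding bound)."""
    rng = np.random.default_rng(seed)
    Z = rng.random((trials, N)) < theta        # i.i.d. Bernoulli(theta) flips
    Zbar = Z.mean(axis=1)                      # empirical average per trial
    rate = float(np.mean(np.abs(Zbar - theta) >= eps))
    return rate, 2 * np.exp(-2 * N * eps**2)
```

The observed frequency is typically far below the bound, which is the looseness that the aside in Section 3.4 revisits.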

3.2 Estimating the performance of a given classifier [2 points]

Suppose we have a distribution over pairs (X, Y) where Y is a binary random variable in {0, 1}. Here, X lives in an arbitrary set \mathcal{X} (X need not be a vector). The 0/1 loss of a classifier f : \mathcal{X} \to \{0, 1\} is defined as:

L(f) = \mathbb{E}_{X,Y}[\mathbf{1}(f(X) \neq Y)]

where \mathbf{1}(E) is the indicator function for the event E (the function takes the value 1 if E occurs and 0 otherwise). The expectation is with respect to the underlying distribution on (X, Y).


1. (2 points) Suppose you have a given classifier f and you seek to estimate the loss of f. You then obtain a set of samples {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)}. Define an estimator \hat{L}(f) for the loss of this classifier, and also provide a confidence interval for the loss L(f) which holds with probability greater than 1 − δ. In other words, we are seeking a bound on |L(f) − \hat{L}(f)| that holds with probability greater than 1 − δ.

3.3 ERM revisited [20 points]

Let us turn to the standard supervised learning setting. We are provided with an N-sample training set T = {(x_1, y_1), (x_2, y_2), . . . , (x_N, y_N)}, sampled i.i.d. from this underlying distribution. Using this training set, we are interested in finding a classifier \hat{f} which minimizes the 0/1 loss. Importantly, note that the function \hat{f} depends on T (where T is randomly sampled).

Suppose we have a finite set of classifiers F = {f_1, f_2, . . . , f_K} (e.g. classifiers based on halfspaces, the class of neural network classifiers, the class of decision trees, . . . ). We often refer to F as the hypothesis space. Also, suppose the classifier we choose minimizes the loss on the training set, i.e. it is the empirical risk minimizer (ERM). Specifically, our classifier is defined as follows:

\hat{f} = \arg\min_{f \in F} \hat{L}(f)

where

\hat{L}(f) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(f(x_i) \neq y_i)

(i.e. \hat{L}(f) is an empirical estimate of our loss).

The “best in class” function f^* is:

f^* = \arg\min_{f \in F} L(f).

Note that L(f^*) is the best loss we could hope to achieve using functions in F.

1. (1 point) Is it the case that the previous confidence interval (from the last question, in Section 3.2), when applied to the loss of f^*, holds with probability greater than 1 − δ? Why or why not? (Note that this question is not asking whether or not you know the quantities in the purported confidence interval; it is asking whether the confidence interval, when applied to f^*, leads to a valid probabilistic statement.)

2. (1 point) Is it the case that the previous confidence interval (from the last question, in Section 3.2), when applied to the loss of \hat{f}, holds with probability greater than 1 − δ? Why or why not?

3. (4 points) Provide a confidence interval that simultaneously holds for the losses of all f ∈ F, with probability of error δ. In other words, provide a value B so that:

\Pr(\text{for all } f \in F,\ |\hat{L}(f) - L(f)| \leq B) \geq 1 - \delta

Show your steps. Hint: for events E_1, E_2, . . . , E_m, the “union bound” states that \Pr(E_1 \text{ or } E_2 \text{ or } \cdots \text{ or } E_m) \leq \sum_{i=1}^{m} \Pr(E_i). (This bound is also sometimes referred to as the Bonferroni bound.)

4. (3 points) Provide a confidence interval for the loss of \hat{f} that holds with probability greater than 1 − δ. In other words, we are seeking a bound on |L(\hat{f}) − \hat{L}(\hat{f})| that holds with probability greater than 1 − δ.

5. (4 points) Provide a bound on how close your loss, using the ERM, is to the best possible loss (in the hypothesis space). Specifically, provide a value R so that the following holds with probability greater than 1 − δ:

L(\hat{f}) - L(f^*) \leq R

The quantity L(\hat{f}) − L(f^*) is often referred to as the regret, as it measures how close you are to the best possible. Here, R is our regret bound.


6. (5 points) Let us understand how large a hypothesis class we can utilize, i.e. how large can K be? Suppose we know we will be provided with a training set of size N, and, before we look at our training data, we choose a hypothesis class F as a function of the sample size N. As we get more data, we would expect that we can utilize a larger hypothesis class. Let us examine this more quantitatively. We can think of learning as being possible if our regret tends to 0 as N becomes large (with a probability of error less than δ). Let us determine if learning is possible in each of the following cases.

(a) (1 point) Suppose K is a constant. Are we able to learn, and, if so, what is our regret as a function of N, in order notation?

(b) (1 point) Let’s consider an even larger hypothesis class. Suppose K = N^p for some constant p. Are we able to learn, and, if so, what is our regret as a function of N, in order notation?

(c) (1 point) Let’s consider being more greedy in our choice of a hypothesis class. Suppose K = \exp(\sqrt{N}). Are we able to learn, and, if so, what is our regret as a function of N, in order notation?

(d) (2 points) Let’s consider being even more greedy in our choice of a hypothesis class. Suppose K = \exp(10N). Are we able to learn, and, if so, what is our regret as a function of N, in order notation?

Your answers to the above are one interpretation as to why learning complex models is possible (from a statistical perspective).
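As a numerical aside on the union-bound step, the sketch below treats K “classifiers” whose empirical losses are averages of N fair coin flips, and checks that the chance that any of them deviates by ε stays below K · 2 exp(−2Nε²). All parameter values are illustrative:

```python
import numpy as np

def union_bound_check(K=50, N=200, eps=0.15, trials=20000, seed=0):
    """Compare the frequency that ANY of K empirical means deviates by eps
    against K times the single-coin Hoeffding bound. Returns (rate, bound)."""
    rng = np.random.default_rng(seed)
    counts = rng.binomial(N, 0.5, size=(trials, K))   # N fair flips per classifier
    dev = np.abs(counts / N - 0.5)                    # deviation of each empirical mean
    any_bad = float((dev >= eps).any(axis=1).mean())  # Pr(some classifier deviates)
    return any_bad, K * 2 * np.exp(-2 * N * eps**2)
```

Even the union bound, which pays a factor of K, is comfortably loose here, consistent with the discussion of effective class size below.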

3.4 Aside: passing from the finite to the infinite… [0 points]

We have shown that we can learn among finite hypothesis classes that are very large (as a function of N). In practice, we often use infinite hypothesis classes (e.g. the class of halfspaces, decision trees, neural networks, etc.). Why can we learn even when our hypothesis class is infinite? Our previous arguments can help to provide intuition.

First, observe that our above bounds could be extremely loose. Suppose that F has K functions which are all identical. Our bounds are flagrantly loose in this case. Can you spot which step in our argument is sloppy? We really only have K = 1. More generally, even in cases where F is infinite, it could be the case that many functions in F are nearly identical and have nearly identical loss (e.g. similar weight vectors or parameters tend to yield classifiers with similar losses). When going from finite F to infinite F, much of learning theory seeks to find a way to precisely characterize the effective size of F.

Very crudely, for linear models, we can think of the effective size of F as being exponential in the dimension d (and so the sample size needed to learn is O(d)). If we regularize, this brings down the effective size of F. More generally, we can think of the effective size of F as being exponential in the VC dimension of F. One of the most general and powerful techniques is to provide a fine-grained “cover” for F, where this cover is: (i) finite and (ii) such that all f ∈ F are near to some function in the cover. We then argue that learning is possible by arguing that if we can learn on our fine-grained (finite) cover, then we can learn on the entire infinite class F.

4 SVMs: Hinge loss and mistake bounds [8 Points]

This question is postponed to HW3.
