# CSE546 Homework 1 solution


## 0 Policies

Writing: Please submit your HW as a typed pdf document (not handwritten). You are encouraged to LaTeX all your work, though you may use another comparable typesetting method.

Coding: You must write your own code. You may use any numerical linear algebra package, but you may not use machine learning libraries (e.g. sklearn, tensorflow) unless otherwise specified.

Collaboration: List the names of all people you collaborated with and for which question(s). Each student must understand, write, and hand in their own answers. In addition, each student must write and submit their own code in the programming part of the assignment (we may run your code).

Acknowledgments: We expect you to make an honest effort to solve the problems individually. As we sometimes reuse problem set questions from previous years, covered by papers and webpages, we expect the students not to copy, refer to, or look at the solutions in preparing their answers (referring to unauthorized material is considered a violation of the honor code). Similarly, we expect you not to Google directly for answers (though you are free to Google for knowledge about the topic). If you do happen to use other material, it must be acknowledged here, with a citation on the submitted solution.
## 1 Probability [5 points]
Let X1 and X2 be independent, continuous random variables uniformly distributed on [0, 1]. Let X = max(X1, X2). Compute:

1. (1 point) E[X].
2. (2 points) Var(X).
3. (2 points) Cov(X, X1).
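A quick Monte Carlo simulation can serve as a sanity check on the closed-form answers you derive by hand (this is an illustrative sketch, not part of the required derivation):

```python
import numpy as np

# Monte Carlo estimates of the moments of X = max(X1, X2),
# with X1, X2 i.i.d. Uniform[0, 1].
rng = np.random.default_rng(0)
n = 200_000
x1 = rng.uniform(size=n)
x2 = rng.uniform(size=n)
x = np.maximum(x1, x2)

e_x = x.mean()                   # estimate of E[X]
var_x = x.var()                  # estimate of Var(X)
cov_x_x1 = np.cov(x, x1)[0, 1]   # estimate of Cov(X, X1)
print(e_x, var_x, cov_x_x1)
```

The three printed estimates should land close to the exact values obtained from the density of the maximum, f(x) = 2x on [0, 1].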
## 2 MLE [5 points]
This question uses a discrete probability distribution known as the Poisson distribution. A discrete random variable X follows a Poisson distribution with parameter λ if

$$\Pr(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}, \qquad k \in \{0, 1, 2, \ldots\}$$
Imagine we have a gumball machine which dispenses a random number of gumballs each time you insert
a quarter. Assume that the number of gumballs dispensed is Poisson distributed (i.i.d) with parameter λ.
Curious and sugar-deprived, you insert 8 quarters one at a time and record the number of gumballs dispensed
on each trial:
| Trial    | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----------|---|---|---|---|---|---|---|---|
| Gumballs | 4 | 1 | 3 | 5 | 5 | 1 | 3 | 8 |

Let G = (G1, . . . , Gn) be a random vector where Gi is the number of gumballs dispensed on trial i:
1. (2 points) Give the log-likelihood function of G given λ.
2. (2 points) Compute the MLE for λ in the general case.
3. (1 point) Compute the MLE for λ using the observed G.
## 3 Linear Regression and LOOCV [10 points]
In class you learned about using cross validation as a way to estimate the true error of a learning algorithm. A solution that provides an almost unbiased estimate of this true error is Leave-One-Out Cross Validation (LOOCV), but it can take a really long time to compute the LOOCV error. In this problem you will derive an algorithm for efficiently computing the LOOCV error for linear regression using the Hat Matrix.¹ (This is the cool trick alluded to in the slides!)

Assume that there are n training examples, (X1, y1), (X2, y2), . . . , (Xn, yn), where each input data point Xi has d real-valued features. The goal of regression is to learn to predict yi from Xi. The linear regression model assumes that the output y is a linear combination of the input features plus Gaussian noise, with weights given by w.
We can write this in matrix form by stacking the data points as the rows of a matrix X, so that $x^{(j)}_i$ is the i-th feature of the j-th data point. Then writing Y, w, and ε as column vectors, we can write the matrix form of the linear regression model as:

$$Y = Xw + \epsilon$$

where:

$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},\quad
X = \begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_d \\ x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_d \\ \vdots & \vdots & \ddots & \vdots \\ x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_d \end{bmatrix},\quad
w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix},\quad \text{and} \quad
\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
Assume that i
is normally distributed with variance σ
2
. We saw in class that the maximum likelihood
estimate of the model parameters w (which also happens to minimize the sum of squared prediction errors)
is given by the Normal equation:
wˆ = (XT X)
−1XT Y
Define Yˆ to be the vector of predictions using ˆw if we were to plug in the original training set X:
Yˆ = Xwˆ
= X(XT X)
−1XT Y
= HY
1Unfortunately, such an efficient algorithm may not be easily found for other learning methods.
2
where we define H = X(XT X)
−1XT
(H is often called the Hat Matrix ).
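On synthetic data, the defining property Ŷ = HY, together with H being a symmetric, idempotent projection of rank d, is easy to verify numerically (a minimal sketch, with made-up data):

```python
import numpy as np

# Verify hat-matrix identities: H maps observed responses Y directly
# to the fitted values X w_hat, and H is a projection matrix.
rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # normal equation
assert np.allclose(H @ Y, X @ w_hat)          # Y_hat = H Y
assert np.allclose(H, H.T)                    # symmetric
assert np.allclose(H @ H, H)                  # idempotent
print(np.trace(H))                            # trace equals d
```

The trace being d reflects that H projects onto the d-dimensional column space of X.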
As mentioned above, $\hat{w}$ also minimizes the sum of squared errors:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Now recall that the Leave-One-Out Cross Validation score is defined to be:

$$LOOCV = \sum_{i=1}^{n} \left(y_i - \hat{y}_i^{(-i)}\right)^2$$

where $\hat{Y}^{(-i)}$ is the estimator of Y after removing the i-th observation (i.e., it minimizes $\sum_{j \neq i} (y_j - \hat{y}_j^{(-i)})^2$).
1. (2 points) What is the time complexity of computing the LOOCV score naively? (The naive algorithm is to loop through each point, performing a regression on the n − 1 remaining points at each iteration.) Hint: The complexity of matrix inversion is O(k³) for a k × k matrix.²
2. (1 point) Write $\hat{y}_i$ in terms of H and Y.
3. (2 points) Show that $\hat{Y}^{(-i)}$ is also the estimator which minimizes SSE for Z, where

$$Z_j = \begin{cases} y_j, & j \neq i \\ \hat{y}_i^{(-i)}, & j = i \end{cases}$$
4. (1 point) Write $\hat{y}_i^{(-i)}$ in terms of H and Z. By definition, $\hat{y}_i^{(-i)} = Z_i$, but give an answer that is analogous to 2.
5. (2 points) Show that $\hat{y}_i - \hat{y}_i^{(-i)} = H_{ii} y_i - H_{ii} \hat{y}_i^{(-i)}$, where $H_{ii}$ denotes the i-th element along the diagonal of H.
6. (2 points) Show that

$$LOOCV = \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - H_{ii}} \right)^2$$

What is the algorithmic complexity of computing the LOOCV score using this formula?

Note: We see from this formula that the diagonal elements of H somehow indicate the impact that each particular observation has on the result of the regression.
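The shortcut is easy to check empirically: refit n times the naive way and compare against the single-fit formula (an illustrative sketch on random data, not a substitute for the proof the question asks for):

```python
import numpy as np

# Compare the naive LOOCV (n separate refits) with the closed-form
# sum of ((y_i - yhat_i) / (1 - H_ii))^2 from a single fit.
rng = np.random.default_rng(2)
n, d = 40, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
loocv_fast = np.sum(((y - y_hat) / (1 - np.diag(H))) ** 2)

loocv_naive = 0.0
for i in range(n):
    mask = np.arange(n) != i                 # drop observation i
    w_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    loocv_naive += (y[i] - X[i] @ w_i) ** 2

assert np.isclose(loocv_fast, loocv_naive)
print(loocv_fast)
```

The naive loop costs a full regression per point, while the shortcut reuses one fit of the whole training set.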
## 4 Regularization Constants [16 points]
We have discussed the importance of regularization as a technique to avoid overfitting our models. For linear
regression, we have mentioned both LASSO (which uses the L1 norm as a penalty), and ridge regression
(which uses the squared L2 norm as a penalty). In practice, the scaling factor of these penalties has a
significant impact on the behavior of these methods, and must often be chosen empirically for a particular
dataset. In this problem, we look at what happens when we choose our regularization factor poorly.
For the following, recall that the loss function to be optimized under ridge regression is

$$E_R = \sum_{i=1}^{n} \left(y_i - (\hat{w}_0 + x^{(i)}\hat{w})\right)^2 + \lambda \|\hat{w}\|_2^2$$

where

$$\lambda \|\hat{w}\|_2^2 = \lambda \sum_{i=1}^{d} (\hat{w}_i)^2 \qquad (1)$$

and λ is our regularization constant.

² There are faster algorithms out there, but for simplicity we'll assume that we are using the naive O(k³) algorithm.
The loss function to be optimized under LASSO regression is

$$E_L = \sum_{i=1}^{n} \left(y_i - (\hat{w}_0 + x^{(i)}\hat{w})\right)^2 + \lambda \|\hat{w}\|_1$$

where

$$\lambda \|\hat{w}\|_1 = \lambda \sum_{i=1}^{d} |\hat{w}_i|. \qquad (2)$$
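One way to build intuition for the questions below is to fit ridge in closed form at several values of λ and watch the weight vector shrink (a minimal sketch on synthetic data; the intercept is omitted for brevity, which is an assumption of this example):

```python
import numpy as np

# Closed-form ridge solutions for increasing lambda: the norm of the
# weight vector shrinks monotonically toward zero.
rng = np.random.default_rng(3)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

norms = []
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    norms.append(np.linalg.norm(w))

assert all(a >= b for a, b in zip(norms, norms[1:]))  # monotone shrinkage
print(norms)
```

LASSO behaves differently in one key respect: its L1 penalty can drive individual coordinates exactly to zero, which the ridge solutions above never do.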
1. (5 points) Discuss briefly how choosing too small a λ affects the magnitude of the following quantities.
Please describe the effects for both ridge and LASSO, or state why the effects will be the same.
(a) The error on the training set.
(b) The error on the testing set.
(c) The elements of ŵ.
(d) The number of nonzero elements of ŵ.
2. (5 points) Now discuss briefly how choosing too large a λ affects the magnitude of the same quantities
in the previous question. Again describe the effects for both ridge and LASSO, or state why the effects
will be the same.
## 5 Regression and Agnostic Learning [20 points]

### 5.1 Bayes Optimal Prediction
Suppose we have a distribution over pairs (X, Y) where Y is a scalar. Here, X lives in an arbitrary set $\mathcal{X}$ (X need not be a vector). The square loss for a function $f : \mathcal{X} \to \mathbb{R}$ can be defined as:

$$L(f) = E_{X,Y}\left[(Y - f(X))^2\right]$$

where the expectation is with respect to the underlying distribution on (X, Y).
1. (5 points) Show that:

$$L(f) = E_X\left[(E[Y|X] - f(X))^2\right] + E_{X,Y}\left[(Y - E[Y|X])^2\right]$$

where E[Y|X] is the conditional expectation of Y given X. Show your steps and take care in using conditional expectations.
2. (3 points) The Bayes optimal regressor $f_{Bayes}$ is the function which minimizes L(f) over the class of all functions. Provide a natural expression for $f_{Bayes}$, and write down $L(f_{Bayes})$.
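The decomposition in part 1 can be checked numerically when E[Y|X] is known by construction (a sketch with a made-up model, Y = X² + ε, chosen so that the conditional expectation is exactly X²):

```python
import numpy as np

# Monte Carlo check of L(f) = E_X[(E[Y|X] - f(X))^2] + E_{X,Y}[(Y - E[Y|X])^2]
# for the model Y = X^2 + eps, where E[Y|X] = X^2 exactly.
rng = np.random.default_rng(4)
n = 500_000
X = rng.uniform(-1, 1, size=n)
eps = rng.normal(scale=0.5, size=n)
Y = X ** 2 + eps

f = lambda x: x                              # an arbitrary, suboptimal regressor
lhs = np.mean((Y - f(X)) ** 2)               # estimate of L(f)
rhs = np.mean((X ** 2 - f(X)) ** 2) + np.mean((Y - X ** 2) ** 2)
print(lhs, rhs)                              # the two estimates nearly coincide
```

The cross term vanishes in expectation because the noise has conditional mean zero given X, which is exactly the step the proof needs to justify.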
### 5.2 Bias-Variance-Noise Error Decomposition
Given a training set T = {(x1, y1), (x2, y2), . . . , (xn, yn)} sampled from this underlying distribution, the problem of regression refers to estimating a function $\hat{f}$ which minimizes the square loss. Importantly, note that the function $\hat{f}$ depends on T (where T is randomly sampled).

Now suppose we have a class of functions F = {f} (e.g. the class of linear functions, the class of neural networks, the class of decision trees, . . .). Suppose that the estimator we provide, using T, lives in this class, i.e. $\hat{f} \in F$. For example, we could use the empirical risk minimizer (ERM), which is defined as:

$$\hat{f} = \arg\min_{f \in F} \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2$$

(i.e. the ERM minimizes the square loss on the training set). More generally, this problem considers any estimation procedure restricted to provide an estimate in F, i.e. $\hat{f} \in F$.

The "best in class" function $f^*$ is:

$$f^* = \arg\min_{f \in F} L(f).$$

Note that $L(f^*)$ is the best loss we could hope to achieve using functions in F.
1. (5 points) Let $\bar{f}(X) = E_T[\hat{f}(X)]$, where the expectation is taken with respect to the randomness in our training set, i.e. $\bar{f}(\cdot)$ is the expected function with respect to the randomness in our training set. Note that $E_T L(\hat{f})$ is the expected square loss of our estimation procedure, where the expectation is with respect to our training set.

Show that:

$$E_T L(\hat{f}) = E_T E_{X,Y}\left[(Y - \hat{f}(X))^2\right] = E_T E_X\left[(\bar{f}(X) - \hat{f}(X))^2\right] + E_X\left[(\bar{f}(X) - E[Y|X])^2\right] + E_{X,Y}\left[(Y - E[Y|X])^2\right]$$

(the first equality simply follows by definition). Note that the expectations in the above terms are with respect to different quantities. Again, show your steps and take care in manipulating conditional expectations.
2. (3 points) Interpret these terms as the variance, the bias, and the noise, respectively.
3. (1 point) For the ERM, is it the case that $f^* = \bar{f}$?
4. (3 points) Suppose that F is finite. For the ERM, in the limit of a large sample size N, what do you think $\bar{f}$ and $\hat{f}$ tend to? Why? (Hint: note that $\hat{f}$ is determined by the training set T, which is random. Formally, we are asking if $\hat{f}$ converges to some function, with probability that tends to 1, and, if so, what is this function?)
## 6 Programming Question 1: Classification using Least Squares [10 points]

Obtain the MNIST dataset from http://yann.lecun.com/exdb/mnist/.
### 6.1 Binary Classification with Regression
In binary classification, we restrict Y to take on only two values. Suppose Y ∈ {0, 1}. Now let us use least squares linear regression for classification. Let us consider the classification problem of recognizing if a digit is a 2 or not using linear regression. Here, let Y = 1 for all the 2's in the training set, and use Y = 0 for all other digits.
Build a linear classifier by minimizing:

$$\min_{w} \frac{1}{N} \sum_{i=1}^{N} (y_i - w \cdot x_i)^2 + \lambda \|w\|^2$$

using the training set {(x1, y1), (x2, y2), . . . , (xn, yn)}. Use appropriate regularization as needed. Let xi be a vector of the pixel values of the image.

For the purpose of classification, we can label a digit as a 2 if w · x is larger than some threshold.
1. (4 points) Based on your training set, choose a reasonable threshold for classification. What is your
0/1 loss on the training set? What is your square loss on the training set?
2. (4 points) On the test set, what is your 0/1 loss? What is your square loss?
3. (1 point) Why might linear regression be a poor idea for classification?
4. (1 point) Recall your answer for what the Bayes optimal predictor is for the square loss. What is the
Bayes optimal predictor using the square loss for classification? State your answer in a natural form,
in terms of a probability.
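The full pipeline can be sketched end to end on toy data. In this sketch, MNIST is swapped for two synthetic Gaussian blobs and a bias feature is appended so that a 0.5 threshold is meaningful for {0, 1} labels; both are assumptions made purely to keep the example self-contained.

```python
import numpy as np

# Least-squares classification on synthetic data standing in for MNIST.
rng = np.random.default_rng(5)
n, d = 400, 10
X0 = rng.normal(loc=-1.0, size=(n, d))      # "not a 2", label 0
X1 = rng.normal(loc=+1.0, size=(n, d))      # "a 2", label 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])

Xb = np.hstack([X, np.ones((2 * n, 1))])    # append a bias column
lam = 1e-3
w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(d + 1), Xb.T @ y)

threshold = 0.5                             # natural first guess for 0/1 labels
scores = Xb @ w
pred = (scores > threshold).astype(float)
zero_one_loss = np.mean(pred != y)
square_loss = np.mean((y - scores) ** 2)
print(zero_one_loss, square_loss)
```

On real MNIST you would choose the threshold by inspecting training-set scores rather than assuming 0.5.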
## 7 Programming Question 2: Lasso [40 points]
The Lasso is the problem of solving

$$\arg\min_{w, w_0} \sum_{i} (X_i w + w_0 - y_i)^2 + \lambda \sum_{j} |w_j| \qquad (3)$$

Here X is an N × d matrix of data, and $X_i$ is the i-th row of the matrix. y is an N × 1 vector of response variables, w is a d-dimensional weight vector, $w_0$ is a scalar offset term, and λ is a regularization tuning parameter. For the programming part of this homework, you are required to implement the coordinate descent method that can solve the Lasso problem.
This question contains three parts: 1) analyze and optimize the time complexity, 2) test your code using toy examples, and 3) try your code on a real-world dataset. The first part involves no programming, but is crucial for writing an efficient algorithm.
### 7.1 Time complexity and making your code fast [14 points]
In class, you derived the update rules for coordinate descent, which are shown in Algorithm 1 (including the update term for $w_0$). In this part of the question, we will analyze the algorithm and discuss how to make it fast. There are two key points: utilizing sparsity and caching your predictions. Assume we are using a sparse matrix representation, so that a dot product takes time proportional to the number of non-zero entries. In the following questions, your answers should take advantage of the sparsity of X when possible.

1. (1 point) Define $\hat{y}_i = X_i w + w_0$. Simplify the update rules for $w_0$ and the computation of $c_k$ in Algorithm 1 using $\hat{y}$. (Hint: there should no longer be a sum over j.)
2. (1 point) Let $\|X\|_0$ be the number of nonzero entries in X. What is the time complexity to compute $\hat{y}$?
Algorithm 1: Coordinate Descent Algorithm for Lasso

    while not converged do
        w_0 ← ( Σ_{i=1}^{N} ( y_i − Σ_j w_j X_ij ) ) / N
        for k ∈ {1, 2, . . . , d} do
            a_k ← 2 Σ_{i=1}^{N} X_ik²
            c_k ← 2 Σ_{i=1}^{N} X_ik ( y_i − (w_0 + Σ_{j≠k} w_j X_ij) )
            w_k ←
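The body of the $w_k$ update is cut off in the source above; the standard coordinate-wise minimizer of the Lasso objective is the soft-thresholding rule, which the following sketch assumes. This is a plain, dense-NumPy version for checking correctness on toy data, not the sparse, prediction-caching implementation the question asks you to design.

```python
import numpy as np

# Dense sketch of Algorithm 1. The w_k update is assumed to be the
# standard soft-thresholding rule:
#   w_k = (c_k + lam)/a_k  if c_k < -lam
#         0                if |c_k| <= lam
#         (c_k - lam)/a_k  if c_k > lam
def lasso_coordinate_descent(X, y, lam, n_iters=100):
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        w0 = np.mean(y - X @ w)
        for k in range(d):
            a_k = 2 * np.sum(X[:, k] ** 2)
            # residual with feature k's contribution removed
            r = y - (w0 + X @ w - X[:, k] * w[k])
            c_k = 2 * X[:, k] @ r
            if c_k < -lam:
                w[k] = (c_k + lam) / a_k
            elif c_k > lam:
                w[k] = (c_k - lam) / a_k
            else:
                w[k] = 0.0
    return w, w0

# With a large enough lambda, every weight is driven exactly to zero.
rng = np.random.default_rng(6)
X = rng.normal(size=(30, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.01 * rng.normal(size=30)
w, w0 = lasso_coordinate_descent(X, y, lam=1e6)
print(w)
```

Note that recomputing `X @ w` inside the inner loop is exactly the inefficiency that caching $\hat{y}$, as in question 1 above, removes.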