# EECS445: Introduction to Machine Learning Homework #1 solution

\$30.00

Original Work
Category:

## Description

1 [32 points] Linear regression on a polynomial
The files q1xTrain.dat, q1xTest.dat, q1yTrain.dat and q1yTest.dat specify a linear regression problem
for a polynomial. q1xTrain.dat represent the inputs (x
(i) ∈ R) and q1yTrain.dat represents the outputs
(t
(i) ∈ R) of the training set, with one training example per row.
(a) [12 points] For each of the following optimization methods, find the coefficients (slope and intercept)
that minimize the error for a first degree polynomial.
• Newton’s method (note that this is the same as closed-form solution we covered in class)
i. [9 points] Give the coefficients generated by each of the three optimization methods.
1
ii. [3 points] How did the methods compare in terms of rate of convergence?
(b) [10 points] Next, you will investigate the problem of over-fitting. Recall the figure from lecture that
explored over-fitting as a function of the degree of the polynomial M, where the Root-Mean-Square
(RMS) Error is define as
ERMS =
p
2E(w∗)/N,
where
E(w) = 1
2
X
N
i=1
(
M
X−1
j=0
wjφj (x
(i)
) − t
(i)
)
2 =
1
2
X
N
i=1
(wT φ(x
(i)
) − t
(i)
)
2
:
1 4
1
7 10
i. [7 points] Using Newton’s method, find the coefficients that minimize the error for a polynomial with
M features (for M = 1 . . . 10) for the training data specified in q1xTrain.dat and q1yTrain.dat.
Now use these parameters to regenerate the above chart, using q1xTest.dat and q1yTest.dat as
the test data.
ii. [3 points] Which degree polynomial would you say best fits the data? Was there evidence of
(c) [10 points] Finally, you will explore the role of regularization. Recall the image from lecture that
explored the affect of the regularization factor λ:
i. [7 points] Using Newton’s method, find the coefficients that minimize the error for a ninth degree
polynomial (M = 10) given regularization factor λ (for λ = {0, 10−6
, 10−5
, . . . , 10−1
, 100
(1)}) for
the training data specified in q1xTrain.dat and q1yTrain.dat. Now use these parameters to plot
the ERMS of both the training data and test data as a function of λ (regenerate the above chart,
using q1xTest.dat and q1yTest.dat as the test data). Specifically, use the following regularized
objective function:
1
2
X
N
i=1
(wT φ(x
(i)
) − t
(i)
)
2 +
λ
2
||w||2
2
.
for optimizing the parameters w, but please use the original (unregularized) ERMS for plotting.
ii. [3 points] Which λ value seemed to work the best?
2
2 [24 points] Locally weighted linear regression
Note: In this problem, we assume that the feature φ(x) is simply the raw input vector x (i.e., φ(x) = x).
Consider a linear regression problem in which we want to weight different training examples differently.
Specifically, suppose we want to minimize
ED(w) = 1
2
X
N
i=1
r
(i)
(wT x
(i) − t
(i)
)
2
.
In class, we worked out what happens for the case where all the weights (the r
(i)
’s) are the same. In this
problem, we will generalize some of those ideas to the weighted setting, and also implement the locally
weighted linear regression algorithm. (Note that the weight r
(i)
can be different for each of the training
example.)
(a) [2 points] Show that ED(w) can also be written
ED(w) = (Φw − t)
T R(Φw − t)
for an appropriate diagonal matrix R, and where Φ and t are as defined in class. State clearly what R
is.
(b) [6 points] If all the r
(i)
’s equal 1, then we saw in class that the normal equation is
Φ
T Φw = ΦT
t,
and that the value of w∗
that minimizes ED(w) is given by (ΦT Φ)−1Φ
T
t. By finding the derivative
∇wED(w) and setting that to zero, generalize the normal equation to this weighted setting, and give
the new value of w∗
that minimizes ED(w) in closed form as a function of Φ, R and t.
(c) [5 points] Suppose we have a training set {(x
(i)
, t(i)
);i = 1 . . . , N} of N independent examples, but in
which the t
(i)
’s were observed with differing variances. Specifically, suppose that
p(t
(i)
|x
(i)
; w) = 1

2πσ(i)
exp(−
(t
(i) − wT x
(i)
)
2
2(σ
(i))
2
)
3
I.e., t
(i) has mean wT x
(i) and variance (σ
(i)
)
2
(where the σ
(i)
’s are fixed, known, constants). Show that
finding the maximum likelihood estimate of w reduces to solving a weighted linear regression problem.
State clearly what the r
(i)
’s are in terms of the σ
(i)
’s.
(d) [11 points] The following items will use the files q2x.dat which contains the inputs (x
(i)
) and q2y.dat
which contains the outputs (t
(i)
) for a linear regression problem, with one training example per row.
i. [2 points] Implement (unweighted) linear regression (t = wT x) on this dataset (using the normal
equations), and plot on the same figure the data and the straight line resulting from your fit.
(Remember to include the intercept term.)
ii. [6 points] Implement locally weighted linear regression on this dataset (using the weighted normal
equations you derived in part (b)), and plot on the same figure the data and the curve resulting
from your fit. When evaluating h(·) at a query point x (which is real-valued in this problem), use
weights
r
(i) = exp 

(x − x
(i)
)
2

2

with a bandwidth parameter τ = 0.8. (Again, remember to include the intercept term.)
iii. [3 points] Repeat (ii) four times with τ = 0.1, 0.3, 2 and 10. Comment briefly on what happens
to the fit when τ is too small or too large.
3 [19 points] k-Nearest Neighbors (kNN)
In kNN classification, a system has access to a labeled dataset: D = {(x
(1), t(1)),(x
(2), t(2)), . . .(x
(N)
, t(N)
)}.
Given an unseen test example xtest, a kNN algorithm will predict the example’s class (ytest) by:
1. Finding the k closest examples to xtest from the dataset, i.e. kNN(xtest) = {(x
0(1), t0(1)),(x
0(2), t0(2)), . . . ,(x
0(k)
, t0(k)
)},
such that x
0(1)
. . . x
0(k)
are the best k points among all training data at minimizing d(x
0
, xtest), where
d(x
0
, x) is a distance measure between x
0 and x.
1
2. Predicting the class label ytest as the mode class of the k Nearest Neighbors. In particular:
ytest = arg max
t
X
(x0
,t0)∈kNN(xtest)
1[t
0 = t]
Break ties either arbitrarily or randomly.
For this question, you will play with a (small) portion of the MNIST dataset (hand-written digits dataset).
Given some test example (a digit that has not been seen), your end goal is to classify it into its correct class
(0 to 9).
Sample data files are available in ctools. Load the file q3 digits.mat which will load matlab variables
digits train, labels train, digits test and labels test. The below is a code snippet that can visualize
an image from the dataset in matlab; for example, you can view the 55th image by:
im = digits_train(55, :, :);
im = reshape(im, 28, 28);
imshow(im);
1A more precise definition of kNN(xtest) would be: if kNN(xtest) = {(x
0(1), t0(1)), (x
0(2), t0(2)), . . . , (x
0(k)
, t0(k)
)}, where
(x
0(i)
, t0(i)
) ∈ D, then for all i, j such that (x
0(i)
, t0(i)
) ∈ kNN(xtest) and (x
0(j)
, t0(j)
) ∈ D\kNN(xtest), d(x
0(i)
, xtest) ≤
d(x
0(j)
, xtest)
4
(a) [9 points] For the first 5 test digits test(1:5, :, :), find the 8-Nearest neighbors. Assume that
d(x, x
0
) is the Euclidean distance between the two examples x and x
0
(calculated on the 28 × 28 = 784
pixels). For each of the 5 test digits, write down the indices of the 8 nearest neighbors and their class
labels. In addition, submit the images of each of the 5 images and its 8 nearest neighbors. The code in
q3 starter.m provides a commented example of how to plot multiple images on the same Matlab figure.
(b) [8 points] Classify all the test images (1000 examples) into their digit class using k = 10. Submit your
code and report the classification accuracy.
[hint: what happens if we have over 100,000 training examples?] How does the value of k effect the
classification accuracy? [hint: run your code using many different values of k and observe how the
accuracy changes.]
(d) [(extra credit) 3 points] What are ways that can improve the accuracy of this kNN classifier? For
One thing you could do is to come up with a different/better distance d(., .) function. It could help to
view the examples that are incorrectly classified]
4 [25 points] Logistic regression
(a) [10 points] Consider the log-likelihood function for logistic regression:
`(w) = X
N
n=1
t
(n)
log h(x
(n)
) + (1 − t
(n)
) log(1 − h(x
(n)
)),
where h(x) = sigmoid(wT x) = 1
1+exp(−wT x)
.
First, find the Hessian H. Specifically, show that the i, jth entry (1 ≤ i, j ≤ M) of the Hessian matrix
can be written as:
Hij = −
X
n
Φnih(x
(n)
)(1 − h(x
(n)
))Φnj . (1)
[Hint: calculate Hij =

2
l(w)
∂wi∂wj
by taking the derivative of l(w) twice with respect to wi and wj .]
In addition, show that the Hessian H is negative semi-definite and thus ` is concave and has no local
maxima other than the global one. That is, show that
z
T Hz ≤ 0
for any vector z. [Hint: You might want to start by showing the fact that P
i
P
j
zixixj zj = (x
T z)
2 ≥ 0.]
(b) [10 points] Using the H you calculated in part (a), write down the update rule implied by Newton’s
method for optimizing `(w). Now use this rule (and not a library function) to implement Newton’s
method and apply it to binary classification problem specified in files q4x.dat and q4y.dat. The two
columns of q4x.dat represent the inputs (x
(i)
) and q4y.dat represents the outputs (t
(i) ∈ {0, 1}), with
one training example per row. Initialize Newtons method with w = 0 (the vector of all zeros). What
are the coefficients w, including the intercept term, resulting from your fit?
(c) [5 points] Plot the training data (your axes should be x1 and x2, corresponding to the two coordinates
of the inputs, and you should use a different symbol for each point plotted to indicate whether that
example had label 1 or 0). Also plot on the same figure the decision boundary fit by logistic regression.
(I.e., this should be a straight line showing the boundary separating the region where h(x) > 0.5 from
where h(x) ≤ 0.5.)
5
Helpful commands: plot, hold on, hold off, figure, close. You can always find detailed information about commands by typing help hcommandi in the command window in Matlab.
Example Code:
%Plot data points in different classes:
q4xClass0 = q4x(q4y==0,:);
q4xClass1 = q4x(q4y==1,:);
hold on
plot(q4xClass0(:,1),q4xClass0(:,2),?rx?);
plot(q4xClass1(:,1),q4xClass1(:,2),?go?);
hold off
6