## Description

1 [32 points] Linear regression on a polynomial

The files q1xTrain.dat, q1xTest.dat, q1yTrain.dat and q1yTest.dat specify a linear regression problem

for a polynomial. q1xTrain.dat represent the inputs (x

(i) ∈ R) and q1yTrain.dat represents the outputs

(t

(i) ∈ R) of the training set, with one training example per row.

(a) [12 points] For each of the following optimization methods, find the coefficients (slope and intercept)

that minimize the error for a first degree polynomial.

• Batch gradient descent

• Stochastic gradient descent

• Newton’s method (note that this is the same as closed-form solution we covered in class)

i. [9 points] Give the coefficients generated by each of the three optimization methods.

1

ii. [3 points] How did the methods compare in terms of rate of convergence?

(b) [10 points] Next, you will investigate the problem of over-fitting. Recall the figure from lecture that

explored over-fitting as a function of the degree of the polynomial M, where the Root-Mean-Square

(RMS) Error is define as

ERMS =

p

2E(w∗)/N,

where

E(w) = 1

2

X

N

i=1

(

M

X−1

j=0

wjφj (x

(i)

) − t

(i)

)

2 =

1

2

X

N

i=1

(wT φ(x

(i)

) − t

(i)

)

2

:

1 4

1

7 10

i. [7 points] Using Newton’s method, find the coefficients that minimize the error for a polynomial with

M features (for M = 1 . . . 10) for the training data specified in q1xTrain.dat and q1yTrain.dat.

Now use these parameters to regenerate the above chart, using q1xTest.dat and q1yTest.dat as

the test data.

ii. [3 points] Which degree polynomial would you say best fits the data? Was there evidence of

under/over-fitting the data? Use your generated charts to defend your answer.

(c) [10 points] Finally, you will explore the role of regularization. Recall the image from lecture that

explored the affect of the regularization factor λ:

i. [7 points] Using Newton’s method, find the coefficients that minimize the error for a ninth degree

polynomial (M = 10) given regularization factor λ (for λ = {0, 10−6

, 10−5

, . . . , 10−1

, 100

(1)}) for

the training data specified in q1xTrain.dat and q1yTrain.dat. Now use these parameters to plot

the ERMS of both the training data and test data as a function of λ (regenerate the above chart,

using q1xTest.dat and q1yTest.dat as the test data). Specifically, use the following regularized

objective function:

1

2

X

N

i=1

(wT φ(x

(i)

) − t

(i)

)

2 +

λ

2

||w||2

2

.

for optimizing the parameters w, but please use the original (unregularized) ERMS for plotting.

ii. [3 points] Which λ value seemed to work the best?

2

2 [24 points] Locally weighted linear regression

Note: In this problem, we assume that the feature φ(x) is simply the raw input vector x (i.e., φ(x) = x).

Consider a linear regression problem in which we want to weight different training examples differently.

Specifically, suppose we want to minimize

ED(w) = 1

2

X

N

i=1

r

(i)

(wT x

(i) − t

(i)

)

2

.

In class, we worked out what happens for the case where all the weights (the r

(i)

’s) are the same. In this

problem, we will generalize some of those ideas to the weighted setting, and also implement the locally

weighted linear regression algorithm. (Note that the weight r

(i)

can be different for each of the training

example.)

(a) [2 points] Show that ED(w) can also be written

ED(w) = (Φw − t)

T R(Φw − t)

for an appropriate diagonal matrix R, and where Φ and t are as defined in class. State clearly what R

is.

(b) [6 points] If all the r

(i)

’s equal 1, then we saw in class that the normal equation is

Φ

T Φw = ΦT

t,

and that the value of w∗

that minimizes ED(w) is given by (ΦT Φ)−1Φ

T

t. By finding the derivative

∇wED(w) and setting that to zero, generalize the normal equation to this weighted setting, and give

the new value of w∗

that minimizes ED(w) in closed form as a function of Φ, R and t.

(c) [5 points] Suppose we have a training set {(x

(i)

, t(i)

);i = 1 . . . , N} of N independent examples, but in

which the t

(i)

’s were observed with differing variances. Specifically, suppose that

p(t

(i)

|x

(i)

; w) = 1

√

2πσ(i)

exp(−

(t

(i) − wT x

(i)

)

2

2(σ

(i))

2

)

3

I.e., t

(i) has mean wT x

(i) and variance (σ

(i)

)

2

(where the σ

(i)

’s are fixed, known, constants). Show that

finding the maximum likelihood estimate of w reduces to solving a weighted linear regression problem.

State clearly what the r

(i)

’s are in terms of the σ

(i)

’s.

(d) [11 points] The following items will use the files q2x.dat which contains the inputs (x

(i)

) and q2y.dat

which contains the outputs (t

(i)

) for a linear regression problem, with one training example per row.

i. [2 points] Implement (unweighted) linear regression (t = wT x) on this dataset (using the normal

equations), and plot on the same figure the data and the straight line resulting from your fit.

(Remember to include the intercept term.)

ii. [6 points] Implement locally weighted linear regression on this dataset (using the weighted normal

equations you derived in part (b)), and plot on the same figure the data and the curve resulting

from your fit. When evaluating h(·) at a query point x (which is real-valued in this problem), use

weights

r

(i) = exp

−

(x − x

(i)

)

2

2τ

2

with a bandwidth parameter τ = 0.8. (Again, remember to include the intercept term.)

iii. [3 points] Repeat (ii) four times with τ = 0.1, 0.3, 2 and 10. Comment briefly on what happens

to the fit when τ is too small or too large.

3 [19 points] k-Nearest Neighbors (kNN)

In kNN classification, a system has access to a labeled dataset: D = {(x

(1), t(1)),(x

(2), t(2)), . . .(x

(N)

, t(N)

)}.

Given an unseen test example xtest, a kNN algorithm will predict the example’s class (ytest) by:

1. Finding the k closest examples to xtest from the dataset, i.e. kNN(xtest) = {(x

0(1), t0(1)),(x

0(2), t0(2)), . . . ,(x

0(k)

, t0(k)

)},

such that x

0(1)

. . . x

0(k)

are the best k points among all training data at minimizing d(x

0

, xtest), where

d(x

0

, x) is a distance measure between x

0 and x.

1

2. Predicting the class label ytest as the mode class of the k Nearest Neighbors. In particular:

ytest = arg max

t

X

(x0

,t0)∈kNN(xtest)

1[t

0 = t]

Break ties either arbitrarily or randomly.

For this question, you will play with a (small) portion of the MNIST dataset (hand-written digits dataset).

Given some test example (a digit that has not been seen), your end goal is to classify it into its correct class

(0 to 9).

Sample data files are available in ctools. Load the file q3 digits.mat which will load matlab variables

digits train, labels train, digits test and labels test. The below is a code snippet that can visualize

an image from the dataset in matlab; for example, you can view the 55th image by:

im = digits_train(55, :, :);

im = reshape(im, 28, 28);

imshow(im);

1A more precise definition of kNN(xtest) would be: if kNN(xtest) = {(x

0(1), t0(1)), (x

0(2), t0(2)), . . . , (x

0(k)

, t0(k)

)}, where

(x

0(i)

, t0(i)

) ∈ D, then for all i, j such that (x

0(i)

, t0(i)

) ∈ kNN(xtest) and (x

0(j)

, t0(j)

) ∈ D\kNN(xtest), d(x

0(i)

, xtest) ≤

d(x

0(j)

, xtest)

4

(a) [9 points] For the first 5 test digits test(1:5, :, :), find the 8-Nearest neighbors. Assume that

d(x, x

0

) is the Euclidean distance between the two examples x and x

0

(calculated on the 28 × 28 = 784

pixels). For each of the 5 test digits, write down the indices of the 8 nearest neighbors and their class

labels. In addition, submit the images of each of the 5 images and its 8 nearest neighbors. The code in

q3 starter.m provides a commented example of how to plot multiple images on the same Matlab figure.

(b) [8 points] Classify all the test images (1000 examples) into their digit class using k = 10. Submit your

code and report the classification accuracy.

(c) [2 points] Based on your implementation of kNN, what are advantages and disadvantages of kNN?

[hint: what happens if we have over 100,000 training examples?] How does the value of k effect the

classification accuracy? [hint: run your code using many different values of k and observe how the

accuracy changes.]

(d) [(extra credit) 3 points] What are ways that can improve the accuracy of this kNN classifier? For

full credits, implement your ideas in code, report your improved accuracy and submit your code. [hint:

One thing you could do is to come up with a different/better distance d(., .) function. It could help to

view the examples that are incorrectly classified]

4 [25 points] Logistic regression

(a) [10 points] Consider the log-likelihood function for logistic regression:

`(w) = X

N

n=1

t

(n)

log h(x

(n)

) + (1 − t

(n)

) log(1 − h(x

(n)

)),

where h(x) = sigmoid(wT x) = 1

1+exp(−wT x)

.

First, find the Hessian H. Specifically, show that the i, jth entry (1 ≤ i, j ≤ M) of the Hessian matrix

can be written as:

Hij = −

X

n

Φnih(x

(n)

)(1 − h(x

(n)

))Φnj . (1)

[Hint: calculate Hij =

∂

2

l(w)

∂wi∂wj

by taking the derivative of l(w) twice with respect to wi and wj .]

In addition, show that the Hessian H is negative semi-definite and thus ` is concave and has no local

maxima other than the global one. That is, show that

z

T Hz ≤ 0

for any vector z. [Hint: You might want to start by showing the fact that P

i

P

j

zixixj zj = (x

T z)

2 ≥ 0.]

(b) [10 points] Using the H you calculated in part (a), write down the update rule implied by Newton’s

method for optimizing `(w). Now use this rule (and not a library function) to implement Newton’s

method and apply it to binary classification problem specified in files q4x.dat and q4y.dat. The two

columns of q4x.dat represent the inputs (x

(i)

) and q4y.dat represents the outputs (t

(i) ∈ {0, 1}), with

one training example per row. Initialize Newtons method with w = 0 (the vector of all zeros). What

are the coefficients w, including the intercept term, resulting from your fit?

(c) [5 points] Plot the training data (your axes should be x1 and x2, corresponding to the two coordinates

of the inputs, and you should use a different symbol for each point plotted to indicate whether that

example had label 1 or 0). Also plot on the same figure the decision boundary fit by logistic regression.

(I.e., this should be a straight line showing the boundary separating the region where h(x) > 0.5 from

where h(x) ≤ 0.5.)

5

Helpful commands: plot, hold on, hold off, figure, close. You can always find detailed information about commands by typing help hcommandi in the command window in Matlab.

Example Code:

%Plot data points in different classes:

q4xClass0 = q4x(q4y==0,:);

q4xClass1 = q4x(q4y==1,:);

hold on

plot(q4xClass0(:,1),q4xClass0(:,2),?rx?);

plot(q4xClass1(:,1),q4xClass1(:,2),?go?);

hold off

6