## Description

## Ridge Regression

Please show that for arbitrary A ∈ R

n×p

, (AT A + λIp)

−1AT = AT

(AAT + λIn)

−1

, where λ > 0.

Now assume n = 100, please compare the time consumption when p = [10, 100, 1000, 2000] and

plot the results appropriately (e.g. in one figure where X-axis denotes p while Y -axis the time

consumption).

## Bias–variance trade-off for k-NN

Assume y = f(x) + where E() = 0, V ar() = σ

2

. Please show that:

Err(x0) = σ

2 +

σ

2

k

+ [f(x0) −

1

k

X

k

l=1

f(xl)]2

, (1)

where xl denotes the nearest neighbour data. Please justify Bias and Variance change when k

increases and explain if necessary.

1

0 50 100 150 200 250 300 5

10

15

20

25

30

35

40

size of training set

MSE

train

test

1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 5

10

15

20

25

30

degree

MSE

train

test

−10 −8 −6 −4 −2 0 2 4 6 8 10 0

10

20

30

40

50

60

70

80

90

log lambda

MSE

train

test

(a) (b) (c)

Figure 1: MSE vs (a) training set size, (b) polynomial degree, (c) size of ridge penalty. Solid Red = training, dotted black = test.

x y

0 (−1, −1)T

0 (−1, −2)T

0 (−2, −1)T

1 (1, 1)T

1 (1, 2)T

1 (2, 1)T

Compute Wˆ from the above data.

4 Linear, polynomial and ridge regression on Bos ton housing data (Matlab)

We will use linear regression to predict house prices, using the famous Boston housing dataset, described at http:

//www.cs.toronto.edu/˜delve/data/boston/bostonDetail.html. There are 506 records. We will

use first 13 features as inputs, x, and the 14th feature, median house price, as the output y.

All features are continuous,

except feature 4, which is binary. However, we will treat this like any other continuous variable.

1. Load the housing.data file. We will use the first 300 cases for training and the remaining 206 cases for

testing. However, the records seem to be sorted in some kind of order. To eliminate this, we will shuffle the data

before splitting into a training/ test set. So we can all compare results, let use the following convention:

Listing 1: :

data = load(’housing.data’);

x = data(:, 1:13);

y = data(:,14);

[n,d] = size(x);

seed = 2; rand(’state’,seed); randn(’state’, seed);

perm = randperm(n); % remove any possible ordering fx

x = x(perm,:); y = y(perm);

Ntrain = 300;

Xtrain = x(1:Ntrain,:); ytrain = y(1:Ntrain);

Xtest = x(Ntrain+1:end,:); ytest = y(Ntrain+1:end);

2. Now extract the first n records of the training data, for n ∈ {25, 50, 75, 100, 150, 200, 300}. For each such

training subset, standardize it, and fit a linear regression model using least squares. (Remember to include

an offset term.)

Then standardize the whole test set in the same way. Compute the mean squared error on

the training subset and on the whole test set. Plot MSE versus training set size. You should get a plot like

Figure 1(a). Turn in your plot and code. Explain why the test error decreases as n increases, and why the train

error increases as n increases. Why do the curves eventually meet?

As a debugging aid, here are the regression weights I get when I train on the first 25 cases (the first term is the

offset, w0):

2

Figure 1: MSE vs (a) training set size, (b) polynomial degree, (c) size of ridge penalty. Solid Red

= training, dotted black = test.

## Shrinkage Methods

For vanilla linear regression model: min ky − Aβk

2

2

, we denote the solution as βˆ

LS; for ridge

regression model: min ky − Aβk

2

2 + λ ∗ kβk

2

2

, we denote the solution as βˆRidge

λ

; for Lasso model:

min 1

2

ky−Aβk

2

2+λ∗kβk1, we denote the solution as βˆLasso

λ

; for Subset Selection model: min 1

2

ky−

Aβk

2

2+λ∗kβk0, we denote the solution as βˆSubset

λ

, now please derive each βˆ given y, A(s.t. AT A =

I), λ. Also, show the relationship of (each element in) βˆRidge

λ

, βˆLasso

λ

, βˆSubset

λ with (that in) βˆ

LS

respectively. (you are encouraged to illustrate the relationship with figures appropriately.)

## Linear Regression and its extension

In the Boston housing dataset, there are 506 records. We will use first 13 features as inputs, x, and

the 14th feature, median house price, as the output y. All features are continuous, except feature

4, which is binary. However, we will treat this like any other continuous variable.

1. Load the housing.data file. We will use the first 300 cases for training and the remaining 206

cases for testing. However, the records seem to be sorted in some kind of order. To eliminate

this, we will shuffle the data before splitting into a training/test set.

So we can all compare

results, let use the following convention:

data = load(’housing.data’);

x = data(:, 1:13);

y = data(:,14);

[n,d] = size(x);

2

seed = 2; rand(’state’,seed); randn(’state’, seed);

perm = randperm(n); % remove any possible ordering fx

x = x(perm,:); y = y(perm);

Ntrain = 300;

Xtrain = x(1:Ntrain,:); ytrain = y(1:Ntrain);

Xtest = x(Ntrain+1:end,:); ytest = y(Ntrain+1:end);

2. Now extract the first n records of the training data, for n ∈ {25, 50, 75, 100, 150, 200, 300}.

For each such training subset, standardize it (you may use zscore function in Matlab), and

fit a linear regression model using least squares. (Remember to include an offset term.)

Then

standardize the whole test set in the same way. Compute the mean squared error on the

training subset and on the whole test set. Plot MSE versus training set size. You should get

a plot like Figure 1(a). Turn in your plot and code.

Explain why the test error decreases as

n increases, and why the train error increases as n increases. Why do the curves eventually

meet? As a debugging aid, here are the regression weights I get when I train on the first 25

cases (the first term is the offset, w0): [26.11, −0.58, 3.02, . . . , −0.21, −0.27, −1.16].

3. We will now replace the original features with an expanded set of features based on higher

order terms. (We will ignore interaction terms.) For example, a quadratic expansion gives:

x11 x12 . . . x1d

.

.

.

.

.

.

.

.

.

.

.

.

xn1 xn2 . . . xnd

−→

x11 x12 . . . x1d x

2

11 x

2

12 . . . x2

1d

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

xn1 xn2 . . . xnd x

2

n1 x

2

n2

. . . x2

nd

(2)

The provided function degexpand(X,deg,addOnes) will replace each row of X with all powers

up to degree deg. Use this function to train (by least squares) models with degrees 1 to 6.

Use all the the training data. Plot the MSE on the training and test sets vs degree. You

should get a plot like Figure 1(b). Turn in your plot and code. Explain why the test error

decreases and then increases with degree, and why the train error decreases with degree.

4. Now we will use ridge regression to regularize the degree 6 polynomial. Fit models using ridge

regression with the following values for λ:

lambdas = [0 logspace(−10, 10, 10)]

Use all the training data. Plot the MSE on the training and test sets vs log10(λ). You should

get a plot like Figure 1(c). Turn in your plot and code. Explain why the test error goes down

and then up with increasing λ, and why the train error goes up with increasing λ.

5. We turn to Lasso method with objective 1

2

kXβ − yk

2 + λkβk1 where λ varies in:

lambdas = [logspace(−10, 10, 10)]

and we make use of all training samples with no feature expansion. Please plot the changes

of β with λ changes.