Description
Ridge Regression
Please show that for arbitrary A ∈ R
n×p
, (AT A + λIp)
−1AT = AT
(AAT + λIn)
−1
, where λ > 0.
Now assume n = 100, please compare the time consumption when p = [10, 100, 1000, 2000] and
plot the results appropriately (e.g. in one figure where X-axis denotes p while Y -axis the time
consumption).
Bias–variance trade-off for k-NN
Assume y = f(x) + where E() = 0, V ar() = σ
2
. Please show that:
Err(x0) = σ
2 +
σ
2
k
+ [f(x0) −
1
k
X
k
l=1
f(xl)]2
, (1)
where xl denotes the nearest neighbour data. Please justify Bias and Variance change when k
increases and explain if necessary.
1
0 50 100 150 200 250 300 5
10
15
20
25
30
35
40
size of training set
MSE
train
test
1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 5
10
15
20
25
30
degree
MSE
train
test
−10 −8 −6 −4 −2 0 2 4 6 8 10 0
10
20
30
40
50
60
70
80
90
log lambda
MSE
train
test
(a) (b) (c)
Figure 1: MSE vs (a) training set size, (b) polynomial degree, (c) size of ridge penalty. Solid Red = training, dotted black = test.
x y
0 (−1, −1)T
0 (−1, −2)T
0 (−2, −1)T
1 (1, 1)T
1 (1, 2)T
1 (2, 1)T
Compute Wˆ from the above data.
4 Linear, polynomial and ridge regression on Bos ton housing data (Matlab)
We will use linear regression to predict house prices, using the famous Boston housing dataset, described at http:
//www.cs.toronto.edu/˜delve/data/boston/bostonDetail.html. There are 506 records. We will
use first 13 features as inputs, x, and the 14th feature, median house price, as the output y.
All features are continuous,
except feature 4, which is binary. However, we will treat this like any other continuous variable.
1. Load the housing.data file. We will use the first 300 cases for training and the remaining 206 cases for
testing. However, the records seem to be sorted in some kind of order. To eliminate this, we will shuffle the data
before splitting into a training/ test set. So we can all compare results, let use the following convention:
Listing 1: :
data = load(’housing.data’);
x = data(:, 1:13);
y = data(:,14);
[n,d] = size(x);
seed = 2; rand(’state’,seed); randn(’state’, seed);
perm = randperm(n); % remove any possible ordering fx
x = x(perm,:); y = y(perm);
Ntrain = 300;
Xtrain = x(1:Ntrain,:); ytrain = y(1:Ntrain);
Xtest = x(Ntrain+1:end,:); ytest = y(Ntrain+1:end);
2. Now extract the first n records of the training data, for n ∈ {25, 50, 75, 100, 150, 200, 300}. For each such
training subset, standardize it, and fit a linear regression model using least squares. (Remember to include
an offset term.)
Then standardize the whole test set in the same way. Compute the mean squared error on
the training subset and on the whole test set. Plot MSE versus training set size. You should get a plot like
Figure 1(a). Turn in your plot and code. Explain why the test error decreases as n increases, and why the train
error increases as n increases. Why do the curves eventually meet?
As a debugging aid, here are the regression weights I get when I train on the first 25 cases (the first term is the
offset, w0):
2
Figure 1: MSE vs (a) training set size, (b) polynomial degree, (c) size of ridge penalty. Solid Red
= training, dotted black = test.
Shrinkage Methods
For vanilla linear regression model: min ky − Aβk
2
2
, we denote the solution as βˆ
LS; for ridge
regression model: min ky − Aβk
2
2 + λ ∗ kβk
2
2
, we denote the solution as βˆRidge
λ
; for Lasso model:
min 1
2
ky−Aβk
2
2+λ∗kβk1, we denote the solution as βˆLasso
λ
; for Subset Selection model: min 1
2
ky−
Aβk
2
2+λ∗kβk0, we denote the solution as βˆSubset
λ
, now please derive each βˆ given y, A(s.t. AT A =
I), λ. Also, show the relationship of (each element in) βˆRidge
λ
, βˆLasso
λ
, βˆSubset
λ with (that in) βˆ
LS
respectively. (you are encouraged to illustrate the relationship with figures appropriately.)
Linear Regression and its extension
In the Boston housing dataset, there are 506 records. We will use first 13 features as inputs, x, and
the 14th feature, median house price, as the output y. All features are continuous, except feature
4, which is binary. However, we will treat this like any other continuous variable.
1. Load the housing.data file. We will use the first 300 cases for training and the remaining 206
cases for testing. However, the records seem to be sorted in some kind of order. To eliminate
this, we will shuffle the data before splitting into a training/test set.
So we can all compare
results, let use the following convention:
data = load(’housing.data’);
x = data(:, 1:13);
y = data(:,14);
[n,d] = size(x);
2
seed = 2; rand(’state’,seed); randn(’state’, seed);
perm = randperm(n); % remove any possible ordering fx
x = x(perm,:); y = y(perm);
Ntrain = 300;
Xtrain = x(1:Ntrain,:); ytrain = y(1:Ntrain);
Xtest = x(Ntrain+1:end,:); ytest = y(Ntrain+1:end);
2. Now extract the first n records of the training data, for n ∈ {25, 50, 75, 100, 150, 200, 300}.
For each such training subset, standardize it (you may use zscore function in Matlab), and
fit a linear regression model using least squares. (Remember to include an offset term.)
Then
standardize the whole test set in the same way. Compute the mean squared error on the
training subset and on the whole test set. Plot MSE versus training set size. You should get
a plot like Figure 1(a). Turn in your plot and code.
Explain why the test error decreases as
n increases, and why the train error increases as n increases. Why do the curves eventually
meet? As a debugging aid, here are the regression weights I get when I train on the first 25
cases (the first term is the offset, w0): [26.11, −0.58, 3.02, . . . , −0.21, −0.27, −1.16].
3. We will now replace the original features with an expanded set of features based on higher
order terms. (We will ignore interaction terms.) For example, a quadratic expansion gives:
x11 x12 . . . x1d
.
.
.
.
.
.
.
.
.
.
.
.
xn1 xn2 . . . xnd
−→
x11 x12 . . . x1d x
2
11 x
2
12 . . . x2
1d
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
xn1 xn2 . . . xnd x
2
n1 x
2
n2
. . . x2
nd
(2)
The provided function degexpand(X,deg,addOnes) will replace each row of X with all powers
up to degree deg. Use this function to train (by least squares) models with degrees 1 to 6.
Use all the the training data. Plot the MSE on the training and test sets vs degree. You
should get a plot like Figure 1(b). Turn in your plot and code. Explain why the test error
decreases and then increases with degree, and why the train error decreases with degree.
4. Now we will use ridge regression to regularize the degree 6 polynomial. Fit models using ridge
regression with the following values for λ:
lambdas = [0 logspace(−10, 10, 10)]
Use all the training data. Plot the MSE on the training and test sets vs log10(λ). You should
get a plot like Figure 1(c). Turn in your plot and code. Explain why the test error goes down
and then up with increasing λ, and why the train error goes up with increasing λ.
5. We turn to Lasso method with objective 1
2
kXβ − yk
2 + λkβk1 where λ varies in:
lambdas = [logspace(−10, 10, 10)]
and we make use of all training samples with no feature expansion. Please plot the changes
of β with λ changes.