Homework Set 1 CPSC 8420
Ridge Regression
Please show that for arbitrary $A \in \mathbb{R}^{n \times p}$, $(A^T A + \lambda I_p)^{-1} A^T = A^T (A A^T + \lambda I_n)^{-1}$, where $\lambda > 0$.
Now assume $n = 100$; please compare the time consumption when $p = [10, 100, 1000, 2000]$ and plot the results appropriately (e.g., in one figure where the X-axis denotes $p$ and the Y-axis the time consumption).
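A minimal Matlab timing sketch for this comparison might look as follows; the variable names (ps, t_left, t_right) and the use of tic/toc are illustrative choices, not part of the assignment.

% Timing sketch: (A'A + lambda*I_p)^{-1} A'  vs  A' (AA' + lambda*I_n)^{-1}
n = 100; lambda = 1; ps = [10 100 1000 2000];
t_left = zeros(size(ps)); t_right = zeros(size(ps));
for j = 1:numel(ps)
    p = ps(j);
    A = randn(n, p);
    tic; B1 = (A'*A + lambda*eye(p)) \ A';  t_left(j) = toc;   % p x p system
    tic; B2 = A' / (A*A' + lambda*eye(n));  t_right(j) = toc;  % n x n system
    % norm(B1 - B2, 'fro') should be near zero, consistent with the identity above
end
figure; plot(ps, t_left, '-o', ps, t_right, '-s');
xlabel('p'); ylabel('time (seconds)'); legend('p x p version', 'n x n version');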
Bias–variance trade-off for k-NN
Assume $y = f(x) + \epsilon$ where $E(\epsilon) = 0$, $\mathrm{Var}(\epsilon) = \sigma^2$. Please show that:
\[
\mathrm{Err}(x_0) = \sigma^2 + \frac{\sigma^2}{k} + \Big[f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f(x_l)\Big]^2,
\tag{1}
\]
where $x_l$ denotes the $l$-th nearest-neighbour data point of $x_0$. Please describe how the Bias and the Variance change as $k$ increases, and justify your answer.
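A hedged setup sketch (not the full derivation): treating the $k$ nearest neighbours $x_1, \dots, x_k$ of $x_0$ as fixed and writing the test response as $y_0 = f(x_0) + \epsilon_0$ with noise independent of the training noise, the k-NN estimate and the standard error decomposition take the form
\[
\hat{f}(x_0) = \frac{1}{k}\sum_{l=1}^{k} y_l = \frac{1}{k}\sum_{l=1}^{k} f(x_l) + \frac{1}{k}\sum_{l=1}^{k}\epsilon_l,
\qquad
\mathrm{Err}(x_0) = E\big[(y_0 - \hat{f}(x_0))^2\big] = \sigma^2 + \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[f(x_0) - E\,\hat{f}(x_0)\big]^2.
\]
Matching the variance and bias terms with the remaining two pieces of Eq. (1) is the part left to show.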
[Figure 1 here: three panels of MSE curves, (a) MSE vs. size of training set, (b) MSE vs. degree, (c) MSE vs. log lambda, each showing a train curve and a test curve.]
Figure 1: MSE vs (a) training set size, (b) polynomial degree, (c) size of ridge penalty. Solid red = training, dotted black = test.
x    y
0    (−1, −1)^T
0    (−1, −2)^T
0    (−2, −1)^T
1    (1, 1)^T
1    (1, 2)^T
1    (2, 1)^T

Compute $\hat{W}$ from the above data.
Shrinkage Methods
For the vanilla linear regression model $\min_\beta \|y - A\beta\|_2^2$, we denote the solution as $\hat{\beta}^{LS}$; for the ridge regression model $\min_\beta \|y - A\beta\|_2^2 + \lambda\|\beta\|_2^2$, we denote the solution as $\hat{\beta}^{Ridge}_{\lambda}$; for the Lasso model $\min_\beta \frac{1}{2}\|y - A\beta\|_2^2 + \lambda\|\beta\|_1$, we denote the solution as $\hat{\beta}^{Lasso}_{\lambda}$; for the Subset Selection model $\min_\beta \frac{1}{2}\|y - A\beta\|_2^2 + \lambda\|\beta\|_0$, we denote the solution as $\hat{\beta}^{Subset}_{\lambda}$. Now please derive each $\hat{\beta}$ given $y$, $A$ (s.t. $A^T A = I$), and $\lambda$. Also, show the relationship of (each element in) $\hat{\beta}^{Ridge}_{\lambda}$, $\hat{\beta}^{Lasso}_{\lambda}$, $\hat{\beta}^{Subset}_{\lambda}$ with (that in) $\hat{\beta}^{LS}$, respectively. (You are encouraged to illustrate the relationships with figures where appropriate.)
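Since the problem encourages figures, the following Matlab sketch plots, element-wise, how a ridge, lasso, and subset-selection coefficient might relate to the least-squares coefficient under the assumption $A^T A = I$; the closed forms used in the comments are assumptions to be verified by your own derivation, not given results.

% Sketch: element-wise maps from beta_LS to the penalized solutions (assuming A'A = I)
lambda = 1;
b_ls = linspace(-3, 3, 601);                          % least-squares coefficient values
b_ridge  = b_ls / (1 + lambda);                       % assumed ridge map: uniform shrinkage
b_lasso  = sign(b_ls) .* max(abs(b_ls) - lambda, 0);  % assumed lasso map: soft thresholding
b_subset = b_ls .* (abs(b_ls) > sqrt(2*lambda));      % assumed subset map: hard thresholding
figure; plot(b_ls, b_ls, 'k--', b_ls, b_ridge, b_ls, b_lasso, b_ls, b_subset);
xlabel('\beta^{LS}'); ylabel('penalized coefficient');
legend('identity', 'ridge', 'lasso', 'subset');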
Linear Regression and its extension
In the Boston housing dataset (described at http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html), there are 506 records. We will use the first 13 features as inputs, $x$, and the 14th feature, median house price, as the output $y$. All features are continuous, except feature 4, which is binary. However, we will treat this like any other continuous variable.
1. Load the housing.data file. We will use the first 300 cases for training and the remaining 206
cases for testing. However, the records seem to be sorted in some kind of order. To eliminate
this, we will shuffle the data before splitting into a training/test set.
So we can all compare
results, let us use the following convention:
data = load('housing.data');
x = data(:, 1:13);
y = data(:,14);
[n,d] = size(x);
seed = 2; rand('state',seed); randn('state', seed);
perm = randperm(n); % remove any possible ordering fx
x = x(perm,:); y = y(perm);
Ntrain = 300;
Xtrain = x(1:Ntrain,:); ytrain = y(1:Ntrain);
Xtest = x(Ntrain+1:end,:); ytest = y(Ntrain+1:end);
2. Now extract the first n records of the training data, for n ∈ {25, 50, 75, 100, 150, 200, 300}.
For each such training subset, standardize it (you may use the zscore function in Matlab), and
fit a linear regression model using least squares. (Remember to include an offset term.)
Then
standardize the whole test set in the same way. Compute the mean squared error on the
training subset and on the whole test set. Plot MSE versus training set size. You should get a plot like Figure 1(a); a minimal code sketch for this step appears after this list. Turn in your plot and code.
Explain why the test error decreases as
n increases, and why the train error increases as n increases. Why do the curves eventually
meet? As a debugging aid, here are the regression weights I get when I train on the first 25
cases (the first term is the offset, w0): [26.11, −0.58, 3.02, . . . , −0.21, −0.27, −1.16].
3. We will now replace the original features with an expanded set of features based on higher
order terms. (We will ignore interaction terms.) For example, a quadratic expansion gives:
\[
\begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{nd}
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1d} & x_{11}^2 & x_{12}^2 & \dots & x_{1d}^2 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{nd} & x_{n1}^2 & x_{n2}^2 & \dots & x_{nd}^2
\end{pmatrix}
\tag{2}
\]
The provided function degexpand(X,deg,addOnes) will replace each row of X with all powers
up to degree deg. Use this function to train (by least squares) models with degrees 1 to 6.
Use all the training data. Plot the MSE on the training and test sets vs degree. You
should get a plot like Figure 1(b). Turn in your plot and code. Explain why the test error
decreases and then increases with degree, and why the train error decreases with degree.
4. Now we will use ridge regression to regularize the degree 6 polynomial. Fit models using ridge
regression with the following values for λ:
lambdas = [0 logspace(-10, 10, 10)]
Use all the training data. Plot the MSE on the training and test sets vs log10(λ). You should
get a plot like Figure 1(c). Turn in your plot and code. Explain why the test error goes down
and then up with increasing λ, and why the train error goes up with increasing λ.
5. We turn to the Lasso method with objective $\frac{1}{2}\|X\beta - y\|^2 + \lambda\|\beta\|_1$, where $\lambda$ varies in:
lambdas = [logspace(-10, 10, 10)]
and we make use of all training samples with no feature expansion. Please plot how $\beta$ changes as $\lambda$ changes.
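A minimal Matlab sketch for step 2 is given below; it assumes Xtrain, ytrain, Xtest, ytest from the listing above, standardizes the test data with the training subset's statistics, and appends a column of ones for the offset. Steps 3–5 follow the same pattern, with degexpand and a ridge/lasso solver substituted for the plain least-squares fit.

% Sketch for step 2 (assumes Xtrain, ytrain, Xtest, ytest from the listing above)
sizes = [25 50 75 100 150 200 300];
mse_train = zeros(size(sizes)); mse_test = zeros(size(sizes));
for i = 1:numel(sizes)
    n = sizes(i);
    Xs = Xtrain(1:n,:); ys = ytrain(1:n);
    mu = mean(Xs); sd = std(Xs); sd(sd == 0) = 1;   % standardize using the subset's statistics
    Zs = (Xs - repmat(mu, n, 1)) ./ repmat(sd, n, 1);
    Zt = (Xtest - repmat(mu, size(Xtest,1), 1)) ./ repmat(sd, size(Xtest,1), 1);
    w = [ones(n,1) Zs] \ ys;                        % least squares with an offset term
    mse_train(i) = mean(([ones(n,1) Zs]*w - ys).^2);
    mse_test(i)  = mean(([ones(size(Zt,1),1) Zt]*w - ytest).^2);
end
figure; plot(sizes, mse_train, 'r-', sizes, mse_test, 'k:');
xlabel('size of training set'); ylabel('MSE'); legend('train', 'test');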
Homework Set 2 CPSC 8420
Problem 1
For PCA, from the perspective of maximizing variance, please show that the solution $\phi$ maximizing $\|X\phi\|_2^2$, s.t. $\|\phi\|_2 = 1$, is exactly the first column of $U$, where $[U, S] = \mathrm{svd}(X^T X)$. (Note: you need to prove why it is better than any other reasonable combination of the $U_i$, say $\hat{\phi} = 0.8 * U(:,1) + 0.6 * U(:,2)$, which also satisfies $\|\hat{\phi}\|_2 = 1$.)
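A quick numerical sanity check (not a proof, and not a substitute for one) in Matlab could look like this; it assumes an arbitrary random data matrix X and simply compares the objective value at U(:,1) with the value at the combination mentioned in the note.

% Sanity check: ||X*phi||_2^2 at U(:,1) versus at a unit-norm combination of U(:,1), U(:,2)
X = randn(50, 5);
[U, S] = svd(X' * X);
phi1 = U(:,1);
phi2 = 0.8*U(:,1) + 0.6*U(:,2);       % also unit norm, since the columns of U are orthonormal
[norm(X*phi1)^2, norm(X*phi2)^2]      % the first value should be the larger one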
Problem 2
Why might we prefer to minimize the sum of absolute residuals instead of the residual sum of squares for some data sets? Recall that in the K-means clustering method the centroid is computed as the mean of the data points belonging to the same cluster; what is the corresponding centroid for K-medians, and what is its advantage over K-means? Please use a synthetic (toy) experiment to illustrate your conclusion.
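One possible shape for the toy experiment (only a sketch, with arbitrary synthetic numbers): compare the mean and the coordinate-wise median of a cluster that contains a single far-away outlier.

% Toy sketch: effect of one outlier on a cluster's mean vs. its median
cluster = [randn(20, 2); 50 50];           % 20 nearby points plus one extreme outlier
c_mean   = mean(cluster);                  % pulled far toward the outlier
c_median = median(cluster);                % stays near the bulk of the points
figure; scatter(cluster(:,1), cluster(:,2)); hold on;
plot(c_mean(1), c_mean(2), 'rx', c_median(1), c_median(2), 'bo');
legend('data', 'mean (K-means centroid)', 'median (K-medians centroid)');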
Problem 3
Let's revisit the Least Squares problem: $\min_{\beta} \frac{1}{2}\|y - A\beta\|_2^2$, where $A \in \mathbb{R}^{n \times p}$.
1. Please show that if $p > n$, then the vanilla solution $(A^T A)^{-1} A^T y$ is no longer applicable.
2. Let's assume $A = [1, 2, 4;\; 1, 3, 5;\; 1, 7, 7;\; 1, 8, 9]$ and $y = [1; 2; 3; 4]$. Please show via experimental results that the Gradient Descent method obtains the optimal solution with a linear convergence rate if the learning rate is fixed to $\frac{1}{\sigma_{\max}(A^T A)}$ and $\beta_0 = [0; 0; 0]$ (a minimal sketch of this experiment appears after this problem).
3. Now let's consider ridge regression: $\min_{\beta} \frac{1}{2}\|y - A\beta\|_2^2 + \frac{\lambda}{2}\|\beta\|_2^2$, where $A$, $y$, $\beta_0$ remain the same as above while the learning rate is fixed to $\frac{1}{\lambda + \sigma_{\max}(A^T A)}$ and $\lambda$ varies over $0.1, 1, 10, 100, 200$. Please show that the Gradient Descent method converges faster with larger $\lambda$.
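A minimal Matlab sketch of the gradient-descent experiment in parts 2 and 3 might look as follows; it uses the step sizes stated in the problem, tracks the distance to the closed-form optimum, and is meant only as a starting point (set lambda = 0 for part 2).

% Sketch for parts 2-3: gradient descent on 0.5*||y - A*beta||_2^2 + 0.5*lambda*||beta||_2^2
A = [1 2 4; 1 3 5; 1 7 7; 1 8 9]; y = [1; 2; 3; 4];
lambda = 0;                                    % use 0.1, 1, 10, 100, 200 for part 3
eta = 1 / (lambda + max(eig(A'*A)));           % fixed learning rate from the problem
beta = [0; 0; 0]; T = 500; err = zeros(T, 1);
beta_star = (A'*A + lambda*eye(3)) \ (A'*y);   % closed-form optimum, for reference only
for t = 1:T
    g = A'*(A*beta - y) + lambda*beta;         % gradient of the objective
    beta = beta - eta*g;
    err(t) = norm(beta - beta_star);
end
figure; semilogy(err); xlabel('iteration'); ylabel('distance to optimum');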
Problem 4
Please download the image from https://en.wikipedia.org/wiki/Lenna#/media/File:Lenna_(test_image).png, with dimension 512 × 512 × 3. Assume for each RGB channel data $X$ we have $[U, \Sigma, V] = \mathrm{svd}(X)$. Please show the compression ratio and the reconstructed image if we choose the first 2, 5, 20, 50, 80, 100 components respectively. Also, please determine the best number of components to obtain a good trade-off between compression ratio and reconstructed-image quality. (Open question; that is, your solution will be accepted as long as it is reasonable.)
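A per-channel SVD sketch in Matlab is given below; the file name assumes the downloaded PNG is saved locally, and the compression ratio uses one common definition, $k(m+n+1)/(mn)$ per channel, which you may replace with your own.

% Sketch: rank-k reconstruction of each RGB channel (assumes the PNG is saved locally)
img = double(imread('Lenna_(test_image).png')) / 255;   % 512 x 512 x 3, values in [0,1]
ks = [2 5 20 50 80 100];
for j = 1:numel(ks)
    k = ks(j); rec = zeros(size(img));
    for c = 1:3
        [U, S, V] = svd(img(:,:,c));
        rec(:,:,c) = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';  % keep the first k components
    end
    rec = min(max(rec, 0), 1);                           % clip to the valid display range
    ratio = k * (512 + 512 + 1) / (512 * 512);           % one common storage-ratio definition
    figure; image(rec); axis image off;
    title(sprintf('k = %d, ratio = %.3f', k, ratio));
end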
Homework Set 3 CPSC 8420
Problem 1
Given data-points {{1, 3}, {2, 5}, {3, 4}, {4, 3}, {5, 2}, {5, 1}}.
1. Please scatter-plot each data point within one figure (you can use Matlab, Python or any
other programming language).
2. Now if we want to reduce the dimension from 2 to 1 by PCA, please determine the projection
line which crosses the origin (please plot the line based on the scatter-plot figure above).
3. Assume the first 4 data points belong to one class, while the remaining 2 belong to the other. Now
if we want to reduce the dimension from 2 to 1 by LDA, please determine the projection line
which crosses the origin (you are expected to plot the line based on the scatter-plot figure).
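A Matlab sketch for parts 1 and 2 might look like the following; it centers the data before computing the leading principal direction and then draws that direction as a line through the origin. Whether (and how) to center is part of what your own answer should justify.

% Sketch for parts 1-2: scatter plot plus the first principal direction through the origin
P = [1 3; 2 5; 3 4; 4 3; 5 2; 5 1];
figure; scatter(P(:,1), P(:,2), 'filled'); hold on; axis equal;
Pc = P - repmat(mean(P), size(P,1), 1);    % center the data
[V, D] = eig(Pc' * Pc);
[~, idx] = max(diag(D));
v = V(:, idx);                             % leading eigenvector = first principal direction
t = [-6 6];
plot(t*v(1), t*v(2), 'r-');                % projection line drawn through the origin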
Problem 2
Given positive data-set {{1, 1}, {2, 2}, {2, 3}}, as well as negative data-set {{3, 2}, {3, 3}, {4, 4}},
please determine the decision boundary when leveraging k-NN where k = 1 and k = 3 respectively.
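A brute-force Matlab sketch for visualizing the k = 1 decision regions on a dense grid is given below (no toolbox functions assumed); the k = 3 case follows by replacing the single nearest neighbour with a majority vote over the three smallest distances.

% Sketch: 1-NN decision boundary on a grid (+1 = positive class, -1 = negative class)
X = [1 1; 2 2; 2 3; 3 2; 3 3; 4 4]; lab = [1 1 1 -1 -1 -1];
[gx, gy] = meshgrid(0:0.05:5, 0:0.05:5);
pred = zeros(size(gx));
for i = 1:numel(gx)
    d = (X(:,1) - gx(i)).^2 + (X(:,2) - gy(i)).^2;   % squared distances to all six points
    [~, nn] = min(d);
    pred(i) = lab(nn);
end
figure; contour(gx, gy, pred, [0 0], 'k'); hold on;   % boundary where the predicted label flips
scatter(X(lab==1,1), X(lab==1,2), 'filled'); scatter(X(lab==-1,1), X(lab==-1,2));
legend('boundary', 'positive', 'negative');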
Problem 3
Given $X, Y, Z$, now please follow the idea/method used in LDA/PCA to find the best solution to:
\[
\arg\max_{a,\,b}\; a^T Z b
\quad \text{s.t.}\quad a^T X a = 1,\; b^T Y b = 1.
\tag{1}
\]
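A hedged setup sketch (not a full solution): following the Lagrangian treatment used for PCA/LDA, one would typically introduce multipliers for the two constraints and reduce the stationarity conditions to a (generalized) eigenvalue problem; carrying out that reduction is the intended exercise.
\[
L(a, b, \lambda, \mu) = a^T Z b - \tfrac{\lambda}{2}\,(a^T X a - 1) - \tfrac{\mu}{2}\,(b^T Y b - 1),
\qquad
\frac{\partial L}{\partial a} = Z b - \lambda X a = 0,
\quad
\frac{\partial L}{\partial b} = Z^T a - \mu Y b = 0.
\]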
Homework Set 4 CPSC 8420
Problem 1
Consider the soft-margin SVM, where we have the objective and constraints as follows:
\[
\begin{aligned}
\min_{w,b,\xi}\;\; & \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.}\;\; & y_i(w^T x_i + b) \ge 1 - \xi_i \quad (i = 1, 2, \dots, m) \\
& \xi_i \ge 0 \quad (i = 1, 2, \dots, m)
\end{aligned}
\tag{1}
\]
Now we consider an alternative formulation:
\[
\begin{aligned}
\min_{w,b,\xi}\;\; & \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\sum_{i=1}^{m}\xi_i^2 \\
\text{s.t.}\;\; & y_i(w^T x_i + b) \ge 1 - \xi_i \quad (i = 1, 2, \dots, m)
\end{aligned}
\tag{2}
\]
1. Different from Eq. (1), we now drop the non-negativity constraint on $\xi_i$. Please show that the optimal value of the objective is the same when this constraint is removed.
2. What’s the generalized Lagrangian of the new soft margin SVM optimization problem?
3. Now please minimize the Lagrangian with respect to w, b, and ξ.
4. What is the dual of this version soft margin SVM optimization problem? (should be similar
to Eq. (10) in the slides)
5. Please analyze the bias-variance trade-off as $C$ increases.
Problem 2
Recall the vanilla SVM objective:
\[
L(w, b, \alpha) = \frac{1}{2}\|w\|_2^2 - \sum_{i=1}^{m}\alpha_i \big[y_i(w^T x_i + b) - 1\big],
\quad \text{s.t. } \alpha_i \ge 0.
\tag{3}
\]
If we denote the margin as $\gamma$ and the vector $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_m]$, please show that $\gamma^2 \cdot \|\alpha\|_1 = 1$.