## Homework Set 1 CPSC 8420

## Ridge Regression

Please show that for arbitrary $A \in \mathbb{R}^{n \times p}$,

$$
(A^T A + \lambda I_p)^{-1} A^T = A^T (A A^T + \lambda I_n)^{-1},
$$

where $\lambda > 0$.

Now assume $n = 100$. Please compare the time consumption of the two sides for $p = [10, 100, 1000, 2000]$ and plot the results appropriately (e.g., in one figure where the X-axis denotes $p$ and the Y-axis the time consumption).
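As a quick numerical sanity check, the identity and the timing comparison can be sketched as below, in Python/NumPy rather than the course's Matlab (the helper names and the choice $\lambda = 0.1$ are illustrative assumptions). The left side inverts a $p \times p$ matrix while the right side inverts an $n \times n$ one, which is what drives the timing gap:

```python
import time
import numpy as np

def left_form(A, lam):
    # (A^T A + lam*I_p)^{-1} A^T : solves against a p-by-p matrix
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T)

def right_form(A, lam):
    # A^T (A A^T + lam*I_n)^{-1} : inverts an n-by-n matrix
    n = A.shape[0]
    return A.T @ np.linalg.inv(A @ A.T + lam * np.eye(n))

rng = np.random.default_rng(0)
n, lam = 100, 0.1
for p in [10, 100, 1000, 2000]:
    A = rng.standard_normal((n, p))
    t0 = time.perf_counter(); L = left_form(A, lam); t1 = time.perf_counter()
    R = right_form(A, lam); t2 = time.perf_counter()
    assert np.allclose(L, R)   # the two sides agree numerically
    print(f"p={p}: left {t1 - t0:.4f}s, right {t2 - t1:.4f}s")
```

Collecting the two timings per $p$ and plotting them against $p$ gives the requested figure.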

## Bias–variance trade-off for k-NN

Assume $y = f(x) + \varepsilon$, where $E(\varepsilon) = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$. Please show that:

$$
\operatorname{Err}(x_0) = \sigma^2 + \frac{\sigma^2}{k} + \left[ f(x_0) - \frac{1}{k} \sum_{l=1}^{k} f(x_l) \right]^2, \tag{1}
$$

where $x_l$ denotes the $l$-th nearest-neighbour data point of $x_0$. Please justify how the Bias and the Variance change as $k$ increases, and explain where necessary.
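Decomposition (1) can also be checked empirically. The Monte-Carlo sketch below is a Python/NumPy illustration only (the choice $f(x) = \sin x$, the fixed design, the query point $x_0$, and the noise level are all assumptions, not part of the problem):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                              # assumed true function, for illustration
sigma, k, n_train = 0.5, 5, 200
xs = np.linspace(0.0, 3.0, n_train)     # fixed training inputs
x0 = 1.5
nn = np.argsort(np.abs(xs - x0))[:k]    # indices of the k nearest neighbours of x0

trials = 20000
Y = f(xs) + sigma * rng.standard_normal((trials, n_train))  # fresh noise per trial
y0 = f(x0) + sigma * rng.standard_normal(trials)            # independent test response
err_hat = np.mean((y0 - Y[:, nn].mean(axis=1)) ** 2)        # Monte-Carlo Err(x0)

# Right-hand side of Eq. (1): irreducible noise + variance + squared bias
err_pred = sigma**2 + sigma**2 / k + (f(x0) - f(xs[nn]).mean()) ** 2
print(err_hat, err_pred)   # the two should agree closely
```

Varying `k` in this sketch shows the variance term shrinking while the bias term grows, which is the trade-off the question asks you to justify.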


*(Figure 1 appears here: three panels whose X-axes are "size of training set", "degree", and "log lambda", each plotting train and test MSE.)*

Figure 1: MSE vs (a) training set size, (b) polynomial degree, (c) size of ridge penalty. Solid red = training, dotted black = test.

| x | y |
|---|---|
| 0 | $(-1, -1)^T$ |
| 0 | $(-1, -2)^T$ |
| 0 | $(-2, -1)^T$ |
| 1 | $(1, 1)^T$ |
| 1 | $(1, 2)^T$ |
| 1 | $(2, 1)^T$ |

Compute $\hat{W}$ from the above data.


## Shrinkage Methods

For the vanilla linear regression model $\min \|y - A\beta\|_2^2$, we denote the solution as $\hat{\beta}_{LS}$; for the ridge regression model $\min \|y - A\beta\|_2^2 + \lambda \|\beta\|_2^2$, we denote the solution as $\hat{\beta}^{Ridge}_{\lambda}$; for the Lasso model $\min \frac{1}{2}\|y - A\beta\|_2^2 + \lambda \|\beta\|_1$, we denote the solution as $\hat{\beta}^{Lasso}_{\lambda}$; for the Subset Selection model $\min \frac{1}{2}\|y - A\beta\|_2^2 + \lambda \|\beta\|_0$, we denote the solution as $\hat{\beta}^{Subset}_{\lambda}$. Now please derive each $\hat{\beta}$ given $y$, $A$ (s.t. $A^T A = I$), and $\lambda$. Also, show the relationship of (each element in) $\hat{\beta}^{Ridge}_{\lambda}$, $\hat{\beta}^{Lasso}_{\lambda}$, $\hat{\beta}^{Subset}_{\lambda}$ with (that in) $\hat{\beta}_{LS}$, respectively. (You are encouraged to illustrate the relationships with figures appropriately.)
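For reference while checking your derivations, the sketch below (Python/NumPy, for illustration) encodes the standard closed forms these derivations lead to under an orthonormal design $A^T A = I$: uniform scaling for ridge, soft thresholding for the Lasso, and hard thresholding for subset selection. Deriving them, and plotting each operator against $\hat{\beta}_{LS}$, is the exercise:

```python
import numpy as np

def shrinkage_solutions(beta_ls, lam):
    """Element-wise closed forms for an orthonormal design (A^T A = I),
    where beta_ls = A^T y is the least-squares solution."""
    ridge = beta_ls / (1.0 + lam)                                     # uniform shrinkage
    lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0) # soft thresholding
    subset = np.where(np.abs(beta_ls) > np.sqrt(2.0 * lam), beta_ls, 0.0)  # hard thresholding
    return ridge, lasso, subset

# Plotting each output against beta_ls over a range such as [-3, 3]
# reproduces the usual ridge/Lasso/subset comparison figure.
beta_ls = np.linspace(-3.0, 3.0, 7)
ridge, lasso, subset = shrinkage_solutions(beta_ls, lam=1.0)
```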

## Linear Regression and its extension

In the Boston housing dataset (described at http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html), there are 506 records. We will use the first 13 features as inputs, x, and the 14th feature, median house price, as the output y. All features are continuous, except feature 4, which is binary; however, we will treat it like any other continuous variable.

1. Load the housing.data file. We will use the first 300 cases for training and the remaining 206 cases for testing. However, the records seem to be sorted in some kind of order; to eliminate this, we will shuffle the data before splitting into a training/test set. So we can all compare results, let us use the following convention:

```matlab
data = load('housing.data');
x = data(:, 1:13);
y = data(:, 14);
[n, d] = size(x);
seed = 2; rand('state', seed); randn('state', seed);
perm = randperm(n); % remove any possible ordering fx
x = x(perm, :); y = y(perm);
Ntrain = 300;
Xtrain = x(1:Ntrain, :); ytrain = y(1:Ntrain);
Xtest = x(Ntrain+1:end, :); ytest = y(Ntrain+1:end);
```

2. Now extract the first n records of the training data, for n ∈ {25, 50, 75, 100, 150, 200, 300}. For each such training subset, standardize it (you may use the zscore function in Matlab), and fit a linear regression model using least squares. (Remember to include an offset term.) Then standardize the whole test set in the same way. Compute the mean squared error on the training subset and on the whole test set. Plot MSE versus training set size. You should get a plot like Figure 1(a). Turn in your plot and code. Explain why the test error decreases as n increases, and why the train error increases as n increases. Why do the curves eventually meet? As a debugging aid, here are the regression weights I get when I train on the first 25 cases (the first term is the offset, w0): [26.11, −0.58, 3.02, . . . , −0.21, −0.27, −1.16].

3. We will now replace the original features with an expanded set of features based on higher-order terms. (We will ignore interaction terms.) For example, a quadratic expansion gives:

$$
\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1d} \\
\vdots & \vdots & & \vdots \\
x_{n1} & x_{n2} & \dots & x_{nd}
\end{bmatrix}
\longrightarrow
\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1d} & x_{11}^2 & x_{12}^2 & \dots & x_{1d}^2 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
x_{n1} & x_{n2} & \dots & x_{nd} & x_{n1}^2 & x_{n2}^2 & \dots & x_{nd}^2
\end{bmatrix}
\tag{2}
$$

The provided function degexpand(X, deg, addOnes) will replace each row of X with all powers up to degree deg. Use this function to train (by least squares) models with degrees 1 to 6. Use all the training data. Plot the MSE on the training and test sets vs degree. You should get a plot like Figure 1(b). Turn in your plot and code. Explain why the test error decreases and then increases with degree, and why the train error decreases with degree.

4. Now we will use ridge regression to regularize the degree-6 polynomial. Fit models using ridge regression with the following values for λ:

lambdas = [0 logspace(-10, 10, 10)]

Use all the training data. Plot the MSE on the training and test sets vs log10(λ). You should get a plot like Figure 1(c). Turn in your plot and code. Explain why the test error goes down and then up with increasing λ, and why the train error goes up with increasing λ.

5. We turn to the Lasso method with objective $\frac{1}{2}\|X\beta - y\|_2^2 + \lambda \|\beta\|_1$, where λ varies in:

lambdas = [logspace(-10, 10, 10)]

and we make use of all training samples with no feature expansion. Please plot how β changes as λ changes.
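Unlike ridge, the Lasso has no closed form for a general design, so a solver is needed; Matlab's `lasso` works, or a few lines of proximal gradient descent. The sketch below is an illustrative Python/NumPy version (the name `lasso_ista` and the iteration count are assumptions, not part of the assignment):

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||X @ b - y||^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y))                        # gradient step
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return b

# One solution per lambda; plotting each coefficient against log10(lambda)
# gives the requested coefficient path.
lambdas = np.logspace(-10, 10, 10)
```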

## Homework Set 2 CPSC 8420

Problem 1

For PCA, from the perspective of maximizing variance, please show that the solution $\phi$ of

$$
\max_{\phi} \ \|X\phi\|_2^2 \quad \text{s.t.} \ \|\phi\|_2 = 1
$$

is exactly the first column of $U$, where $[U, S] = \mathrm{svd}(X^T X)$. (Note: you need to prove why it is better than any other reasonable combination of the $U_i$, say $\hat{\phi} = 0.8 * U(:, 1) + 0.6 * U(:, 2)$, which also satisfies $\|\hat{\phi}\|_2 = 1$.)

Problem 2

Why might we prefer to minimize the sum of absolute residuals instead of the residual sum of squares for some data sets? Recall that in the clustering method K-means, the centroid is computed as the mean of the data points belonging to the same cluster; so what about K-medians? What is its advantage over K-means? Please use a synthetic (toy) experiment to illustrate your conclusion.
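A minimal Lloyd-style sketch for such a toy experiment, in Python/NumPy (the deterministic initialization and the use of squared-Euclidean assignment for both variants are simplifying assumptions):

```python
import numpy as np

def lloyd(X, k, center_fn, n_iter=50):
    """Lloyd iterations; center_fn = np.mean gives K-means, np.median gives K-medians."""
    # Deterministic init: k points spread evenly through the data (an assumption).
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)            # assign each point to its nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = center_fn(X[labels == j], axis=0)
    return centers, labels

# Toy data: a tight cluster at the origin plus one extreme outlier.
X = np.vstack([np.zeros((9, 2)), [[100.0, 100.0]]])
c_mean, _ = lloyd(X, 1, np.mean)    # centroid dragged toward the outlier
c_med, _ = lloyd(X, 1, np.median)   # median stays with the bulk of the data
print(c_mean[0], c_med[0])
```

The printout makes the robustness point: the mean centroid moves far from the nine clustered points, while the median centroid does not.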

Problem 3

Let's revisit the Least Squares Problem: $\min_{\beta} \ \frac{1}{2}\|y - A\beta\|_2^2$, where $A \in \mathbb{R}^{n \times p}$.

1. Please show that if $p > n$, then the vanilla solution $(A^T A)^{-1} A^T y$ is not applicable any more.

2. Let's assume $A = [1, 2, 4;\ 1, 3, 5;\ 1, 7, 7;\ 1, 8, 9]$ and $y = [1; 2; 3; 4]$. Please show via experimental results that the Gradient Descent method obtains the optimal solution at a linear convergence rate if the learning rate is fixed to $\frac{1}{\sigma_{\max}(A^T A)}$ and $\beta_0 = [0; 0; 0]$.

3. Now let's consider ridge regression: $\min_{\beta} \ \frac{1}{2}\|y - A\beta\|_2^2 + \frac{\lambda}{2}\|\beta\|_2^2$, where $A$, $y$, $\beta_0$ remain the same as above while the learning rate is fixed to $\frac{1}{\lambda + \sigma_{\max}(A^T A)}$. With $\lambda$ varying over $0.1, 1, 10, 100, 200$, please show that the Gradient Descent method converges faster for larger $\lambda$.
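A sketch of this ridge experiment in Python/NumPy (the stopping tolerance of 1e-8 and the iteration cap are illustrative choices; the count returned is the number of steps to reach the tolerance):

```python
import numpy as np

A = np.array([[1.0, 2, 4], [1, 3, 5], [1, 7, 7], [1, 8, 9]])
y = np.array([1.0, 2, 3, 4])

def gd_ridge(A, y, lam, tol=1e-8, max_iter=100000):
    """Gradient descent on 0.5*||y - A b||^2 + (lam/2)*||b||^2
    with step size 1/(lam + sigma_max(A^T A)) and b0 = 0."""
    sigma_max = np.linalg.norm(A.T @ A, 2)     # largest eigenvalue of A^T A
    step = 1.0 / (lam + sigma_max)
    b = np.zeros(A.shape[1])
    b_star = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
    for k in range(1, max_iter + 1):
        b = b - step * (A.T @ (A @ b - y) + lam * b)   # gradient of the ridge objective
        if np.linalg.norm(b - b_star) < tol:
            return k
    return max_iter

for lam in [0.1, 1, 10, 100, 200]:
    print(lam, gd_ridge(A, y, lam))
```

Larger λ improves the conditioning of $A^T A + \lambda I$, so the iteration counts printed should drop sharply as λ grows.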

Problem 4

Please download the image from https://en.wikipedia.org/wiki/Lenna#/media/File:Lenna_(test_image).png with dimension 512 × 512 × 3. Assume for each RGB channel data X we have [U, Σ, V] = svd(X). Please show the compression ratio and the reconstructed image if we choose the first 2, 5, 20, 50, 80, 100 components respectively. Also, please determine the best number of components to obtain a good trade-off between data compression ratio and reconstruction image quality. (Open question; your solution will be accepted as long as it's reasonable.)
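The per-channel bookkeeping can be sketched as follows in Python/NumPy (a random matrix stands in for an actual image channel here; the ratio counts the k(m + n + 1) numbers stored for a rank-k factorization against the m·n original entries, one common convention among several):

```python
import numpy as np

def truncated_svd_channel(X, k):
    """Rank-k reconstruction of one channel and its compression ratio."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    m, n = X.shape
    ratio = (k * (m + n + 1)) / (m * n)   # stored numbers vs original entries
    return Xk, ratio

# e.g. one 512x512 channel (random stand-in for an image channel)
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 512))
for k in [2, 5, 20, 50, 80, 100]:
    Xk, r = truncated_svd_channel(X, k)
    print(k, round(r, 4))
```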

## Homework Set 3 CPSC 8420

Problem 1

Given data-points {{1, 3}, {2, 5}, {3, 4}, {4, 3}, {5, 2}, {5, 1}}.

1. Please scatter-plot each data point within one figure (you can use Matlab, Python or any

other programming language).

2. Now if we want to reduce the dimension from 2 to 1 by PCA, please determine the projection

line which crosses the origin (please plot the line based on the scatter-plot figure above).

3. Assume the first 4 data points belong to one class, while the remaining 2 belong to the other. Now if we want to reduce the dimension from 2 to 1 by LDA, please determine the projection line which crosses the origin (you are expected to plot the line based on the scatter-plot figure).

Problem 2

Given positive data-set {{1, 1}, {2, 2}, {2, 3}}, as well as negative data-set {{3, 2}, {3, 3}, {4, 4}},

please determine the decision boundary when leveraging k-NN where k = 1 and k = 3 respectively.
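One way to see the boundary empirically (to check an analytic answer) is to evaluate the classifier on a dense grid; a small Python/NumPy sketch, where the grid range and the majority-vote implementation are illustrative choices:

```python
import numpy as np

pos = np.array([[1, 1], [2, 2], [2, 3]], dtype=float)  # positive data-set
neg = np.array([[3, 2], [3, 3], [4, 4]], dtype=float)  # negative data-set
X = np.vstack([pos, neg])
y = np.array([1, 1, 1, 0, 0, 0])

def knn_predict(q, k):
    """Majority vote among the k nearest training points (odd k avoids vote ties)."""
    d = np.linalg.norm(X - np.asarray(q, dtype=float), axis=1)
    return int(round(y[np.argsort(d)[:k]].mean()))

# Label a grid of query points; the boundary runs between cells whose labels differ.
for k in (1, 3):
    grid = [(i * 0.25, j * 0.25) for i in range(21) for j in range(21)]
    labels = [knn_predict(q, k) for q in grid]
```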

Problem 3

Given X, Y, Z, now please follow the idea/method used in LDA/PCA to find the best solution to:

$$
\arg\max_{a,\,b} \ a^T Z b \qquad \text{s.t.} \quad a^T X a = 1, \quad b^T Y b = 1 \tag{1}
$$

## Homework Set 4 CPSC 8420

Problem 1

Considering soft margin SVM, where we have the objective and constraints as follows:

$$
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.} \quad & y_i(w^T x_i + b) \ge 1 - \xi_i \quad (i = 1, 2, \dots, m) \\
& \xi_i \ge 0 \quad (i = 1, 2, \dots, m)
\end{aligned}
\tag{1}
$$

Now we consider an alternative formulation:

$$
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2}\|w\|_2^2 + \frac{C}{2}\sum_{i=1}^{m}\xi_i^2 \\
\text{s.t.} \quad & y_i(w^T x_i + b) \ge 1 - \xi_i \quad (i = 1, 2, \dots, m)
\end{aligned}
\tag{2}
$$

1. Different from Eq. (1), we now drop the non-negativity constraint on $\xi_i$; please show that the optimal value of the objective remains the same when this constraint is removed.

2. What’s the generalized Lagrangian of the new soft margin SVM optimization problem?

3. Now please minimize the Lagrangian with respect to w, b, and ξ.

4. What is the dual of this version of the soft margin SVM optimization problem? (It should be similar to Eq. (10) in the slides.)

5. Please analyze the bias-variance trade-off as C increases.

Problem 2

Recall vanilla SVM objective:

$$
L(w, b, \alpha) = \frac{1}{2}\|w\|_2^2 - \sum_{i=1}^{m} \alpha_i \big[ y_i(w^T x_i + b) - 1 \big] \quad \text{s.t.} \ \alpha_i \ge 0 \tag{3}
$$

If we denote the margin as $\gamma$ and the vector $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_m]$, now please show that $\gamma^2 \cdot \|\alpha\|_1 = 1$.