Description
In this homework, we will study kernelized forms of ridge regression and PCA.
1 Kernel Ridge Regression
We derive kernel ridge regression in two steps.
Step 1. Consider solving the ridge regression problem with a training set
\[
S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \subset \mathbb{R}^p \times \mathbb{R}.
\]
Let $\hat{w}_\lambda$ be the ridge regression solution with regularization parameter $\lambda$, i.e.
\[
\hat{w}_\lambda := \arg\min_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \langle w, x_i \rangle \big)^2 + \lambda \lVert w \rVert_2^2 .
\]
In class, we derived a closed-form expression for $\hat{w}_\lambda$ in terms of the design matrix $X$ and response vector $y$, defined as
\[
X = \begin{bmatrix} \text{---}\; x_1^\top \;\text{---} \\ \text{---}\; x_2^\top \;\text{---} \\ \vdots \\ \text{---}\; x_n^\top \;\text{---} \end{bmatrix}
\quad \text{and} \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
\]
This closed-form expression turns out to be
\[
\hat{w}_\lambda = (X^\top X + n\lambda I)^{-1} X^\top y.
\]
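As a side note, here is a minimal NumPy sketch of this closed form (the function name ridge_closed_form and its arguments are illustrative choices, not part of the assignment):

import numpy as np

def ridge_closed_form(X, y, lam):
    # Primal closed form: w_hat = (X^T X + n * lam * I_p)^{-1} X^T y.
    n, p = X.shape
    # Solve the p x p linear system rather than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)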
We will derive an alternative form for $\hat{w}_\lambda$. First, note that the above closed-form expression comes from solving the optimality equation for $\hat{w}_\lambda$, i.e.
\[
(X^\top X + n\lambda I_p)\, \hat{w}_\lambda = X^\top y,
\]
where $I_p$ is the identity matrix of order $p$. The above equation can be rewritten as
\[
n\lambda\, \hat{w}_\lambda = X^\top (y - X \hat{w}_\lambda). \tag{1}
\]
Now, define $\alpha := y - X \hat{w}_\lambda$.
Extra credit (weightage: 2%). Using equation (1) and the definition of $\alpha$, show that $\alpha$ satisfies the following equation:
\[
\frac{1}{n\lambda} X X^\top \alpha + \alpha = y. \tag{2}
\]
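As a numerical sanity check of equation (2) (an illustrative NumPy sketch, not part of the assignment and not a substitute for the derivation):

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 3, 0.5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# w_hat from the primal closed form, then alpha as defined above.
w_hat = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
alpha = y - X @ w_hat
# Equation (2): (1/(n*lam)) X X^T alpha + alpha should equal y.
assert np.allclose((X @ X.T) @ alpha / (n * lam) + alpha, y)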
Extra credit (weightage: 1%). Solve equation (2) for $\alpha$, and use your solution in equation (1) to get the following alternative expression for $\hat{w}_\lambda$:
\[
\hat{w}_\lambda = \frac{1}{n\lambda} X^\top \left( \frac{1}{n\lambda} X X^\top + I_n \right)^{-1} y = X^\top (X X^\top + n\lambda I_n)^{-1} y. \tag{3}
\]
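The following NumPy sketch (again with illustrative names) checks numerically that the alternative form (3) agrees with the primal closed form:

import numpy as np

def ridge_dual_form(X, y, lam):
    # Equation (3): w_hat = X^T (X X^T + n * lam * I_n)^{-1} y.
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + n * lam * np.eye(n), y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
lam = 0.1
# Primal closed form (X^T X + n * lam * I_p)^{-1} X^T y for comparison.
w_primal = np.linalg.solve(X.T @ X + 50 * lam * np.eye(5), X.T @ y)
assert np.allclose(w_primal, ridge_dual_form(X, y, lam))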
Step 2. Now suppose labeled points $(x, y)$ are drawn from $\mathcal{X} \times \mathbb{R}$, where $\mathcal{X}$ is the feature space. Suppose you are also given a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which computes the inner product for a feature map $\phi : \mathcal{X} \to \mathcal{H}$, where $\mathcal{H}$ is the reproducing kernel Hilbert space for $k$. (If the jargon "reproducing kernel Hilbert space" alarms you, just think of $\mathcal{H}$ as $\mathbb{R}^D$ for some $D \gg p$.) In other words,
\[
k(x, x') = \langle \phi(x), \phi(x') \rangle.
\]
Suppose you have collected a training set of $n$ examples
\[
S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \subset \mathcal{X} \times \mathbb{R}.
\]
Let $\hat{w}_\lambda$ be the solution of the ridge regression problem with regularization parameter $\lambda$ when the data points are mapped to $\mathcal{H}$ using the feature mapping $\phi$. In other words, consider doing ridge regression using the transformed training set
\[
S' = \{(\phi(x_1), y_1), (\phi(x_2), y_2), \ldots, (\phi(x_n), y_n)\} \subset \mathcal{H} \times \mathbb{R}.
\]
Mathematically, $\hat{w}_\lambda$ can be defined as
\[
\hat{w}_\lambda := \arg\min_{w \in \mathcal{H}} \; \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \langle w, \phi(x_i) \rangle \big)^2 + \lambda \lVert w \rVert_2^2 .
\]
Part 1 of assignment (weightage: 4%). Given a test point $x \in \mathcal{X}$, give the pseudocode for computing the prediction using the ridge regression solution, i.e. $\langle \hat{w}_\lambda, \phi(x) \rangle$. You may use equation (3) for this even if you haven't attempted to prove the equation in the extra credit question.
Hint: construct the design matrix $X$ for the transformed training set $S'$, and observe that the matrix $X X^\top$ can be computed using the kernel function $k$.
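To make the hint concrete, here is a minimal sketch of how the matrix $X X^\top$ in equation (3) can be filled in from kernel evaluations alone; it assumes some Python callable kernel(x, xp) implementing $k$, and all names (gram_matrix, rbf) are illustrative rather than part of the assignment:

import numpy as np

def gram_matrix(kernel, xs):
    # Entry (i, j) of X X^T for the transformed data is <phi(x_i), phi(x_j)> = k(x_i, x_j).
    n = len(xs)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(xs[i], xs[j])
    return K

# Example with a Gaussian (RBF) kernel as an assumed choice of k.
rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))
xs = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([2.0, 2.0])]
K = gram_matrix(rbf, xs)  # plays the role of X X^T in equation (3)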
2 Kernel PCA
Next, we turn to kernel PCA. We use the connection between PCA and SVD to derive a kernelized
form of PCA.
Let $X$ be the design matrix, and let $X = U S V^\top$ be its SVD. Recall the following facts from class:
1. The columns of $U$ are the eigenvectors of $X X^\top$, with eigenvalues given by the squares of the diagonal entries of $S$.
2. The rank $\ell$ PCA representation of the training set is given by the columns of $S_\ell^{-1} U_\ell^\top X X^\top$.
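Both facts can be verified numerically; the following NumPy sketch (with illustrative variable names) does so on a small random design matrix:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))  # toy design matrix, rows are data points
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Fact 1: the columns of U are eigenvectors of X X^T with eigenvalues s**2.
assert np.allclose((X @ X.T) @ U, U * s**2)

# Fact 2: S_ell^{-1} U_ell^T X X^T has the rank-ell representations as its columns,
# which matches projecting the rows of X onto the top-ell right singular vectors.
ell = 2
reps = np.diag(1.0 / s[:ell]) @ U[:, :ell].T @ (X @ X.T)  # shape (ell, n)
assert np.allclose(reps, Vt[:ell] @ X.T)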
Extra credit (weightage: 3%). Now let $X$ be the design matrix for the transformed training set $S'$. Explain how you can compute $S_\ell$ and $U_\ell$ using the kernel function $k$ without having to explicitly compute the mapping $\phi$ for any training points.
Part 2 of assignment (weightage: 2%). Suppose you have computed $S_\ell$ and $U_\ell$. Specify the dimensions of these two matrices. Explain how you can compute the rank $\ell$ PCA of the transformed training set using these matrices and the kernel function $k$ without having to explicitly compute the mapping $\phi$ for any training points.