## Description

Problem 1. (15 points). SVMs obtain non-linear decision boundaries by mapping the feature vectors $\bar{X} \in \mathbb{R}^d$ to a possibly high-dimensional space via a function $\phi : \mathbb{R}^d \to \mathbb{R}^m$, and then finding a linear decision boundary in the new space.

We also saw that to implement an SVM, it suffices to know the kernel function $K(\bar{X}_i, \bar{X}_j) = \phi(\bar{X}_i) \cdot \phi(\bar{X}_j)$, without ever explicitly specifying the function $\phi$.

Recall Mercer's theorem: $K$ is a kernel function if and only if for any $n$ vectors $\bar{X}_1, \ldots, \bar{X}_n \in \mathbb{R}^d$ and any real numbers $c_1, \ldots, c_n$,

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(\bar{X}_i, \bar{X}_j) \ge 0.$$
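As a quick numeric illustration of this condition (not a proof), the double sum is a quadratic form over the Gram matrix, so it can be spot-checked on random points. The sketch below assumes NumPy and uses the linear kernel, which is a valid kernel since $\phi$ can be taken to be the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

def K(x, y):
    # Linear kernel K(x, y) = x . y, a known-valid kernel (phi = identity).
    return np.dot(x, y)

# Mercer's condition: for any points X_1..X_n and reals c_1..c_n,
# sum_ij c_i c_j K(X_i, X_j) must be non-negative.
X = rng.normal(size=(20, 3))   # 20 random points in R^3
c = rng.normal(size=20)        # arbitrary real coefficients

gram = np.array([[K(xi, xj) for xj in X] for xi in X])
quad_form = c @ gram @ c       # the double sum, written as a quadratic form
assert quad_form >= 0

# Equivalently: the Gram matrix must be positive semi-definite
# (all eigenvalues non-negative, up to floating-point round-off).
assert np.linalg.eigvalsh(gram).min() >= -1e-9
```

Checking random points and coefficients can only refute the condition, never establish it; the proof in part 1 is what establishes it for all choices.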

1. Prove the following half of Mercer's theorem (which we showed in class): if $K$ is a kernel, then $\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(\bar{X}_i, \bar{X}_j) \ge 0$.

2. Let $d = 1$, and $x, y \in \mathbb{R}$. Is the function $K(x, y) = x + y$ a kernel?

3. Let $d = 1$, and $x, y \in \mathbb{R}$. Is $K(x, y) = xy + 1$ a kernel?

4. Suppose $d = 2$, namely the original features are of the form $\bar{X}_i = [\bar{X}^1, \bar{X}^2]$. Show that $K(\bar{X}, \bar{Y}) = (1 + \bar{X} \cdot \bar{Y})^2$ is a kernel function. This is called the quadratic kernel. (Hint: find a $\phi : \mathbb{R}^2 \to \mathbb{R}^m$ (for some $m$) such that $\phi(\bar{X}) \cdot \phi(\bar{Y}) = (1 + \bar{X} \cdot \bar{Y})^2$.)

5. Consider the training examples $\langle [0, 1], 1 \rangle$, $\langle [1, 2], 1 \rangle$, $\langle [-1, 2], 1 \rangle$, $\langle [0, 11], 1 \rangle$, $\langle [3, 4], -1 \rangle$, $\langle [-3, 4], -1 \rangle$, $\langle [1, -1], -1 \rangle$, $\langle [-1, -1], -1 \rangle$. We have plotted the data points below.

   - Is the data linearly classifiable in the original 2-d space? If yes, give any linear decision boundary that separates the data. If no, explain why.
   - Is the data linearly classifiable in the feature space corresponding to the quadratic kernel? If yes, give any linear decision boundary that separates the data. If no, explain why.
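One way to experiment with part 5 numerically is to fit SVMs with scikit-learn: the quadratic kernel $(1 + \bar{X} \cdot \bar{Y})^2$ from part 4 corresponds to sklearn's polynomial kernel with `degree=2`, `coef0=1`, `gamma=1`. A minimal sketch, assuming scikit-learn is installed (the large `C` approximates a hard margin); it reports training accuracy only, and is a sanity check rather than a substitute for the requested explanations:

```python
import numpy as np
from sklearn.svm import SVC

# The eight training examples from part 5: <[x1, x2], label>.
X = np.array([[0, 1], [1, 2], [-1, 2], [0, 11],
              [3, 4], [-3, 4], [1, -1], [-1, -1]], dtype=float)
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

# A linear SVM in the original 2-d space (large C approximates hard margin).
linear = SVC(kernel="linear", C=1e6).fit(X, y)

# The quadratic kernel K(X, Y) = (1 + X . Y)^2 from part 4:
# sklearn's polynomial kernel with degree=2, coef0=1, gamma=1.
quadratic = SVC(kernel="poly", degree=2, coef0=1, gamma=1, C=1e6).fit(X, y)

print("linear training accuracy:   ", linear.score(X, y))
print("quadratic training accuracy:", quadratic.score(X, y))
```

Training accuracy of 1.0 indicates the kernel's feature space admits a separating hyperplane for this data; anything less indicates it does not.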


*(Figure: scatter plot of the eight training points; horizontal axis from −8 to 8, vertical axis from −5 to 15.)*

Problem 2. (10 points). Let $f, h_i$, $1 \le i \le n$, be real-valued functions and let $\alpha \in \mathbb{R}^n$. Let $L(z, \alpha) = f(z) + \sum_{i=1}^{n} \alpha_i h_i(z)$. In this problem, we will prove that the following two optimization problems are equivalent.


$$\min_{z} f(z) \quad \text{s.t. } h_i(z) \le 0, \ i = 1, \ldots, n. \tag{1}$$

$$\min_{z} \max_{\alpha \ge 0} L(z, \alpha) \tag{2}$$

Let $(z^*, \alpha^*)$ be the solution of (2) and let $z_p^*$ be the solution of (1). Prove that:

$$L(z^*, \alpha^*) = f(z_p^*)$$

Hint: Use the fact that for any $z$ and any $\alpha \ge 0$, $L(z^*, \alpha^*) \ge L(z^*, \alpha)$ and $L(z^*, \alpha^*) \le L(z, \alpha_z)$, where $\alpha_z = \arg\max_{\alpha \ge 0} L(z, \alpha)$.

You may follow the steps below, but this is not required as long as your proof is correct.

1. Prove that $L(z^*, \alpha^*) \le f(z_p^*)$.

2. Prove that $L(z^*, \alpha^*) \ge f(z_p^*)$.
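A toy numeric check of the claimed equivalence (no substitute for the proof) can be done on a hypothetical instance, here $f(z) = (z - 2)^2$ with the single constraint $h(z) = z - 1 \le 0$; the finite grids and the cap on $\alpha$ are assumptions needed to make the inner maximization computable:

```python
import numpy as np

# Hypothetical instance: f(z) = (z - 2)^2, one constraint h(z) = z - 1 <= 0.
f = lambda z: (z - 2.0) ** 2
h = lambda z: z - 1.0

zs = np.linspace(-3, 3, 6001)        # grid over z
alphas = np.linspace(0, 50, 501)     # grid over alpha >= 0 (capped at 50)

# Problem (1): constrained minimum of f over the feasible set {h(z) <= 0}.
p_star = f(zs[h(zs) <= 0]).min()

# Problem (2): min over z of max over alpha >= 0 of L(z, alpha).
# L has shape (len(zs), len(alphas)) via broadcasting.
L = f(zs)[:, None] + alphas[None, :] * h(zs)[:, None]
minmax = L.max(axis=1).min()

print(p_star, minmax)   # both values should agree (here: 1.0, at z = 1)
```

Intuitively, for infeasible $z$ the inner max blows up with $\alpha$, so the outer min avoids infeasible points and the two problems pick out the same value.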

Problem 3. (15 points). In this problem, we derive the dual formulation of the soft-margin SVM problem with $\bar{\xi} = \xi_1, \ldots, \xi_n$:

$$\begin{aligned} \min_{\bar{w}, t, \bar{\xi}} \quad & \frac{1}{2}\|\bar{w}\|_2^2 + C \cdot \sum_{i=1}^{n} \xi_i \\ \text{such that} \quad & 1 - \xi_i - y_i(\bar{X}_i \cdot \bar{w} - t) \le 0, \quad i = 1, \ldots, n, \\ & -\xi_i \le 0, \quad i = 1, \ldots, n. \end{aligned} \tag{3}$$


Now we can define $2n$ Lagrangian variables, $\alpha = \alpha_1, \ldots, \alpha_n$ and $\beta = \beta_1, \ldots, \beta_n$, corresponding to these constraints, and obtain the following Lagrangian:

$$L(\bar{w}, t, \bar{\xi}, \alpha, \beta) = \frac{1}{2}\|\bar{w}\|_2^2 + C \cdot \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i \left(1 - \xi_i - y_i(\bar{X}_i \cdot \bar{w} - t)\right) - \sum_{i=1}^{n} \beta_i \xi_i \tag{4}$$

The original problem is now equivalent to

$$\min_{\bar{w}, t, \bar{\xi}} \ \max_{\alpha \ge 0, \beta \ge 0} L(\bar{w}, t, \bar{\xi}, \alpha, \beta).$$

For this objective, the minimax theorem says that the min-max problem is equivalent to the max-min problem below. You do not have to prove this, but you are encouraged to do so (using the argument given in the discussion earlier). This gives the following problem:

$$\max_{\alpha \ge 0, \beta \ge 0} \ \min_{\bar{w}, t, \bar{\xi}} L(\bar{w}, t, \bar{\xi}, \alpha, \beta).$$

1. For fixed $\alpha, \beta$, take the gradient of the Lagrangian with respect to $\bar{w}$, and express $\bar{w}$ in terms of the other variables.

2. Differentiate the Lagrangian with respect to $t$, and equate to zero to obtain another equation.

3. Differentiate the Lagrangian with respect to $\xi_i$ and show that $\alpha_i + \beta_i = C$ at the optimum. This shows that $\alpha_i \le C$, since $\beta_i \ge 0$.

4. The expressions from 1, 2, and 3 define one of the KKT conditions. Show that under these conditions,

   $$\min_{\bar{w}, t, \bar{\xi}} L(\bar{w}, t, \bar{\xi}, \alpha, \beta) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (\bar{X}_i \cdot \bar{X}_j).$$

   Basically, use the expressions from 1, 2, and 3 to "cancel out" $\bar{w}$, $t$, and $\bar{\xi}$ in the Lagrangian.

5. Combine the results above to argue that the following optimization is equivalent to the soft-margin SVM we started with:

   $$\begin{aligned} \max_{\alpha \ge 0} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (\bar{X}_i \cdot \bar{X}_j) \\ \text{such that} \quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, n. \end{aligned} \tag{5}$$
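As a numeric sanity check on the dual (5), scikit-learn's `SVC` exposes the quantities $\alpha_i y_i$ for the support vectors via its `dual_coef_` attribute, so the box constraint and the conditions from steps 1 and 2 can be verified on a fitted model. A sketch, assuming a small synthetic dataset (not part of the assignment):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Small synthetic 2-class dataset with labels in {-1, +1}.
X = np.vstack([rng.normal(-1, 1, size=(30, 2)),
               rng.normal(+1, 1, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# clf.dual_coef_ holds alpha_i * y_i for the support vectors,
# so alpha_i = |dual_coef_| there (alpha_i = 0 off the support set).
alpha = np.abs(clf.dual_coef_).ravel()
ysv = y[clf.support_]

# Box constraint from (5): 0 <= alpha_i <= C.
assert np.all(alpha >= 0) and np.all(alpha <= C + 1e-6)

# Step 2 condition: sum_i alpha_i y_i = 0.
assert abs(np.sum(alpha * ysv)) < 1e-6

# Step 1 condition: w = sum_i alpha_i y_i X_i matches the fitted weights.
w = (alpha * ysv) @ clf.support_vectors_
assert np.allclose(w, clf.coef_.ravel(), atol=1e-6)
```

Passing assertions show the solver's optimum satisfies exactly the KKT relations derived in steps 1-3.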

Problem 4. (25 points) SVM Classification. Please refer to the Jupyter Notebook in the assignment, and complete the coding part in it! You can use the sklearn SVM package: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
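For reference, a minimal usage sketch of `sklearn.svm.SVC`; the synthetic dataset below is a hypothetical stand-in for the notebook's actual data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in for the notebook's data: a synthetic binary task.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit an SVM classifier; C and the kernel are the main knobs to tune.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # these are sklearn's defaults
clf.fit(X_tr, y_tr)

print("test accuracy:", clf.score(X_te, y_te))
```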
