Description
1. SVM with Biases.
This problem is about SVMs over $\mathbb{R}^d$ with linearly separable data (i.e., the hard-margin SVM).
Our formulation of SVM required separators to pass through the origin, which differs from some natural geometric notions of margin.
A first fix is provided by lecture 3: by appending a 1 to the inputs, we obtain the convex program
$$\min_{u}\ \frac{1}{2}\|u\|^2 \qquad \text{subject to} \quad u \in \mathbb{R}^{d+1}, \quad 1 - y_i \begin{bmatrix} x_i \\ 1 \end{bmatrix}^{\!\top} u \le 0 \ \ \forall i,$$
and let $\bar u$ denote the optimal solution to this program. (Recall that a convex program is an optimization problem where the objective function is convex, and the constraints are upper bounds on convex functions.)
A second standard fix is to incorporate the bias directly into the optimization problem:
$$\min_{v, b}\ \frac{1}{2}\|v\|^2 \qquad \text{subject to} \quad v \in \mathbb{R}^{d},\ b \in \mathbb{R}, \quad 1 - y_i\big(v^{\top} x_i + b\big) \le 0 \ \ \forall i,$$
and let $(\bar v, \bar b) \in \mathbb{R}^d \times \mathbb{R}$ denote an optimal solution to this program. This second version is a fairly common presentation of the SVM, but it introduces a variety of complexities and for that reason is not presented in lecture.
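(Side note for intuition: under the identification $u = (v, b) \in \mathbb{R}^{d+1}$, the constraint $1 - y_i \begin{bmatrix} x_i \\ 1 \end{bmatrix}^{\!\top} u \le 0$ is exactly $1 - y_i(v^{\top} x_i + b) \le 0$, so the two programs have the same constraints and differ only in their objectives: $\frac{1}{2}\|u\|^2 = \frac{1}{2}\|v\|^2 + \frac{1}{2} b^2$ penalizes the bias, whereas $\frac{1}{2}\|v\|^2$ does not.)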
(a) [hw2] Show that the second formulation is also a convex program (i.e., show that the objective function
and the constraints are all convex in the optimization variable (v, b)).
(b) [hw2] Suppose there is only one datapoint: $x_1 = e_1$, the first standard basis vector, with label $y_1 = +1$. The first formulation will have a unique solution $\bar u$, as discussed in lecture. Show that the second formulation does not have a unique solution.
(c) [hw2] Let's add another datapoint: $x_2 = -a e_1$ for some $a \ge 3$, with label $y_2 = -1$. Now that we have two data points, both convex programs have two constraints. Write out the explicit constraints of the first convex program.
(d) [hw2] Using these two constraints, show that the first coordinate $\bar u_1$ of the optimal solution $\bar u$ satisfies $\bar u_1 \ge \frac{2}{a+1}$.
(e) [hw2] Using parts (c) and (d), find optimal solutions $\bar u$ and $(\bar v, \bar b)$, and prove they are in fact optimal.
Hint: If you are stuck, first try the case $d = 1$. Then study what happens for $d = 2$, $d = 3$, $\ldots$
Hint: $(\bar v, \bar b)$ will be unique.
(f) [hw2] Now we will consider the behavior of $\bar u$ and $\bar v$ as $a$ increases; to this end, write $\bar u_a$ and $\bar v_a$, and consider $a \to \infty$. Determine and formally prove the limiting behavior of $\lim_{a \to \infty} \frac{1}{2}\|\bar u_a\|^2$ and $\lim_{a \to \infty} \frac{1}{2}\|\bar v_a\|^2$.
Hint: The two limits will not be equal.
(g) [hw2] Between the two versions of SVM with bias, which do you prefer? Any answer which contains at
least one complete sentence will receive full credit.
Remark: Initially it may have seemed that both optimization problems have the same solutions; the
purpose of this problem was to highlight that small differences in machine learning methods can lead to
observably different performance.
Solution.
2. SVM Implementation.
Recall that the dual problem of an SVM is
$$\max_{\alpha \in C}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$
where the domain $C = [0, \infty)^n = \{\alpha : \alpha_i \ge 0\}$ for a hard-margin SVM, and $C = [0, C]^n = \{\alpha : 0 \le \alpha_i \le C\}$ for a soft-margin SVM. Equivalently, we can frame this as the minimization problem
$$\min_{\alpha \in C}\ f(\alpha) := \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i.$$
This can be solved by projected gradient descent, which starts from some $\alpha_0 \in C$ (e.g., $0$) and updates via
$$\alpha_{t+1} = \Pi_C\big[\alpha_t - \eta \nabla f(\alpha_t)\big],$$
where $\Pi_C[\alpha]$ is the projection of $\alpha$ onto $C$, defined as the closest point to $\alpha$ in $C$:
$$\Pi_C[\alpha] := \arg\min_{\alpha' \in C}\ \frac{1}{2}\|\alpha' - \alpha\|_2^2.$$
If $C$ is convex, the projection is uniquely defined.
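For concreteness, here is a minimal sketch of this update with the gradient computed explicitly: writing $G_{ij} = y_i y_j K(x_i, x_j)$, the objective is $f(\alpha) = \frac{1}{2}\alpha^{\top} G \alpha - \mathbf{1}^{\top}\alpha$, so $\nabla f(\alpha) = G\alpha - \mathbf{1}$, and (as part (a) below makes precise) the projection is a component-wise clamp. The function name and interface are illustrative only, not the ones required by hw2.py.

import torch

def projected_gd_sketch(K, y, lr=0.1, num_iters=1000, c=None):
    """Illustrative projected gradient descent on f(alpha) = 0.5 a^T G a - sum(a).

    K: (n, n) kernel Gram matrix; y: (n,) labels in {-1, +1};
    c: None for hard margin (C = [0, inf)^n), a float for soft margin (C = [0, c]^n).
    """
    G = torch.outer(y, y) * K              # G_ij = y_i y_j K(x_i, x_j)
    alpha = torch.zeros(y.shape[0])        # alpha_0 = 0 lies in C
    for _ in range(num_iters):
        grad = G @ alpha - 1.0             # gradient of f at alpha_t
        alpha = torch.clamp(alpha - lr * grad, min=0.0, max=c)  # project back onto C
    return alpha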
(a) [hw2] Prove that
$$\big(\Pi_{[0,\infty)^n}[\alpha]\big)_i = \max\{\alpha_i, 0\}, \qquad \text{and} \qquad \big(\Pi_{[0,C]^n}[\alpha]\big)_i = \min\{\max\{0, \alpha_i\}, C\}.$$
Hint: Show that the $i$th component of any other $\alpha' \in C$ is further from the $i$th component of $\alpha$ than the $i$th component of the projection is. Specifically, show that
$$\big|\alpha'_i - \alpha_i\big| \ge \big|\max\{0, \alpha_i\} - \alpha_i\big| \qquad \text{for } \alpha' \in [0, \infty)^n,$$
and that
$$\big|\alpha'_i - \alpha_i\big| \ge \big|\min\big\{\max\{0, \alpha_i\}, C\big\} - \alpha_i\big| \qquad \text{for } \alpha' \in [0, C]^n.$$
(b) [hw2code] Implement an svm_solver(), using projected gradient descent formulated as above. Initialize your α to zeros. See the docstrings in hw2.py for details.
Remark: Consider using the .backward() function in pytorch. However, you may then have to use in-place operations like clamp_(), otherwise the gradient information is destroyed.
Library routines: torch.outer, torch.clamp, torch.autograd.backward, torch.tensor(…, requires_grad=True), with torch.no_grad():, torch.tensor.grad.zero_(), torch.tensor.detach().
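For reference, a minimal autograd-based sketch in the spirit of the remark above; the argument names are assumptions, and the authoritative interface for svm_solver() is the docstring in hw2.py.

import torch

def svm_solver_sketch(x_train, y_train, lr, num_iters, kernel, c=None):
    # Assumed interface (check hw2.py): x_train (n, d), y_train (n,),
    # kernel(A, B) -> Gram matrix, c=None for hard margin.
    K = kernel(x_train, x_train)
    G = torch.outer(y_train, y_train) * K
    alpha = torch.zeros(y_train.shape[0], requires_grad=True)
    for _ in range(num_iters):
        loss = 0.5 * (alpha @ G @ alpha) - alpha.sum()   # dual objective f(alpha)
        loss.backward()                                  # fills alpha.grad
        with torch.no_grad():
            alpha -= lr * alpha.grad                     # gradient step
            alpha.clamp_(min=0.0, max=c)                 # in-place projection onto C
            alpha.grad.zero_()                           # reset accumulated gradients
    return alpha.detach()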
(c) [hw2code] Implement an svm_predictor(), using an optimal dual solution, the training set, and an input. See the docstrings in hw2.py for details.
Library routines: torch.empty.
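As a hedged illustration of what the predictor computes, assuming the bias-free kernel form $x \mapsto \sum_i \alpha_i y_i K(x_i, x)$ matching this assignment's dual (the real signature, and any intended use of torch.empty, are specified in hw2.py):

def svm_predictor_sketch(alpha, x_train, y_train, x_test, kernel):
    # All inputs are torch tensors: alpha, y_train of shape (n,); x_train (n, d); x_test (m, d).
    # kernel(A, B) is assumed to return the (len(A), len(B)) cross-kernel matrix.
    K = kernel(x_train, x_test)
    return (alpha * y_train) @ K       # (m,) decision values: sum_i alpha_i y_i K(x_i, x)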
(d) [hw2] On the area $[-5, 5] \times [-5, 5]$, plot the contour lines of the following kernel SVMs, trained on the XOR data. Different kernels and the XOR data are provided in hw2_utils.py. A learning rate of 0.1 and 10000 steps should be enough. To draw the contour lines, you can use hw2_utils.svm_contour() (a rough sketch of the plotting loop appears after this part).
• The polynomial kernel with degree 2.
• The RBF kernel with $\sigma = 1$.
• The RBF kernel with $\sigma = 2$.
• The RBF kernel with $\sigma = 4$.
Include these four plots in your written submission.
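The loop might look roughly like the sketch below. The hw2_utils helper names and signatures here (xor_data, poly, rbf, and the arguments to svm_contour), as well as the svm_solver / svm_predictor interfaces, are guesses based on the problem statement; check hw2_utils.py and hw2.py for the real ones.

import functools
import hw2_utils
from hw2 import svm_solver, svm_predictor

# All helper names and signatures below are assumed, not confirmed by the handout.
x, y = hw2_utils.xor_data()
kernels = [
    ("poly degree 2", hw2_utils.poly(degree=2)),
    ("rbf sigma 1", hw2_utils.rbf(sigma=1.0)),
    ("rbf sigma 2", hw2_utils.rbf(sigma=2.0)),
    ("rbf sigma 4", hw2_utils.rbf(sigma=4.0)),
]
for name, kernel in kernels:
    alpha = svm_solver(x, y, lr=0.1, num_iters=10000, kernel=kernel)
    predictor = functools.partial(svm_predictor, alpha, x, y, kernel=kernel)
    hw2_utils.svm_contour(predictor, xmin=-5, xmax=5, ymin=-5, ymax=5)  # plot on [-5, 5]^2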
Solution.
3. Convexity of $\ln\sum\exp$.
In this problem, you will analyze a convex approximation of the max function and get familiar with techniques for establishing convexity of a function. Denote the max function $\phi : \mathbb{R}^n \to \mathbb{R}$ and its approximation $\psi : \mathbb{R}^n \to \mathbb{R}$ as
$$\phi(x) := \max(x_1, \ldots, x_n), \qquad \text{and} \qquad \psi(x) := \ln\left(\sum_{i=1}^{n} \exp(x_i)\right).$$
Furthermore, throughout the problem, for any vector $x \in \mathbb{R}^n$, denote $\exp(x) := \big(\exp(x_1), \ldots, \exp(x_n)\big)$.
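As a quick numerical sanity check of these definitions (purely illustrative, not part of any proof below):

import torch

x = torch.tensor([1.0, -2.0, 0.5, 3.0])
phi = x.max()                              # phi(x) = max_i x_i
psi = torch.logsumexp(x, dim=0)            # psi(x) = ln(sum_i exp(x_i))
bound = phi + torch.log(torch.tensor(float(x.numel())))
print(phi.item(), psi.item(), bound.item())  # here phi <= psi <= phi + ln(n)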
(a) [hw2] Prove that
$$\phi(x) \le \psi(x) \le \phi(x) + \ln(n).$$
Hint: Show that $\phi(\exp(x)) \le \sum_{i=1}^{n} \exp(x_i) \le n \cdot \phi(\exp(x))$.
Remark: Part (a) quantifies how well ψ approximates the max function.
(b) [hw2] Use part (a) to show that
$$\lim_{c \to \infty} \frac{\psi(cx)}{c} = \phi(x).$$
(c) [hw2] Prove that the max function ϕ is convex.
(d) [hw2] Compute the Hessian $\nabla^2 \psi$.
(e) [hw2] Define $\lambda_i := \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$ for $i \in [n]$. Rewrite the Hessian in part (d) in terms of $\{\lambda_1, \ldots, \lambda_n\}$.
(f) [hw2] Show that the Hessian $\nabla^2 \psi(x)$ is positive semi-definite for all $x \in \mathbb{R}^n$. From lecture 3, it follows that $\psi$ is convex.
Hint: An equivalent definition of a positive semi-definite matrix $M \in \mathbb{R}^{n \times n}$ is that for any $v \in \mathbb{R}^n$, $v^{\top} M v \ge 0$. Use this definition, part (e), and Jensen's inequality (see the appendix to lecture 2, which will be uploaded and updated before the October 14 lecture).
(g) [hw2] Directly show that for all $\alpha \in [0, 1]$ and for any $x, y \in \mathbb{R}^n$,
$$\psi\big(\alpha x + (1 - \alpha) y\big) \le \alpha \psi(x) + (1 - \alpha) \psi(y).$$
Hint: Fix $x, y \in \mathbb{R}^n$ and denote $a = \exp(x)$ and $b = \exp(y)$. Write $\psi\big(\alpha x + (1 - \alpha) y\big)$ as $\ln \sum_{i=1}^{n} a_i^{\alpha} b_i^{(1 - \alpha)}$ and apply Hölder's inequality (see the appendix to lecture 2 after October 14).
Remark: This gives an alternate proof that $\psi$ is convex. Note that this proof does not use the fact that $\psi$ is twice differentiable.
Solution.
4. LLM Use, collaboration, and other sources.
[hw2] Please document, in detail, all your sources, including LLMs, friends, internet resources, etc. For
example:
1a. I asked my friend, then I found a different way to derive the same solution.
1b. GPT-5 solved the problem in one shot, but then I rewrote it once on paper, and a few days later tried
to re-derive an answer from scratch.
1c. I accidentally found this via a Google search, and had trouble forgetting the answer I found, but still typed it from scratch without copy-paste.
1d. …
Solution.

