Description
Problem 1 (10 points). Recall that a linear classifier is specified by $(\vec{w}, t)$, where $\vec{w} \in \mathbb{R}^d$ and $t \in \mathbb{R}$, and a decision rule $\vec{w} \cdot \vec{X} \gtrless t$ for a feature vector $\vec{X} \in \mathbb{R}^d$.
Recall the perceptron algorithm from the lectures. Suppose $d = 2$, $n = 4$, and there are four training examples, given by:

  $i$   feature $\vec{X}_i$   label $y_i$
  1     $(-0.6, 1)$           $-1$
  2     $(-3, -4)$            $-1$
  3     $(3, -2)$             $+1$
  4     $(0.5, 1)$            $+1$
1. In class we started with the initial $(\vec{w}, t) = (\vec{0}, 0)$ and derived the convergence of perceptron. In this problem:
• Start with the initialization $(\vec{w}, t) = ((0, 1), 0)$ (the x-axis).
• Implement the perceptron algorithm by hand. Go over the data points in order. Output a table of the form below, where each row works with one example in some iteration. We have filled in some entries in the first two rows. You need to add rows until no mistakes happen on any example.
  starting $\vec{w}$   starting $t$   features $\vec{X}_i$   label $y_i$   predicted label   new $\vec{w}$   new $t$
  $(0, 1)$             $0$            $(-0.6, 1)$            $-1$          ...               ...             ...
  ...                  ...            $(-3, -4)$             $-1$          ...               ...             ...
  ...                  ...            ...                    ...           ...               ...             ...
• Draw a 2D grid. On this grid, mark the four examples (as we do on the board). Draw the line you obtain as the final result.
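The by-hand computation can be cross-checked in code. Below is a minimal sketch that assumes the standard perceptron update from lecture: predict $\mathrm{sign}(\vec{w} \cdot \vec{x} - t)$, and on a mistake set $\vec{w} \leftarrow \vec{w} + y\vec{x}$ and $t \leftarrow t - y$. The tie-breaking rule (predicting $+1$ on the boundary) is a choice made here, not specified in the problem.

```python
# Perceptron sketch for the four training examples above.
# Assumes the update rule: on a mistake, w += y * x and t -= y.
X = [(-0.6, 1.0), (-3.0, -4.0), (3.0, -2.0), (0.5, 1.0)]
y = [-1, -1, +1, +1]

w, t = [0.0, 1.0], 0.0  # initialization from the problem: w = (0, 1), t = 0

def predict(w, t, x):
    # Decision rule: +1 if w . x - t >= 0, else -1 (boundary -> +1 is a choice)
    return 1 if w[0] * x[0] + w[1] * x[1] - t >= 0 else -1

updates = 0
while True:
    mistakes = 0
    for xi, yi in zip(X, y):          # go over the data points in order
        if predict(w, t, xi) != yi:   # mistake: apply the perceptron update
            w = [w[0] + yi * xi[0], w[1] + yi * xi[1]]
            t = t - yi
            mistakes += 1
            updates += 1
    if mistakes == 0:                 # a full pass with no mistakes: done
        break

print("final w:", w, "final t:", t, "updates:", updates)
```

Each mistake in the loop corresponds to one row of the table above, so printing $(\vec{w}, t)$ before and after each update reproduces the table entries.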
Problem 2 (10 points). Recall the log-likelihood function for logistic regression:
$$J(\vec{w}, t) = \sum_{i=1}^{n} \log \Pr\left(y_i \mid \vec{X}_i; \vec{w}, t\right),$$
where $\Pr(y_i \mid \vec{X}_i; \vec{w}, t)$ is the same as defined in class for logistic regression, and $y_i \in \{-1, +1\}$. We will show that this function is concave as a function of $(\vec{w}, t)$.
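To make the object concrete, here is a sketch of $J(\vec{w}, t)$ in code. It assumes the common logistic model $\Pr(y \mid \vec{X}; \vec{w}, t) = 1/(1 + \exp(-y(\vec{w} \cdot \vec{X} - t)))$ for $y \in \{-1, +1\}$; the exact form defined in your class may differ, so treat this as an illustration rather than the definition.

```python
import math

# Sketch of the log-likelihood J(w, t), assuming the common logistic model
# Pr(y | X; w, t) = 1 / (1 + exp(-y * (w . X - t))) for y in {-1, +1}.
# (This model form is an assumption; your class definition may differ.)
def log_likelihood(w, t, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) - t)
        total += -math.log(1.0 + math.exp(-margin))  # log Pr(y_i | X_i; w, t)
    return total

# Example data (the four points from Problem 1)
X = [(-0.6, 1.0), (-3.0, -4.0), (3.0, -2.0), (0.5, 1.0)]
y = [-1, -1, +1, +1]
print(log_likelihood((1.0, 0.0), 0.0, X, y))
```

Each term is $\log$ of a probability, so $J$ is always negative, and the midpoint inequality from part 3 below can be spot-checked numerically at any pair of parameter settings.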
1. Show that for two real numbers $a, b$: $\exp(a) + \exp(b) \ge 2 \exp((a+b)/2)$. (Basically, this says that the exponential function is convex.)
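One standard route for this part (a hint, not the required method) is the AM–GM inequality $u + v \ge 2\sqrt{uv}$ for $u, v \ge 0$, applied with $u = \exp(a)$ and $v = \exp(b)$:

```latex
\exp(a) + \exp(b)
  \;\ge\; 2\sqrt{\exp(a)\exp(b)}   % AM--GM with u = \exp(a),\ v = \exp(b)
  \;=\;   2\sqrt{\exp(a+b)}
  \;=\;   2\exp\!\left(\tfrac{a+b}{2}\right).
```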
2. Extending this, show that for any vectors $\vec{w}_1, \vec{w}_2, \vec{x} \in \mathbb{R}^d$,
$$\exp(\vec{w}_1 \cdot \vec{x}) + \exp(\vec{w}_2 \cdot \vec{x}) \ge 2 \exp\left(\frac{\vec{w}_1 + \vec{w}_2}{2} \cdot \vec{x}\right).$$
3. Show that $J(\vec{w}, t)$ is concave (it suffices to show that concavity holds for $\lambda = 1/2$). You can show it any way you want. One way is to first show that for any $\vec{w}_1, \vec{w}_2 \in \mathbb{R}^d$ and $t_1, t_2 \in \mathbb{R}$,
$$\frac{1}{2} J(\vec{w}_1, t_1) + \frac{1}{2} J(\vec{w}_2, t_2) \le J\left(\frac{\vec{w}_1 + \vec{w}_2}{2}, \frac{t_1 + t_2}{2}\right).$$
You can use the fact that a sum of concave functions is concave. A linear function is both concave and convex.
In this problem you can also work with the vector $\vec{w}^*$, which is the $(d+1)$-dimensional vector $(\vec{w}, t)$, and the $(d+1)$-dimensional feature vectors $\vec{X}^*_i = (\vec{X}_i, -1)$, which are the features appended with a $-1$. Just like in class, this can simplify some of the computations by using the fact that
$$\vec{w} \cdot \vec{X}_i - t = \vec{w}^* \cdot \vec{X}^*_i.$$
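The augmentation identity is easy to sanity-check numerically. A minimal sketch (the particular $\vec{w}$ and $t$ values are arbitrary illustrations):

```python
# Check that w . X_i - t equals w* . X*_i for the augmented vectors,
# where w* = (w, t) and X*_i = (X_i, -1).
X = [(-0.6, 1.0), (-3.0, -4.0), (3.0, -2.0), (0.5, 1.0)]
w, t = (2.0, -1.0), 0.5           # arbitrary example values

w_star = (*w, t)                  # (w, t): a (d+1)-dimensional vector
for xi in X:
    x_star = (*xi, -1.0)          # (X_i, -1): features appended with -1
    lhs = sum(a * b for a, b in zip(w, xi)) - t
    rhs = sum(a * b for a, b in zip(w_star, x_star))
    assert abs(lhs - rhs) < 1e-12
print("augmented form matches on all points")
```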
Problem 3 (15 points). Consider the same setup as in perceptron, where the features all satisfy $\|\vec{X}_i\| \le 1$ and the training examples are separable with margin $\gamma$. We showed in class that perceptron converges with at most $4/\gamma^2$ updates when initialized to the all-zero vectors, namely to $(\vec{0}, 0)$. Suppose instead we start with some initial $(\vec{w}_0, t_0)$ with $\|\vec{w}_0\| \le R$ and $|t_0| \le R$, and run the perceptron algorithm with this initialization. Suppose $(\vec{w}_j, t_j)$ is the hyperplane after the $j$th update, and let $(\vec{w}_{\mathrm{opt}}, t_{\mathrm{opt}})$ be the optimal hyperplane. Similar to what we did in class, assume that the $(d+1)$-dimensional vector $(\vec{w}_{\mathrm{opt}}, t_{\mathrm{opt}})$ satisfies $\|(\vec{w}_{\mathrm{opt}}, t_{\mathrm{opt}})\|^2 \le 2$.
1. Show that
$$(\vec{w}_j, t_j) \cdot (\vec{w}_{\mathrm{opt}}, t_{\mathrm{opt}}) \ge j\gamma - 2R.$$
2. Show that
$$(\vec{w}_j, t_j) \cdot (\vec{w}_j, t_j) \le 2j + 2R^2.$$
3. Using these, conclude that the number of updates before perceptron converges is at most
$$\frac{4 + 4R\gamma}{\gamma^2}$$
updates.
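To build intuition for this problem, one can run perceptron from a nonzero start and observe that it still terminates after finitely many updates. The sketch below reuses the four points from Problem 1 with an arbitrary nonzero initialization; it only demonstrates termination from a nonzero start, not the exact bound (which would require computing $\gamma$ for the optimal separator).

```python
# Perceptron with a nonzero initialization (dataset from Problem 1).
# Demonstrates termination from a nonzero start; does not verify the bound.
X = [(-0.6, 1.0), (-3.0, -4.0), (3.0, -2.0), (0.5, 1.0)]
y = [-1, -1, +1, +1]

w, t = [5.0, -5.0], 3.0   # arbitrary nonzero initialization (the problem's R-bounded start)

updates = 0
converged = False
for _ in range(10_000):   # generous cap; the bound guarantees finitely many updates
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w[0] * xi[0] + w[1] * xi[1] - t) <= 0:  # mistake (boundary counts)
            w[0] += yi * xi[0]
            w[1] += yi * xi[1]
            t -= yi
            mistakes += 1
            updates += 1
    if mistakes == 0:     # a full clean pass: converged
        converged = True
        break

print("updates:", updates, "final:", w, t)
```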
Problem 4 (25 points). See the attached notebook for details.