Description
Warm-Up: Subgradients & More (15 pts)
1. Recall that a function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^n$ and $\lambda \in [0, 1]$, $\lambda f(x) + (1 - \lambda) f(y) \geq f(\lambda x + (1 - \lambda) y)$. Using this definition, show that
(a) $f(x) = w f_1(x)$ is a convex function for $x \in \mathbb{R}^n$ whenever $f_1 : \mathbb{R}^n \to \mathbb{R}$ is a convex function and $w \geq 0$,
(b) $f(x) = f_1(x) + f_2(x)$ is a convex function for $x \in \mathbb{R}^n$ whenever $f_1 : \mathbb{R}^n \to \mathbb{R}$ and $f_2 : \mathbb{R}^n \to \mathbb{R}$ are convex functions,
(c) $f(x) = \max\{f_1(x), f_2(x)\}$ is a convex function for $x \in \mathbb{R}^n$ whenever $f_1 : \mathbb{R}^n \to \mathbb{R}$ and $f_2 : \mathbb{R}^n \to \mathbb{R}$ are convex functions.
2. Compute a subgradient at the specified points for each of the following convex functions.
(a) $f(x) = \max\{x^2 - 2x, |x|\}$ at $x = 0$ and $x = -2$.
(b) $g(x) = \max\{(x - 1)^2, (x - 2)^2\}$ at $x = 1.5$ and $x = 0$.
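As a refresher for part 2, recall the standard fact that at a point where several pieces of a pointwise maximum are active, any convex combination of the active pieces' (sub)gradients is a subgradient. For example, for $h(x) = |x| = \max\{x, -x\}$ at $x = 0$, both pieces are active, so

$$\partial h(0) = \{\lambda \cdot 1 + (1 - \lambda) \cdot (-1) : \lambda \in [0, 1]\} = [-1, 1],$$

and any $g \in [-1, 1]$ is a valid subgradient there.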
Problem 1: Perceptron Learning (30 pts)
Consider the data set (perceptron.data) attached to this homework. This data file consists of $M$ data elements of the form $(x^{(m)}_1, x^{(m)}_2, x^{(m)}_3, x^{(m)}_4, y^{(m)})$, where $x^{(m)}_1, \ldots, x^{(m)}_4 \in \mathbb{R}$ define a data point in $\mathbb{R}^4$ and $y^{(m)} \in \{-1, 1\}$ is the corresponding class label. In class, we saw how to use the perceptron algorithm to minimize the following objective.
$$\sum_{m=1}^{M} \max\left\{0,\; -y^{(m)} \cdot \left(w^T x^{(m)} + b\right)\right\}$$
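For reference when implementing, one valid subgradient of the $m$th summand with respect to $(w, b)$ is the standard perceptron choice

$$g^{(m)}(w, b) = \begin{cases} (\mathbf{0},\, 0) & \text{if } y^{(m)}\left(w^T x^{(m)} + b\right) > 0, \\ \left(-y^{(m)} x^{(m)},\, -y^{(m)}\right) & \text{otherwise}; \end{cases}$$

this is only one valid selection from the subdifferential, but it is the one that recovers the familiar perceptron update on misclassified points.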
In this problem, you are to implement the perceptron algorithm using two different subgradient descent strategies. For each strategy below, report the number of iterations that it takes to find a perfect classifier for the data, the values of $w$ and $b$ for the first three iterations, and the final weights and bias. Each descent procedure should start from the initial point
$$w^0 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \qquad b^0 = 0.$$
1. Standard subgradient descent with the step size $\gamma_t = 1$ for each iteration.
2. Stochastic subgradient descent where exactly one component of the sum is chosen to approximate the gradient at each iteration. Instead of picking a random component at each iteration, you should iterate through the data set starting with the first element, then the second, and so on until the $M$th element, at which point you should start back at the beginning again. Again, use the step size $\gamma_t = 1$. (A minimal implementation sketch of both strategies appears after this list.)
3. How does the rate of convergence change as you change the step size? Provide some example
step sizes to back up your statements.
4. What is the smallest, in terms of number of data points, two-dimensional data set containing
both class labels on which the algorithm, with step size one, fails to converge? Use this
example to explain why the method may fail to converge more generally.
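The following is a minimal sketch of both descent strategies. The file format (comma-separated, label last), function names, and iteration caps are assumptions for illustration, not part of the assignment:

```python
import numpy as np

# Assumes perceptron.data is comma-separated: four features, then the label in {-1, 1}.
data = np.loadtxt("perceptron.data", delimiter=",")
X, y = data[:, :4], data[:, 4]

def batch_subgradient_descent(X, y, step=1.0, max_iters=100000):
    """Full-batch subgradient descent on sum_m max{0, -y_m (w^T x_m + b)}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for t in range(max_iters):
        margins = y * (X @ w + b)
        active = margins <= 0                  # summands with a nonzero subgradient
        if not active.any():
            return w, b, t                     # perfect classifier found after t iterations
        # Descent step using the subgradient sum over active m of (-y_m x_m, -y_m).
        w += step * (y[active, None] * X[active]).sum(axis=0)
        b += step * y[active].sum()
    return w, b, max_iters

def stochastic_subgradient_descent(X, y, step=1.0, max_updates=1000000):
    """Cyclic stochastic subgradient descent: one data point per iteration, in order."""
    w, b = np.zeros(X.shape[1]), 0.0
    M = len(y)
    for t in range(max_updates):
        m = t % M
        if y[m] * (X[m] @ w + b) <= 0:         # point m contributes a nonzero subgradient
            w += step * y[m] * X[m]
            b += step * y[m]
        if (y * (X @ w + b) > 0).all():        # stop once the data are perfectly classified
            return w, b, t + 1
    return w, b, max_updates

w1, b1, iters1 = batch_subgradient_descent(X, y)
w2, b2, iters2 = stochastic_subgradient_descent(X, y)
```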
Problem 2: Separability & Feature Vectors (15 pts)
1. Consider the following data set.
[Figure: scatter plot of the labeled data set in the $(x_1, x_2)$-plane, with both axes ranging from $-2$ to $2$.]
Under which of the following feature vectors is the data linearly separable? For full credit,
you must justify your answer by either providing a linear separator or explaining why such a
separator does not exist.
(a) $\phi(x_1, x_2) = \begin{bmatrix} x_1 + x_2 \\ x_1 - 2x_2 \end{bmatrix}$
(b) $\phi(x_1, x_2) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ x_1 x_2 \end{bmatrix}$
(c) $\phi(x_1, x_2) = \begin{bmatrix} \exp(x_1) \\ \exp(x_2) \end{bmatrix}$
(d) $\phi(x_1, x_2) = \begin{bmatrix} x_1 \cdot \sin(x_2) \\ x_1^2 \end{bmatrix}$
2. Suppose that you wanted to perform polynomial regression for 2-dimensional data points using gradient descent, i.e., you want to fit a polynomial of degree $k$ to your data. Explain how to do this using feature vectors. What is the per-iteration complexity of gradient descent as a function of the size of your feature representation and the number of training data points? (A feature-construction sketch follows below.)
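One way to build the monomial feature map for part 2 is sketched below; the helper name and looping scheme are illustrative, not prescribed by the problem:

```python
import numpy as np

def poly_features(x1, x2, k):
    """All monomials x1^i * x2^j with i + j <= k, for a single 2-D point.

    The feature vector has (k + 1)(k + 2) / 2 entries, which is what drives
    the per-iteration cost of gradient descent on the induced linear model.
    """
    return np.array([x1**i * x2**j
                     for i in range(k + 1)
                     for j in range(k + 1 - i)])

# Example: degree-2 features of the point (1.0, 2.0) are
# [1, 2, 4, 1, 2, 1], corresponding to 1, x2, x2^2, x1, x1*x2, x1^2.
print(poly_features(1.0, 2.0, 2))
```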
Problem 3: Exponential Regression (15 pts)
Suppose that you wanted to fit a function of the form $f(x) = \exp(ax + b)$ to a collection of data points $(x^{(1)}, y^{(1)}), \ldots, (x^{(M)}, y^{(M)}) \in \mathbb{R}^2$, where $a, b \in \mathbb{R}$.
1. Formulate the regression problem as a minimization of the squared error over a hypothesis
space consisting of exponential functions of the above form.
2. Explain how to use gradient descent in an effort to minimize this loss function. Your explanation should include the computed gradients. (A gradient-descent sketch appears after this list.)
3. Is the optimization problem that you produced convex? Explain.
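A minimal sketch of one such gradient descent, assuming the squared-error loss $L(a, b) = \sum_m \left(\exp(a x^{(m)} + b) - y^{(m)}\right)^2$; the step size and iteration count are illustrative:

```python
import numpy as np

def fit_exponential(x, y, step=1e-3, iters=5000):
    """Gradient descent on L(a, b) = sum_m (exp(a*x_m + b) - y_m)^2."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        f = np.exp(a * x + b)              # model predictions
        r = f - y                          # residuals
        # dL/da = sum_m 2 * r_m * f_m * x_m ;  dL/db = sum_m 2 * r_m * f_m
        grad_a = 2.0 * np.sum(r * f * x)
        grad_b = 2.0 * np.sum(r * f)
        a -= step * grad_a
        b -= step * grad_b
    return a, b
```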
Problem 4: Support Vector Machines (25 pts)
For this problem, consider the data set (mystery.data) attached to this homework that, like the data set in Problem 1, contains four numeric attributes per row; the fifth entry is the class variable (either $+$ or $-$).
Find a perfect classifier for this data set using support vector machines. Your solution should
explain the optimization problem that you solved and provide the learned parameters, the optimal
margin, and the support vectors.
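A minimal sketch of the hard-margin formulation using the cvxpy library follows; any QP solver would do, and the file-parsing details are assumptions about the format of mystery.data:

```python
import cvxpy as cp
import numpy as np

# Assumed format: four comma-separated features, then "+" or "-" as the label.
rows = [line.strip().split(",") for line in open("mystery.data")]
X = np.array([[float(v) for v in r[:4]] for r in rows])
y = np.array([1.0 if r[4].strip().startswith("+") else -1.0 for r in rows])

# Hard-margin SVM: minimize ||w||^2 subject to y_m (w^T x_m + b) >= 1 for all m.
w = cp.Variable(4)
b = cp.Variable()
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()

w_opt, b_opt = w.value, b.value
margin = 1.0 / np.linalg.norm(w_opt)        # geometric margin of the separator
# Support vectors: points whose constraint is active, up to solver tolerance.
support = np.where(np.abs(y * (X @ w_opt + b_opt) - 1) <= 1e-4)[0]
print(w_opt, b_opt, margin, support)
```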