1 Perceptron [10 pts]
Design (specify θ for) a two-input perceptron (with an additional bias or offset term) that computes
the following boolean functions. Assume T = 1 and F = −1. If a valid perceptron exists, show
that it is not unique by designing another valid perceptron (with a genuinely different hyperplane, not simply a rescaling of the first). If no perceptron exists, state why.
(a) AND (b) XOR
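Once you have a candidate set of weights and a bias, a quick sanity check is to evaluate the perceptron on all four input combinations. Below is a minimal Python sketch of such a check; the sign convention (output +1 when the activation is nonnegative) and the example weights are assumptions for illustration only, not a proposed answer.

import numpy as np

# Helper to check whether a candidate perceptron (weights theta, bias)
# reproduces a target boolean function on all four {-1, +1}^2 inputs.
# The candidate weights used at the bottom are an arbitrary guess.

def perceptron_output(theta, bias, x):
    # Convention assumed here: output +1 when theta^T x + bias >= 0, else -1.
    return 1 if np.dot(theta, x) + bias >= 0 else -1

def implements(theta, bias, target_fn):
    inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    return all(perceptron_output(theta, bias, np.array(x)) == target_fn(*x)
               for x in inputs)

def AND(a, b):
    return 1 if (a == 1 and b == 1) else -1

print(implements(np.array([0.5, -0.3]), 0.1, AND))   # False for this arbitrary guess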
2 Logistic Regression [10 pts]
Consider the objective function that we minimize in logistic regression:
$$J(\theta) = -\sum_{n=1}^{N} \Big[ y_n \log h_\theta(x_n) + (1 - y_n) \log\big(1 - h_\theta(x_n)\big) \Big]$$

(a) Find the partial derivatives $\frac{\partial J}{\partial \theta_j}$.
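One way to validate whatever expression you derive for $\frac{\partial J}{\partial \theta_j}$ is to compare it against a finite-difference estimate of the gradient. The sketch below assumes $h_\theta(x) = \sigma(\theta^T x)$, the usual logistic hypothesis, and uses a made-up dataset; both are illustrative choices. Compare its output with your analytic formula evaluated at the same point.

import numpy as np

# Sketch: finite-difference estimate of dJ/d(theta_j) for the logistic
# regression objective, to check an analytic derivation numerically.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def numerical_grad(theta, X, y, eps=1e-6):
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (J(theta + e, X, y) - J(theta - e, X, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))           # 20 synthetic examples, 3 features
y = (rng.random(20) > 0.5).astype(float)
theta = rng.normal(size=3)
print(numerical_grad(theta, X, y))     # compare with your analytic result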
3 Locally Weighted Linear Regression [10 pts]
Consider a linear regression problem in which we want to “weight” different training instances
differently because some of the instances are more important than others. Specifically, suppose we
want to minimize
$$J(\theta_0, \theta_1) = \sum_{n=1}^{N} w_n \left( \theta_0 + \theta_1 x_{n,1} - y_n \right)^2 .$$
Here $w_n > 0$. In class, we worked out what happens for the case where all the weights (the $w_n$'s) are the same. In this problem, we will generalize some of those ideas to the weighted setting.
(a) Calculate the gradient by computing the partial derivatives of $J$ with respect to each of the parameters $(\theta_0, \theta_1)$.
(b) Set each partial derivative to 0 and solve for $\theta_0$ and $\theta_1$ to obtain values of $(\theta_0, \theta_1)$ that minimize $J$.
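A numerical cross-check for part (b): the weighted normal equations can be solved directly on a synthetic dataset and the result compared with the closed-form $(\theta_0, \theta_1)$ you derive. The data, the weights, and the use of NumPy below are illustrative assumptions, not part of the problem statement.

import numpy as np

# Sketch: solve the weighted least-squares problem numerically via the
# weighted normal equations, as a sanity check on a hand-derived formula.

rng = np.random.default_rng(1)
x = rng.normal(size=10)                       # x_{n,1}
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=10)
w = rng.uniform(0.5, 2.0, size=10)            # per-example weights w_n > 0

A = np.column_stack([np.ones_like(x), x])     # design matrix [1, x_{n,1}]
W = np.diag(w)
theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
print(theta)   # (theta_0, theta_1); compare against your derived expressions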
4 Understanding Linear Separability [25 pts]
Definition 1 (Linear Program) A linear program can be stated as follows:
Let $A$ be an $m \times n$ real-valued matrix, $\vec{b} \in \mathbb{R}^m$, and $\vec{c} \in \mathbb{R}^n$. Find a $\vec{t} \in \mathbb{R}^n$ that minimizes the linear function
$$z(\vec{t}) = \vec{c}^{\,T}\vec{t}$$
subject to $A\vec{t} \geq \vec{b}$.
In linear programming terminology, $\vec{c}$ is often referred to as the cost vector and $z(\vec{t})$ is referred to as the objective function.¹ We can use this framework to define the problem of learning a linear discriminant function.²
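For concreteness, here is one way an LP in the form above can be solved in code. The assignment mentions Matlab; the sketch below uses SciPy's linprog instead, with made-up values for $A$, $\vec{b}$, and $\vec{c}$, and converts the "$\geq$" constraints into the "$\leq$" form that linprog expects.

import numpy as np
from scipy.optimize import linprog

# Sketch: solve  min c^T t  subject to  A t >= b  with SciPy.
# linprog takes "<=" constraints, so A t >= b is passed as -A t <= -b.
# linprog also assumes t >= 0 by default, so we free the variables.

A = np.array([[1.0, 1.0],
              [1.0, -1.0]])
b = np.array([1.0, 0.0])
c = np.array([1.0, 1.0])

res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2, method="highs")
print(res.x, res.fun)   # optimal t and objective value for this toy LP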
The Learning Problem:³ Let $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_m$ represent $m$ samples, where each sample $\vec{x}_i \in \mathbb{R}^n$ is an $n$-dimensional vector, and $\vec{y} \in \{-1, 1\}^m$ is an $m \times 1$ vector representing the respective labels of the $m$ samples. Let $\vec{w} \in \mathbb{R}^n$ be an $n \times 1$ vector representing the weights of the linear discriminant function, and let $\theta$ be the threshold value.
We predict $\vec{x}_i$ to be a positive example if $\vec{w}^T\vec{x}_i + \theta \geq 0$. On the other hand, we predict $\vec{x}_i$ to be a negative example if $\vec{w}^T\vec{x}_i + \theta < 0$.
We hope that the learned linear function can separate the data set. That is,
$$y_i = \begin{cases} 1 & \text{if } \vec{w}^T\vec{x}_i + \theta \geq 0 \\ -1 & \text{if } \vec{w}^T\vec{x}_i + \theta < 0 \end{cases} \tag{1}$$
In order to find a good linear separator, we propose the following linear program:
$$\begin{aligned} \min\quad & \delta \\ \text{subject to}\quad & y_i(\vec{w}^T\vec{x}_i + \theta) \geq 1 - \delta, \quad \forall (\vec{x}_i, y_i) \in D \\ & \delta \geq 0 \end{aligned} \tag{2}$$
where D is the data set of all training examples.
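As a sanity check on formulation (2), it can be solved directly with an off-the-shelf LP solver. The sketch below stacks the decision variables as $t = [\vec{w};\ \theta;\ \delta]$ and uses SciPy's linprog on a made-up, linearly separable 2-D toy dataset; the course will use Matlab, so this is only an equivalent illustration.

import numpy as np
from scipy.optimize import linprog

# Sketch: formulation (2) as a standard LP. Variables: [w_1..w_n, theta, delta].
# Constraint y_i (w^T x_i + theta) >= 1 - delta is rewritten as
# -y_i x_i^T w - y_i theta - delta <= -1 for linprog's "<=" convention.

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

A_ub = np.hstack([-y[:, None] * X, -y[:, None], -np.ones((m, 1))])
b_ub = -np.ones(m)

c = np.zeros(n + 2)
c[-1] = 1.0                                      # objective: minimize delta
bounds = [(None, None)] * (n + 1) + [(0, None)]  # w, theta free; delta >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w, theta, delta = res.x[:n], res.x[n], res.x[n + 1]
print(w, theta, delta)   # delta == 0 indicates this toy set is separable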
¹ Note that you don't really need to know how to solve a linear program since we will use Matlab to obtain the solution (although you may wish to brush up on Linear Algebra). Also see the appendix for more information on linear programming.
² This discussion closely parallels the linear programming representation found in Pattern Classification, by Duda, Hart, and Stork.
³ Note that the notation used in the Learning Problem is unrelated to the one used in the Linear Program definition. You may want to figure out the correspondence.
(a) A data set $D = \{(\vec{x}_i, y_i)\}_{i=1}^{m}$ that satisfies condition (1) above is called linearly separable. Show that if $D$ is linearly separable, there is an optimal solution to the linear program (2) with $\delta = 0$.
(b) Now show the converse: if there is an optimal solution with $\delta = 0$, then $D$ is linearly separable.
(c) What can we say about the linear separability of the data set if there exists a hyperplane that satisfies the constraints of (2) with $\delta > 0$?
(d) An alternative LP formulation to (2) may be
$$\begin{aligned} \min\quad & \delta \\ \text{subject to}\quad & y_i(\vec{w}^T\vec{x}_i + \theta) \geq -\delta, \quad \forall (\vec{x}_i, y_i) \in D \\ & \delta \geq 0 \end{aligned}$$
Find the optimal solution to this formulation (independent of $D$) to illustrate the issue with such a formulation.
(e) Let $\vec{x}_1 \in \mathbb{R}^3$ with $\vec{x}_1^{\,T} = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}$ and $y_1 = 1$. Let $\vec{x}_2 \in \mathbb{R}^3$ with $\vec{x}_2^{\,T} = \begin{bmatrix} -1 & -1 & -1 \end{bmatrix}$ and $y_2 = -1$. The data set $D$ is defined as
$$D = \{(\vec{x}_1, y_1), (\vec{x}_2, y_2)\}.$$
Consider the formulation in (2) applied to $D$. What are possible optimal solutions?