Description
Debugging tips
In order to verify that your learners are working correctly, you can compare your results with those from a
library such as scikit-learn1
. For a quick debugging check, running the main method in the skeleton file
should give you performance results for a simple debugging dataset that should have relatively high accuracy
for both k-NN (0.05-0.1 loss) and LogisticRegression (0.85 score). LogisticRegression should converge after
44389 iterations for the given parameters.
k-Nearest Neighbors
One of the simplest classifiers we’ve discussed also happens to be one of the most powerful, and surprisingly
straightforward to implement. The short description in words is: “return the class label of the most common
class amongst the k nearest training points,” but for the sake of completeness, here’s a formal description:
1https://scikit-learn.org/
1
Data: query point x, distance function d(x, x
0
), number of neighbors k
Result: estimate of the label for x, ˆy
Compute the distances between x and all of the training points x
(i)
Find the k closest points, {x
(n1)
, x
(n2)
, . . . , x
(nk)} (l < m =⇒ d(x, x
(nl)
) ≤ d(x, x
(nm)
))
Return the most frequent class in {y
(n1)
, y(n1)
, . . . , y(nk)}
In practice, efficient methods exist (k − d trees, locality-sensitive hashing, etc) for doing quick (O(log N)
vs O(N)) lookup that work by building data structures that take advantage of the properties of distance
functions, but for this assignment we’ll stick with the brute-force approach. Implement the algorithm above
in the method KNN.query single pt.
Distance functions
There are three distance functions you should implement in hw2.py: KNN.euclid, KNN.manhattan, and
KNN.mahalanobis, and their definitions are
Euclidean distance
d(x, x
0
) =
vuutX
D
j=1
(xj − x
0
j
)
2 (1)
Manhattan distance
d(x, x
0
) = X
D
j=1
|xj − x
0
j
| (2)
Mahalanobis distance
d(x, x
0
) = q
(x˜ − x˜
0)>Σ−1(x˜ − x˜
0) (3)
x˜ = (x − x¯), Σ = diag(σ
2
1
, σ2
2
, . . . , σ2
D)
Where x¯ is the average of all the training points, and σj is the standard deviation of the jth component.
Note that what we’re calling the Mahalanobis distance is actually a form of centering and re-scaling2
, which
is still useful for making sure that all of our features are comparable at the same scale.
You can implement
this either with the linear algebra representation shown above, or by directly subtracting out the mean and
rescaling each component iteratively. Implement each of these distance functions using the equations above.
Since this computation is the bottleneck for k-NN, you may wish to spend some time trying to optimize
these functions, particularly by making use of NumPy’s ability to do quick vector operations.
Choosing k
While the choice of distance function definitely has an impact on the complexity of the decision boundary,
the relationship is problem-dependent and not always entirely clear.
The choice of k, however, has a direct
connection with complexity: lower k means increased complexity. We already know that increasing complexity reduces bias at the expense of increasing variance, so we should think carefully about how to pick
the best value of k for a given problem.
One way to do this is to look at how error on the training data and error on some (withheld) testing
data changes as a function of k.
The provided code already has functions which will
1. Return a set of testing and training data from a given file
2. Compute the misclassification rate for a given value of k on a set of testing data
2The true Mahalanobis distance requires the covariance matrix of the underlying distribution of the data.
3. Compute the misclassification rate for a given value of k on the training data
You will need to use these methods to explore a reasonable range of values for k for the provided datasets.
The format of their return values should make it easy to plot misclassification error as a function of k, but
you will have to write the code to do this yourself.
Important note: these types of experiments can take a long time to run, so don’t leave this till the
last minute! If even after you’ve optimized your implementation you’re still having trouble running these
experiments in a reasonable amount of time, you can modify the provided code to reduce the size of the
datasets, but you must make a note of this both in your code and in your writeup, and provide an explanation
of how you reduced the data and why what you did makes sense.
Logistic Regression
The Logistic Regression classifier directly models the probability that a given point has a given label. Though
versions exist which model more than two labels, we will focus on a restriction of the problem to only binary
labels, and use what’s known as a “one versus rest” approach to handle problems with multiple classes.
The
form that the model takes is
p(y = 1 | x) = hθ(x) = σ(θ
>x) =
1 + exp(−θ
>x)
−1
(4)
Fill in the method LogisticRegression.prob single pt which should implement the above equation.
Note that the provided sigmoid function has built in checks to make sure the output never reaches 0 or 1,
which can cause issues with some of the computations performed later. If you see NaN’s or inf’s show up in
your answers, make sure you’re not encountering numerical overflow or underflow.
We discussed in class some of the nice properties that the sigmoid function (σ(z)) has, chief among them
being the fact that the gradient is especially easy to compute. The gradient of the negative log likelihood is
given as
∇θNLL(hθ; S) = X
N
i=1
h
hθ(x
(i)
) − y
(i)
i
x
(i)
(5)
= X>(hθ − Y ) (6)
where X is the N × (D + 1) matrix of (1’s augmented) training data points, hθ is the column vector of
predicted outputs for each of the training points, and Y is the column vector of 1’s and 0’s such that each
entry is 1 if the label matches the positive class and 0 otherwise.
Given this gradient, we can compute the optimal parameters by using gradient descent. The function for
gradient descent has been provided to you in the template code, the only thing that you need to do is fill in
LogisticRegression.NLL gradient so that it returns the actual gradient for the current value of the parameters in self.Theta.
You may want to look at how the function LogisiticRegression.negative log likelihood
is implemented for ideas on how to use fast NumPy vector operations to compute the gradient.
Once you have these two functions implemented, you should be able to compare the performance of your
logistic regression model with k-NN. For single-class problems, you can threshold the probability at 0.5 and
use the resulting binary labels to compute the misclassification rate.
For multi-class problems, you should
fit one learner per class: you can specify the positive class label in the constructor to LogisticRegression,
and when training this should treat any class label that’s not the positive class label as a negative class label.
Then, when testing, pick the positive class label of the logistic regression model with the highest probability.
This is implemented for you in the multiclass logistic score function.
Support Vector Machines
Support Vector Machines are another linear classifier, but unlike logistic regression, they are built on the
“maximum margin” principle. The classifier itself has a similar form to logistic regression, but replacing the
sigmoid function with a hard threshold
hw,w0
(x) = sgn(w>x + w0) (7)
Replacing the sigmoid function with a hard threshold makes it so we can’t use the same nice gradient
descent we’ve been using all along, but it turns out there is another (quadratic, constrained) optimization
problem we can solve to find the best weights:
α = arg max
α
X
N
i=1
α
(i) −
1
2
X
N
i=1
X
N
j=1
α
(i)α
(j)
y
(i)
y
(j)x
(i)>x
(j)
(8)
Subject to
X
N
i=1
α
(i)
y
(i) = 0, α(i) ≥ 0, ∀α
(i)
where the original parameters can be recovered as
w =
X
N
i=1
α
(i)
y
(i)x
(i)
(9)
While the equations look complex, they’re describing a very simple geometric picture: the best line for
separating the data is the line which maximizes the distance between itself and the data. That way, any
noise in the training data is less likely to change the predicted class label. In addition to this nice geometric
interpretation, SVMs also allow us to easily perform something akin to basis function expansion but for
classification instead of regression. We achieve this by replacing the inner product x
>x
0 with a function k
such that
k(x, x
0
) = hφ(x), φ(x
0
)i = φ(x)
>φ(x
0
)
For example, if we wanted to compute the linear SVM for 1D data that had been transformed so that
φ(x) = [x
2
1
,
√
2×1, 1]> we could use the following kernel
k(x, x
0
) = φ(x)
>φ(x
0
)
= (x1x
0
1
)
2 + 2(x1x
0
1
) + 1
= (x1x
0
1 + 1)2
For x in two dimensions, including all the cross terms gives us φ(x) = [x
2
1
, x2
2
,
√
2x1x2,
√
2×1,
√
2×2, 1]>,
which is in 6 dimensions, but the kernel function still operates in 2:
k(x, x
0
) = φ(x)
>φ(x
0
)
= (x1)
2
(x
0
1
)
2 + (x2)
2
(x
0
2
)
2 + 2x1x2x
0
1x
0
2 + 2x1x
0
1 + 2x2x
0
2 + 1
= (x1x
0
1 + x2x
0
2 + 1)(x1x
0
1 + x2x
0
2 + 1)
= (x
>x
0 + 1)2
This shows that instead of computing a potentially high-dimensional transform, we can use an easy to
compute kernel function while still gaining the complexity of operating in higher dimensions. This process
of replacing inner products with a function is known as the “kernel trick” throughout ML. You’ll need to
generalize the above result to answer the questions on SVMs at the end of this document.
Questions
1. k-Nearest Neighbors
(a) Find a good setting for k for the datasets HTRU, iris, and optdigits using the euclid distance
metric. Include plots (one for each dataset, three plots total) which show both testing and training
error as a function of k, and explain your reasoning.
(b) Repeat the previous question for the other two distance metrics, manhattan, and mahalanobis
(so six more plots). Which had the best performance, and at which value of k?
2. Logistic Regression
(a) Fit a Logistic Regression model to the dataset HTRU. Compare the performance with k-NN.
(b) The other two datasets have more than one class. Fit a Logistic Regression model for each class
(3 for iris and 10 for optdigits) and again compare the performance with k-NN.
(c) In general, under what circumstances does Logistic Regression perform well? Looking at your
results from the previous questions, do you think these conditions hold for any of the given
datasets? Explain your reasoning.
3. Support Vector Machines
(a) Show how you can re-write the model
hw,w0
(x) = sgn
w0 + φ(x)
>w
so that the feature transform of the inputs, φ(x), never has to be explicitly computed. That is,
show how you can go from the expression above to the one given in the slides. Explain why you
need to do this to use the “kernelized” version of SVMs, and what role α
(i) plays in the efficiency
of this computation.
(b) How many dimensions would the feature transform φ(x) have if computed explicitly for the
following kernels?
klin(x, x
0
) = x
>x
0
kpoly(x, x
0
) = (1 + x
>x
0
)
M
kexp(x, x
0
) = exp
−
kx − x
0k
2
2σ
2
Assume that the x have D components. (Hint for kexp: consider the Taylor series expansion of
e
z
.)
(c) (Murphy exercise 14.1) Consider a dataset with 2 points in 1d: (
√
x1 = 0, y1 = −1) and (x2 =
2, y2 = 1). Consider mapping each point to 3d using the feature vector φ(x) = [1,
√
2x, x2
]
>.
(This is equivalent to using a second order polynomial kernel.) The max margin classifier has the
form
minkwk
2
s.t.
y1(w>φ(x1) + w0) ≥ 1
y2(w>φ(x2) + w0) ≥ 1
Write down a vector that is parallel to the optimal vector w. Hint: recall that w is perpendicular
to the decision boundary between the two points in the 3d feature space.
(d) What is the value of the margin that is achieved by this w? Hint: recall that the margin is the
distance from each support vecctor to the decision boundary. Hint 2: think about the geometry
of 2 points in space, with a line separating one from the other.
(e) Solve for w, using the fact that the margin is equal to 1/kwk.
(f) Solve for w0 using your value for w and the two constraints listed above. Hint: the points will be
on the decision boundary, so the inequalities will be tight.
(g) Write down the form of the discriminant function f(x) = w0 + w>φ(x) as an explicit function of
x.