Description
1 Task
Your task is to build a Support Vector Machine for classification. You will make
use of a library routine to solve the convex optimization problem which emerges
in the dual formulation of the support vector machine.
You need to write
code for structuring the data so that the library routine can find the maximalmargin solution, and code for transforming this solution into a classifier which
can process new data.
You will work in Python and use the numpy and pylab packages. You will
also need the cvxopt package for convex optimization.
For reporting, you need to produce plots showing the decision boundary
when using different kernel funcions. Optionally, you can also study how the
introduction of slack variables change the boundary.
Note: The package cvxopt is not installed on the Sun computers in the lab
rooms. Use the GNU/Linux machines in Gr¨on, or log in remotely (via
ssh -X) on either u1.nada.kth.se or u2.nada.kth.se.
2 Theory
The idea is to build a classifier which first makes an (optional) transformation
of the input data, and then a linear separation where the decision boundary is
placed to give maximal margins to the available data points. The location of the
decision boundary is given by the weights ( ~w) and the bias (b) so the problem
is to find the values for ~w and b which maximizes the margin, i.e. the distance
to any datapoint.
The primal formulation of this optimization problem can be stated mathematically like this:
min
~w,b
|| ~w|| (1)
1
under the constraints
ti( ~w
T
· φ(~xi) + b) ≥ 1 ∀i (2)
where we have used the following notation:
~w Weight vector defining the separating hyperplane
b Bias for the hyperplane
~xi The ith datapoint
ti Target class (−1 or 1) for datapoint i
φ(. . .) Optional transformation of the input data
The constraints (2) enforce that all datapoints are not only correctly classified, but also that they stay clear of the decision boundary by a certain margin.
Solving this optimization problem results in values for ~w and b which makes it
possible to classify a new datapoint ~x? using this indicator function:
ind(~x?
) = ~w
T
· φ(~x?
) + b (3)
If the indicator returns a positive value, we say that ~x? belongs to class 1, if
it gives a negative value, we conclude that the class is −1. All the training data
should have indicator values above 1 or below −1, since the interval between
−1 and 1 constitutes the margin.
The bias variable, b, can be eliminated by a standard trick, i.e. by incorporating it as an extra element in the weight vector ~w. We then need to let the
φ(. . .) function append a corresponding constant component, typically the value
1. Note that by using this trick we are actually slightly modifying the problem,
since we are now also including the bias value in the cost function (|| ~w||). In
practice, this will not make much difference, and we will use this “bias free”
version from here on.
2.1 Dual Formulation
The optimization problem can be transformed into a different form, called the
dual problem which has some computational advantages. In particular, it makes
it possible to use the kernel trick, thereby eliminating the need for evaluating
the φ(. . .) function directly. This allows us to use transformations into very
high-dimensional spaces without the penalty of excessive computational costs.
The dual form of the problem is to find the values αi which minimizes:
1
2
X
i
X
j
αiαj titjK(~xi
, ~xj ) −
X
i
αi (4)
subject to the constraints
αi ≥ 0 ∀i (5)
The function K(~xi
, ~xj ) is called a kernel function and computes the scalar
value corresponding to φ(~xi)· φ(~xj ). This is, however, normally done implicitly,
2
i.e., without actually computing the two vectors and taking their scalar product
(see section 2.4).
The indicator function now takes the form:
ind(~x?
) = X
i
αitiK(~x?
, ~xi) (6)
For normal data sets, only a handful of the α’s will be non-zero. Most of the
terms in the indicator function will therefore be zero, and since this is known
beforehand its evaluation can often be made very efficient.
2.2 Matrix Formulation
The dual problem can be expressed in a more compact form using vectors and
matrices: find the vector ~α which minimizes
1
2
~α
T P ~α − ~α · ~1 where ~α ≥ ~0 (7)
~1 denotes a vector where all elements are one. Correspondingly, ~0 is a vector
with all zeros. We have also introduced the matrix P, with these elements:
Pi,j = titjK(~xi
, ~xj ) (8)
In fact, this is a standard form for formulating quadratic optimization problems with linear constraints. This is a well known class of optimization problems
where efficient solving algorithms are available.
2.3 Adding Slack Variables
The above method will fail if the training data are not linearly separable. In
many cases, especially when the data contain some sort of noise, it is desirable
to allow a few datapoints to be misclassified if it results in a substantially wider
margin. This is where the method of slack variables comes in.
Instead of requiring that every datapoint is on the right side of the margin
(equation 2) we will now allow for mistakes, quantified by variables ξi (one for
each datapoint). These are called slack variables. The constraints will now be
ti( ~w
T
· φ(~xi)) ≥ 1 − ξi ∀i (9)
(Remember that we have already eliminated b from (2) by including it as a
weight).
To make sense, we must ensure that the slack variables do not become unnecessarily large. This is easily achieved by adding a penalty term to the cost
function, such that large ξ values will be penalized:
min
~w,ξ~
|| ~w|| + C
X
i
ξi (10)
The new parameter C sets the relative importance of avoiding slack versus
getting a wider margin. This has to be selected by the user, based on the
character of the data. Noisy data typically deserve a low C value, allowing for
more slack, since individual datapoints in strange locations should not be taken
too seriously.
Fortunately, the dual formulation of the problem need only a slight modification to incorporate the slack variables. In fact, we only need to add an extra
set of constraints to (5):
αi ≥ 0 ∀i and αi ≤ C ∀i (11)
Equation (4) stays the same.
2.4 Selection of Kernel Function
One of the great advantages to support vector machines is that they are not
restricted to linear separation. By transforming the input data non-linearly to
a high-dimensional space, more complex decision boundaries can be utilized.
In the dual formulation, these transformed data points φ(~xi) always appear in
pairs, and the only thing needed is the scalar product between the pair. This
makes it possible to use what is often referred to as the kernel trick, i.e. we do
not actually have to make the data transformation but, instead, we use a kernel
function which directly returns the scalar product φ(~xi) · φ(~xj ).
Here are the most commonly used kernel functions:
• Linear kernel
K(~x, ~y) = ~xT
· ~y + 1
This kernel simply returns the scalar product between the two points.
This results in a linear separation. Note the addition of 1 which comes
from the elimination of the bias (b) by appending a constant element 1 to
the data, resulting in an extra term 1 × 1 added to the scalar product.
• Polynomial kernels
K(~x, ~y) = (~xT
· ~y + 1)p
This kernel allows for curved decision boundaries. The exponent p (a
positive integer) controls the degree of the polynomials. p = 2 will make
quadratic shapes (ellipses, parabolas, hyperbolas). Setting p = 3 or higher
will result in more complex shapes.
• Radial Basis Function kernels
K(~x, ~y) = e
−
(~x−~y)
2
2σ2
This kernel uses the explicit difference between the two datapoints, and
often results in very good boundaries. The parameter σ can be used to
control the smoothness of the boundary.
• Sigmoid kernels
K(~x, ~y) = tanh(k~xT
· ~y − δ)
This is yet another possible non-linear kernel. Parameters k and δ need
to be tuned to get best performance.
3 Implementation
We will make use of a Python package for convex optimization1
called cvxopt.
In particular, we will call the function qp to solve our quadratic optimization
problem. The cvxopt package relies on its own implementation of matrices so
we must convert numpy arrays to cvxopt matrices.
Start by importing qp and matrix from cvxopt, and the other packages you
will need:
from cvxopt . s o l v e r s import qp
from cvxopt . b a se import ma t rix
import numpy , pylab , random , math
matrix is a function which takes anything that can be interpreted as a
matrix, for example a numpy array or an ordinary Python list of lists, and
converts it into a cvxopt matrix which can be passed as a parameter to qp.
The call to qp can look like this:
r = qp ( ma trix (P) , ma trix ( q ) , ma t rix (G) , ma trix ( h ) )
alph a = l i s t ( r [ ’ x ’ ] )
This will find the ~α which minimizes
1
2
~α
T P ~α + ~qT
~α while G~α ≤ ~h (12)
Here, P and G are matrices, while ~q and ~h are vectors.
As you can see, this is very similar to the problem we have. This is no
coincidence, because we have formulated the problem in a standard way. To use
qp for our problem we only need to build the necessary vectors and matrices;
then qp will give us the optimal α’s.
3.1 Things to implement
You will have to write code for:
1Quadratic problems are a special case of convex problems.
• A suitable kernel function
The kernel function takes two data points as arguments and returns a
“scalar product-like” similarity measure; a scalar value. Start with the
linear kernel which is almost the same as an ordinary scalar product. Do
not forget that you need to add 1 to the output, representing the extra
“always-one” component.
• Build the P matrix from a given set of data points
From the theory section we know that the P matrix should have the
elements
Pi,j = titjK(~xi
, ~xj )
Indices i and j run over all the data points. Thus, if you have N data
points, P should be an N × N matrix.
• Build the ~q vector, G matrix, and ~h vector
These vectors and matrices do not hold any actual data. Still, they need
to be set up properly so that qp solves the right problem.
By matching our problem (equation 7) with what qt solves (equation 12)
we can see that ~q should be a N long vector containing only the number
−1. Similarly, we realize that ~h must be a vector with all zeros.
Note that the greater-than relation in (7) has to be made to match the
less-than relation in (12). This can be achieved by creating a G matrix
which has −1 in the diagonal and zero everywhere else (check this!).
• Call qp
Make the call to qp as indicated in the code sample above. qp returns a
dictionary data structure; this is why we must must use the string ’x’ as
an index to pick out the actual α values.
• Pick out the non-zero α values
If everything else is correct, only a few of the α values will be non-zero.
Since we are dealing with floating point values, however, those that are
supposed to be zero will in reality only be approximately zero. Therefore,
use a low threshold (10−5
should work fine) to determine which are to be
regarded as non-zero.
You need to save the non-zero αi
’s along with the corresponding data
points (~xi) in a separate data structure, for instance a list.
• Implement the indicator function
Implement the indicator function (equation 6) which uses the non-zero
αi
’s together with their ~xi
’s to classify new points.
4 Generating Test Data
In order to visualize the decision boundaries graphically we will restrict ourselves
to two-dimensional data, i.e. points in the plane. The data will have the form
of a vector of datapoints, where each datapoint is a triple of numbers. The first
two numbers are the x and y coordiates and the last number is the class (−1 or
1).
We will use the function random.normalvariate to generate random numbers with a normal distribution. This is suitable for our task as we can build up
more complex distributions by concatenating sample sets from multiple normal
distributions.
# Uncomment t h e l i n e bel ow t o g e n e r a t e
# t h e same d a t a s e t ove r and ove r ag a in .
# numpy . random . see d ( 1 0 0 )
cl a s sA = [ ( random . n o rm al v a ri a t e ( −1.5 , 1 ) ,
random . n o rm al v a ri a t e ( 0 . 5 , 1 ) ,
1 . 0 )
for i in range ( 5 ) ] + \
[ ( random . n o rm al v a ri a t e ( 1 . 5 , 1 ) ,
random . n o rm al v a ri a t e ( 0 . 5 , 1 ) ,
1 . 0 )
for i in range ( 5 ) ]
cl a s s B = [ ( random . n o rm al v a ri a t e ( 0 . 0 , 0 . 5 ) ,
random . n o rm al v a ri a t e ( −0.5 , 0 . 5 ) ,
−1.0)
for i in range ( 1 0 ) ]
data = cl a s sA + cl a s s B
random . s h u f f l e ( data )
This will create ten datapoints for each class. The last line randomly reorders
the points. You may want to change the values to move the clusters of data
around.
In order to see your data, you can use the plot functions from pylab. This
code will plot your two classes using blue and red dots.
pylab . h old ( True )
pylab . pl o t ( [ p [ 0 ] for p in cl a s sA ] ,
[ p [ 1 ] for p in cl a s sA ] ,
’ bo ’ )
pylab . pl o t ( [ p [ 0 ] for p in cl a s s B ] ,
[ p [ 1 ] for p in cl a s s B ] ,
’ r o ’ )
pylab . show ( )
5 Plotting the Decision Boundary
Plotting the decision boundary is a great way of visualizing how the resulting
support vector machine classifies new datapoints. The idea is to plot a curve
in the input space (which is two dimensional here), such that all points on one
side of the curve is classified as one class, and all points on the other side are
classified as the other.
pylab has a function call contour that can be used to plot contour lines
of a function given as values on a grid. Decision boundaries are special cases
of contour lines; by drawing a contour at the level where the classifier has its
threshold we will get the decision boundary.
What we will have to do is to call your indicator function at a large number
of points to see what the classification is at those points. We then draw a
contour line at level zero, but also countour lines at −1 and 1 to visualize the
margin.
xrange=numpy . a r an ge (−4, 4 , 0 . 0 5 )
yrange=numpy . a r an ge (−4, 4 , 0 . 0 5 )
g ri d=ma trix ( [ [ i n d i c a t o r ( x , y )
for y in yrange ]
for x in xrange ] )
pylab . c on t ou r (xrange , yrange , g ri d ,
( −1.0 , 0 . 0 , 1 . 0 ) ,
c o l o r s =( ’ red ’ , ’ bl a c k ’ , ’ bl u e ’ ) ,
l i n e w i d t h s =(1 , 3 , 1 ) )
If everything is correct, the margins should touch some of the datapoints.
These are the support vectors, and should correspond to datapoints with nonzero α’s. You may want to plot datapoints with non-zero α’s using a separate
kind of marker to visualize this.
6 Running and Reporting
Once you have the linear kernel running, there are a number of things you can
explore. Remember that support vector machines are especially good at finding
a reasonable decision boundary from small sets of training data.
1. Move the clusters around to make it easier or harder for the classifier to find a decent boundary. Pay attention to when the qt
function prints an error message that it can not find a solution.
2. Implement some of the non-linear kernels. you should be able to
classify very hard datasets.
3. The non-linear kernels have parameters; explore how they influence
the decision boundary. Reason about this in terms of the biasvariance trade-off.
7 Slack Implementation
You should now alter your program to include slack variables. They can quite
easily be incorporated by adding the extra contraints (equation 11). This means
that you have to extend the matrix G and vector ~h with additional rows corresponding to the αi ≤ C constraints:
• G should now be a 2N × N matrix. Note that the constraint is now
reversed, resulting in the additional lower part being negated as compared
to previously.
•
~h should now be a 2N column vector.
Repeat the previous exercise with slack variables added.
Answer the additional following questions:
1. Explore the role of the parameter C. What happens for very
large/small values?
2. Imagine that you are given data that is not easily separable. When
should you opt for more slack rather than going for a more complex
model and vice versa?
8 The End
You are now done. Please make sure you answered all the questions and printed
plots to support your reasoning.