## Description

## 1 [25 points] Direct Construction of Valid Kernels

In class, we saw that by choosing a kernel $K(x, z) = \phi(x)^T \phi(z)$, we can implicitly map data to a high-dimensional space, and have the SVM algorithm work in that space. One way to generate kernels is to explicitly define the mapping $\phi$ to a higher dimensional space, and then work out the corresponding $K$.

However, in this question we are interested in the direct construction of kernels. That is, suppose we have a function $K(x, z)$ that we think gives an appropriate similarity measure for our learning problem, and we are considering plugging $K$ into a kernelized algorithm (like the SVM) as the kernel function. For $K(x, z)$ to be a valid kernel, it must correspond to an inner product in some higher dimensional space resulting from some feature mapping $\phi$. Mercer's theorem tells us that $K(x, z)$ is a (Mercer) kernel if and only if for any finite set $\{x^{(1)}, \ldots, x^{(N)}\}$, the matrix $K$ is symmetric and positive semidefinite, where the square matrix $K \in \mathbb{R}^{N \times N}$ is given by $K_{ij} = K(x^{(i)}, x^{(j)})$.
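Mercer's condition can be probed numerically on any one finite sample: build the Gram matrix and check symmetry and positive semidefiniteness. The sketch below is illustrative only (it is not part of the assignment, the function names are invented, and NumPy is assumed); note that passing on one sample is necessary but not sufficient, since a true kernel must pass for *every* finite set of points.

```python
import numpy as np

def gram_matrix(K, X):
    """Gram matrix with entries K_ij = K(x^(i), x^(j)) for the rows of X."""
    N = len(X)
    return np.array([[K(X[i], X[j]) for j in range(N)] for i in range(N)])

def mercer_check(K, X, tol=1e-9):
    """Check symmetry and positive semidefiniteness of the Gram matrix on ONE
    finite sample. Passing is necessary, not sufficient, for K to be a kernel."""
    G = gram_matrix(K, X)
    symmetric = np.allclose(G, G.T)
    psd = np.all(np.linalg.eigvalsh(G) >= -tol)   # all eigenvalues >= 0 (up to tolerance)
    return bool(symmetric and psd)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

linear = lambda x, z: float(x @ z)      # a valid kernel: phi is the identity map
negated = lambda x, z: -float(x @ z)    # not a kernel: its Gram matrix has negative eigenvalues
```

For example, `mercer_check(linear, X)` returns `True`, while `mercer_check(negated, X)` returns `False`; this kind of numerical check can help you form conjectures for the parts below before proving them.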

Now here comes the question: let $K_1$, $K_2$ be kernels over $\mathbb{R}^N \times \mathbb{R}^N$, let $a \in \mathbb{R}^+$ be a positive real number, let $f : \mathbb{R}^N \to \mathbb{R}$ be a real-valued function, let $\phi : \mathbb{R}^N \to \mathbb{R}^d$ be a function mapping from $\mathbb{R}^N$ to $\mathbb{R}^d$, let $K_3$ be a kernel over $\mathbb{R}^d \times \mathbb{R}^d$, and let $p : \mathbb{R} \to \mathbb{R}$ be a polynomial with positive coefficients.

For each of the functions $K$ below, state whether it is necessarily a kernel. If you think it is, prove it; if you think it isn't, give a counter-example.

(a) $K(x, z) = K_1(x, z) + K_2(x, z)$

(b) $K(x, z) = K_1(x, z) - K_2(x, z)$

(c) $K(x, z) = aK_1(x, z)$

(d) $K(x, z) = -aK_1(x, z)$

(e) $K(x, z) = K_1(x, z)K_2(x, z)$

(f) $K(x, z) = f(x)f(z)$

(g) $K(x, z) = K_3(\phi(x), \phi(z))$

(h) $K(x, z) = p(K_1(x, z))$

[Hint: For part (e), the answer is that the K there is indeed a kernel. You still have to prove it, though.

(This one may be harder than the rest.) This result may also be useful for another part of the problem.]

(i) [4 points extra credit] Prove that the Gaussian kernel, $K(x, z) = \exp(-\|x - z\|^2 / 2\sigma^2)$, can be expressed as $\phi(x)^T \phi(z)$, where $\phi(\cdot)$ is an infinite-dimensional vector.

[Hint: $\|x - z\|^2 = x^T x + z^T z - 2x^T z$. Consider using a power series.]

## 2 [25 points] Kernelizing the Perceptron

Let there be a binary classification problem with $t \in \{0, 1\}$. The perceptron uses hypotheses of the form $y(x; w) = f(w^T x)$, where $f(z) = \mathbb{I}[z \geq 0]$. In this problem we will consider a stochastic gradient descent-like implementation of the perceptron algorithm where each update to the parameters $w$ is made using only one training example. However, unlike stochastic gradient descent, the perceptron algorithm will only make one pass through the entire training set. The update rule for this version of the perceptron algorithm is given by

$$w^{(i+1)} := w^{(i)} + \alpha \left[ t^{(i+1)} - y(x^{(i+1)}; w^{(i)}) \right] x^{(i+1)}$$

where $w^{(i)}$ is the value of the parameters after the algorithm has seen the first $i$ training examples. Prior to seeing any training examples, $w^{(0)}$ is initialized to $\vec{0}$.
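In explicit (non-kernelized) form, the update rule above can be sketched in a few lines; this is a minimal illustration (NumPy assumed, `perceptron_one_pass` is an invented name), not the answer to the problem. The kernelized version you are asked for must reproduce this behavior without ever forming $w$ or $\phi(x)$ explicitly.

```python
import numpy as np

def perceptron_one_pass(X, t, alpha=1.0):
    """One pass over the training set with the update
    w := w + alpha * (t - y) * x, where y(x; w) = I[w^T x >= 0]
    and labels t are in {0, 1}. w starts at the zero vector."""
    w = np.zeros(X.shape[1])
    for x_i, t_i in zip(X, t):
        y_i = 1 if w @ x_i >= 0 else 0       # current prediction
        w = w + alpha * (t_i - y_i) * x_i    # no change when the prediction is correct
    return w
```

Note that $w$ is only ever modified by adding multiples of training points; this observation is the key to parts (a)-(c) below.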

Let $K$ be a Mercer kernel corresponding to some very high-dimensional feature mapping $\phi$. Suppose $\phi$ is so high-dimensional (say, infinite-dimensional) that it's infeasible to ever represent $\phi(x)$ explicitly. Describe how you would apply the "kernel trick" to the perceptron to make it work in the high-dimensional feature space $\phi$, but without ever explicitly computing $\phi(x)$. [Note: You don't have to worry about the intercept term. If you like, think of $\phi$ as having the property that $\phi_0(x) = 1$, so that this is taken care of.] Your description should specify:

(a) How you will (implicitly) represent the high-dimensional parameter vector $w^{(i)}$, including how the initial value $w^{(0)} = \vec{0}$ is represented (note that $w^{(i)}$ is now a vector whose dimension is the same as the feature vectors $\phi(x)$);

(b) How you will efficiently make a prediction on a new input $x^{(i+1)}$. I.e., how you will compute $y(x^{(i+1)}; w^{(i)}) = f(w^{(i)T} \phi(x^{(i+1)}))$, using your representation of $w^{(i)}$; and

(c) How you will modify the update rule given above to perform an update to $w$ on a new training example $(x^{(i+1)}, t^{(i+1)})$; i.e., using the update rule corresponding to the feature mapping $\phi$:

$$w^{(i+1)} := w^{(i)} + \alpha \left[ t^{(i+1)} - y(\phi(x^{(i+1)}); w^{(i)}) \right] \phi(x^{(i+1)})$$

[Note: If you prefer, you are also welcome to do this problem using the convention of labels $t \in \{-1, 1\}$, and $f(z) = \mathrm{sign}(z) = 1$ if $z \geq 0$, $-1$ otherwise.]

## 3 [32 points] Implementing Soft Margin SVM by Optimizing Primal Objective

The Support Vector Machine (SVM) is a discriminative model for classification. Although it is possible to develop SVMs that do $K$-class classification, we are going to restrict ourselves to binary classification in this question, where the class label is either $+1$ (positive) or $-1$ (negative). SVM is not a probabilistic algorithm; in other words, in its usual construction, it does not optimize a probability measure as a likelihood. SVM tries to find the "best" hyperplane that maximally¹ separates the positive class from the negative class.

¹"Maximally" means increasing the geometric margin. Review lecture slides if you need a refresher. In addition, the construction of SVMs will be reviewed in this Friday's discussion section.


Recall that the objective function for maximizing the soft margin is equivalent to the following minimization problem:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$$

$$\text{subject to } \ t^{(i)}(w^T x^{(i)} + b) \geq 1 - \xi_i, \quad i = 1, \ldots, N$$

$$\xi_i \geq 0, \quad i = 1, \ldots, N$$

The above is known as the Primal Objective of SVM.² Notice the two constraints on $\xi_i$: each $\xi_i$ must satisfy both conditions, while the sum of the $\xi_i$'s times $C$ is minimized. The constrained minimization is equivalent to the following minimization:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \max\left(0,\ 1 - t^{(i)}(w^T x^{(i)} + b)\right)$$

You will be working with the above minimization in this problem. However, please think to yourself why combining the minimization of the $\xi_i$'s and the two constraints became a max as shown above. To shorten notation, let's define our loss function $E(w, b)$ as:

$$E(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \max\left(0,\ 1 - t^{(i)}(w^T x^{(i)} + b)\right)$$

(a) [6 points] Find the derivatives of the loss function with respect to our parameters. Show that:

$$\nabla_w E(w, b) = w - C \sum_{i=1}^{N} \mathbb{I}\left[ t^{(i)}(w^T x^{(i)} + b) < 1 \right] t^{(i)} x^{(i)}$$

$$\frac{\partial}{\partial b} E(w, b) = -C \sum_{i=1}^{N} \mathbb{I}\left[ t^{(i)}(w^T x^{(i)} + b) < 1 \right] t^{(i)}$$

where $\mathbb{I}[\cdot]$ is the indicator function.
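Before building your implementation on these formulas, it is easy to sanity-check them against finite differences. The sketch below is illustrative (NumPy assumed, all names invented); when comparing, avoid parameter values where some margin is exactly 1, since the hinge is not differentiable there.

```python
import numpy as np

def loss(w, b, X, t, C):
    """E(w, b) = 0.5*||w||^2 + C * sum_i max(0, 1 - t_i (w^T x_i + b))."""
    margins = t * (X @ w + b)
    return 0.5 * (w @ w) + C * np.sum(np.maximum(0.0, 1.0 - margins))

def grads(w, b, X, t, C):
    """Analytic gradients from part (a)."""
    viol = (t * (X @ w + b) < 1).astype(float)   # indicator I[t_i (w^T x_i + b) < 1]
    grad_w = w - C * (X.T @ (viol * t))
    grad_b = -C * np.sum(viol * t)
    return grad_w, grad_b

def numeric_grads(w, b, X, t, C, eps=1e-6):
    """Central finite differences in each coordinate of w, and in b."""
    gw = np.zeros_like(w)
    for k in range(len(w)):
        e = np.zeros_like(w); e[k] = eps
        gw[k] = (loss(w + e, b, X, t, C) - loss(w - e, b, X, t, C)) / (2 * eps)
    gb = (loss(w, b + eps, X, t, C) - loss(w, b - eps, X, t, C)) / (2 * eps)
    return gw, gb
```

If the analytic and numeric gradients disagree by more than a small tolerance, there is a sign or indexing bug in your implementation.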

(b) [10 points] Implement the SVM algorithm using batch gradient descent. In previous assignments, you have implemented gradient descent while minimizing over one variable. Minimizing over two variables $(w, b)$ is no different: both gradients are computed from the current parameter values, and then the parameters are simultaneously updated. Note: it is a common mistake to implement gradient descent over multiple parameters by updating the first parameter, then computing the derivative w.r.t. the second parameter using the updated value of the first parameter.³ The pseudocode for optimizing $w$ and $b$ is given by:

Algorithm 1: SVM Batch Gradient Descent

    w* ← 0
    b* ← 0
    for j = 1 to NumIterations do
        wgrad ← ∇_w E(w*, b*)
        bgrad ← ∂E(w*, b*)/∂b
        w* ← w* − α(j) · wgrad
        b* ← b* − 0.01 · α(j) · bgrad
    end
    return w*

²In EECS445, we don't dive deep into the Dual Objective of SVM. The Dual formulation is covered in depth in EECS545.

³In fact, updating one parameter and then computing the gradient of the second parameter using the updated value of the first parameter is a different optimization algorithm, known as Coordinate Descent.


The learning rate $\alpha(\cdot)$ is now a function of time (iteration number) rather than a constant. This allows us to define a decaying learning rate as:

$$\alpha(j) = \frac{\eta_0}{1 + j \cdot \eta_0}$$

Throughout this question, set $\eta_0 = 0.5$ and the slack cost $C = 5$. Loading the file q3.mat loads the variables q3x_train, q3t_train, q3x_test, and q3t_test. Run your implementation of SVM Batch Gradient Descent over the training data 6 times, once for each NumIterations ∈ {5, 50, 100, 1000, 5000, 6000}. For each run, report your trained parameters $(w, b)$ and the test classification accuracy.
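Putting Algorithm 1 and the decaying learning rate together, a minimal sketch looks roughly as follows. This is an illustration only (the assignment itself is in MATLAB; the names here are invented, NumPy is assumed, and both $w^*$ and $b^*$ are returned so the parameters can be reported):

```python
import numpy as np

def alpha(j, eta0=0.5):
    """Decaying learning rate alpha(j) = eta0 / (1 + j * eta0)."""
    return eta0 / (1 + j * eta0)

def svm_batch_gd(X, t, C=5.0, eta0=0.5, num_iters=1000):
    """Batch gradient descent on the primal soft-margin objective,
    using the gradients from part (a)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for j in range(1, num_iters + 1):
        viol = (t * (X @ w + b) < 1).astype(float)  # margin-violation indicator
        wgrad = w - C * (X.T @ (viol * t))
        bgrad = -C * np.sum(viol * t)
        w = w - alpha(j, eta0) * wgrad
        b = b - 0.01 * alpha(j, eta0) * bgrad       # note the extra 0.01 on b, as in Algorithm 1
    return w, b
```

Both gradients are computed from the *current* $(w, b)$ before either parameter is touched, which is exactly the simultaneous-update requirement stated above.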

(c) [4 points] In Stochastic Gradient Descent, we compute the gradients and update our parameters for

every training example (rather than batch updating). To do stochastic gradient descent properly, we

need to define a loss function per example (the loss function E(w, b), above, is defined over the entire

training set). We can rewrite the loss function above as:

$$E(w, b) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \max\left(0,\ 1 - t^{(i)}(w^T x^{(i)} + b)\right)$$

$$= \sum_{i=1}^{N} \left[ \frac{1}{2N}\|w\|^2 + C \max\left(0,\ 1 - t^{(i)}(w^T x^{(i)} + b)\right) \right]$$

Expressing the error function as one summation allows us to define an error term per training example like:

$$E^{(i)}(w, b) = \frac{1}{2N}\|w\|^2 + C \max\left(0,\ 1 - t^{(i)}(w^T x^{(i)} + b)\right)$$

$$\text{then,} \quad E(w, b) = \sum_{i=1}^{N} E^{(i)}(w, b)$$

Find the expressions for $\nabla_w E^{(i)}(w, b)$ and $\frac{\partial}{\partial b} E^{(i)}(w, b)$.

(d) [8 points] Implement the SVM algorithm using stochastic gradient descent (SGD). The pseudocode looks like:

Algorithm 2: SVM Stochastic Gradient Descent

    w* ← 0
    b* ← 0
    for j = 1 to NumIterations do
        for i = 1 to N do
            wgrad ← ∇_w E^(i)(w*, b*)
            bgrad ← ∂E^(i)(w*, b*)/∂b
            w* ← w* − α(j) · wgrad
            b* ← b* − 0.01 · α(j) · bgrad
        end
    end
    return w*
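The control flow of Algorithm 2 can be sketched as below. To avoid giving away part (c), this illustration takes the per-example gradients *numerically* from the given $E^{(i)}$ rather than in closed form; in your actual (MATLAB) implementation you would substitute the closed-form gradients you derive. All names here are invented and NumPy is assumed.

```python
import numpy as np

def per_example_loss(w, b, x_i, t_i, N, C=5.0):
    """E^(i)(w, b) = ||w||^2 / (2N) + C * max(0, 1 - t_i (w^T x_i + b))."""
    return (w @ w) / (2 * N) + C * max(0.0, 1.0 - t_i * (x_i @ w + b))

def svm_sgd(X, t, C=5.0, eta0=0.5, num_iters=100, eps=1e-6):
    """Algorithm 2: parameters are updated after EVERY training example.
    Per-example gradients are approximated by central differences here,
    leaving the closed-form expressions of part (c) to the reader."""
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for j in range(1, num_iters + 1):
        alpha = eta0 / (1 + j * eta0)
        for i in range(N):
            wgrad = np.zeros(d)
            for k in range(d):
                e = np.zeros(d); e[k] = eps
                wgrad[k] = (per_example_loss(w + e, b, X[i], t[i], N, C)
                            - per_example_loss(w - e, b, X[i], t[i], N, C)) / (2 * eps)
            bgrad = (per_example_loss(w, b + eps, X[i], t[i], N, C)
                     - per_example_loss(w, b - eps, X[i], t[i], N, C)) / (2 * eps)
            w = w - alpha * wgrad
            b = b - 0.01 * alpha * bgrad
    return w, b
```

The outer loop over $j$ and inner loop over $i$ mirror Algorithm 2 exactly: one learning-rate value $\alpha(j)$ per pass, one update per example.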

Use the same $\alpha(\cdot)$, $\eta_0$, and $C$ as in part (b). Run your implementation of SVM Stochastic Gradient Descent over the training data 6 times, once for each NumIterations ∈ {5, 50, 100, 1000, 5000, 6000}. For each run, report your trained parameters $(w, b)$ and the test classification accuracy.


(e) [4 points] What can you conclude about the convergence rate of stochastic gradient descent (SGD) versus batch gradient descent? How did you reach this conclusion?

## 4 [18 points] Using Liblinear for SVM Classification

In this problem, we will use SVM to build a SPAM classifier. You have seen this dataset in the previous homework! Rather than building your implementation from scratch, you will use LIBLINEAR.⁴

In recent years, spam on electronic newsgroups has been an increasing problem. Here, we'll build a classifier to distinguish between "real" newsgroup messages and spam messages. For this experiment, we obtained a set of spam emails and a set of genuine newsgroup messages. Using only the subject line and body of each message, we'll learn to distinguish between the spam and non-spam.

The problem files live inside the folder q4 in the data for this assignment. You can look at two sample spam emails in the files spam_sample_original*, and their preprocessed forms in the files spam_sample_preprocessed*. The first line in the preprocessed format is just the label and is not part of the message. The preprocessing ensures that only the message body and subject remain in the dataset; email addresses (EMAILADDR), web addresses (HTTPADDR), currency (DOLLAR), and numbers (NUMBER) were also replaced by special tokens to allow them to be considered properly in the classification process. (In this problem, we're going to call the features "tokens" rather than "terms," since some of the features will correspond to special values like EMAILADDR. You don't have to worry about the distinction.) The files news_sample_original and news_sample_preprocessed also give an example of a non-spam mail.

The work to extract feature vectors out of the documents has also been done for you, so you can just load in the design matrices (called document-term matrices in text classification) containing all the data. In a document-term matrix, the $i$th row represents the $i$th document/email, and the $j$th column represents the $j$th distinct token. Thus, the $(i, j)$-entry of this matrix represents the number of occurrences of the $j$th token in the $i$th document.
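As a concrete illustration of this layout, here is a toy document-term matrix; the vocabulary and documents below are invented for illustration and are not taken from the assignment's data.

```python
# Toy document-term matrix.
vocab = ["price", "meeting", "dollar"]   # the j-th column corresponds to the j-th token
docs = ["price price dollar",            # the i-th row corresponds to the i-th document
        "meeting price"]

# (i, j)-entry = number of occurrences of token j in document i
dtm = [[doc.split().count(tok) for tok in vocab] for doc in docs]
# dtm == [[2, 0, 1], [1, 1, 0]]
```

The real matrices you load are the same structure, just much larger and mostly zeros, which is why they are stored in a sparse format.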

For this problem, we've chosen as our set of tokens considered (that is, as our vocabulary) only the medium-frequency tokens. The intuition is that tokens that occur too often or too rarely do not have much classification value. (Example tokens that occur very often are terms like "the," "and," and "of," which occur in so many emails and are sufficiently content-free that they aren't worth modeling.) Also, terms were stemmed using a standard stemming algorithm; basically, this means that "price," "prices," and "priced" have all been replaced with "price," so that they can be treated as the same term. For a list of the tokens used, see the file TOKENS_LIST.

Since the document-term matrix is extremely sparse (has lots of zero entries), we have stored it in our own efficient format to save space. Similar to the previous homework, the function full() is used to convert the sparse matrix into a normal MATLAB matrix. This is already done for you! Modify the indicated sections of svm_train.m and svm_test.m.

(a) [15 points] Train an SVM on this dataset using the LIBLINEAR SVM library, available for download from http://www.csie.ntu.edu.tw/~cjlin/liblinear/. This implements an SVM using a linear kernel. Like the Naive Bayes implementation, an outline for your code is provided in svm_train.m and svm_test.m.

See README.txt for instructions for downloading and installing LIBLINEAR. Train an SVM with training set sizes 50, 100, 200, ..., 1400, using the file MATRIX.TRAIN. E.g., to use a training set of size 50, use the first 50 points from MATRIX.TRAIN. Plot the test error each time, using MATRIX.TEST as the test data. Use the LIBLINEAR default options when training and testing. You don't need to try different parameter values.

HINT: Running LIBLINEAR in MATLAB on Windows can be buggy, depending on which version of Windows you run. However, there are command-line programs (train and predict) that you can run (without using MATLAB), which are located in liblinear-1.94/windows for Windows and liblinear-1.94/ for Linux/Unix. If you do it this way, please include the commands that you run from the command line in your solution. The manual for usage of the command-line programs is available from http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf

Note that you must still convert the data into SVM form (using −1, 1 instead of 0, 1).

⁴Credit: Question adapted from http://www.stanford.edu/class/cs229/ps/ps2.pdf

(b) [3 points] How do naive Bayes (from previous assignment) and Support Vector Machines compare (in

terms of generalization error) as a function of the training set size?

## 5 [Extra Credit 8 points, Optional] Naive Bayes Classifier Using NLTK

In this problem you will be using the Naive Bayes classifier from the NLTK package. You can find detailed information about NLTK at http://www.nltk.org/

1 Installing NLTK

a. You can find instructions on how to install NLTK on different OSes at: http://www.nltk.org/install.html

b. Notice that if you are using Mac OS you may run into a situation where your numpy and pandas packages conflict. In this case you will need to uninstall pandas by typing (in a terminal): pip uninstall pandas, and then reinstall a different version by typing (also in a terminal): pip install pandas==0.13.1

c. The commands in step 2 may need administrator authorization to execute; you may need to use the sudo command.

2 Run the code

a. Switch to the folder NLTKNaiveBayes, which is included in the data package on CTools.

b. A template is named NLTKNB_template.py. There are three lines of code that need to be added near the end of the file. Your first task will be to fill in the missing code. After that, please rename the template file following the naming convention we use for homework programs.

c. After adding in the code, you can either run the file in any IDE, or type python on the command line to start Python first, then type execfile(<filename>) to run your code, where <filename> is the name of the file with your code added in.

d. After your code executes successfully, you will see the resulting accuracies on testing spam emails and ham (non-spam) emails, and a list of the 20 (can be changed to any other number less than the total number of features, 1448) most informative words.

3 What this code does

This code makes use of the NLTK functions and assembles a Naive Bayes classifier. It uses the same data set we used for problem 3 in homework 2 and problem 4 in homework 3. However, rather than binary matrices, the data used in this problem is a collection of text files which contain only the feature words from the token list. The text files are generated according to the feature matrices you downloaded from CTools, MATRIX.TRAIN and MATRIX.TEST.

4 What to turn in

a. The three lines of code you added and a screenshot of your program's OUTPUT. (3 points)

b. Comments on the performance of the code using NLTK. (You may compare it to your own code from problem 3 in homework 2.) (2 points)

c. An explanation of the term "most informative words" based on your output. (2 points)


Helpful commands:

1. Command-line commands: pip install <package name>, sudo pip install <package name>, pip uninstall <package name>, pip install <package name>==<version number>

2. Python commands: import, execfile
