# EECS445: Introduction to Machine Learning Homework #3 solution


## 1 [25 points] Direct Construction of Valid Kernels

In class, we saw that by choosing a kernel $K(x, z) = \phi(x)^T \phi(z)$, we can implicitly map data to a high-dimensional space and have the SVM algorithm work in that space. One way to generate kernels is to explicitly define the mapping $\phi$ to a higher-dimensional space and then work out the corresponding $K$. In this question, however, we are interested in the direct construction of kernels. That is, suppose we have a function $K(x, z)$ that we think gives an appropriate similarity measure for our learning problem, and we are considering plugging $K$ into a kernelized algorithm (like the SVM) as the kernel function. For $K(x, z)$ to be a valid kernel, it must correspond to an inner product in some higher-dimensional space resulting from some feature mapping $\phi$. Mercer's theorem tells us that $K(x, z)$ is a (Mercer) kernel if and only if, for any finite set $\{x^{(1)}, \ldots, x^{(N)}\}$, the matrix $K$ is symmetric and positive semidefinite, where the square matrix $K \in \mathbb{R}^{N \times N}$ is given by $K_{ij} = K(x^{(i)}, x^{(j)})$.
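As a concrete (numerical, not a proof) illustration of the Mercer condition, one can form the Gram matrix of a candidate kernel on a finite sample and check symmetry and positive semidefiniteness via its eigenvalues. A small NumPy sketch; the function names are ours:

```python
import numpy as np

def gram_matrix(K, xs):
    """Gram matrix with entries K_ij = K(x_i, x_j) for a finite sample xs."""
    N = len(xs)
    return np.array([[K(xs[i], xs[j]) for j in range(N)] for i in range(N)])

def passes_mercer_check(K, xs, tol=1e-9):
    """Necessary condition on this sample: the Gram matrix is symmetric
    and all its eigenvalues are >= -tol (positive semidefinite)."""
    G = gram_matrix(K, xs)
    return bool(np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol)
```

For example, the linear kernel $K(x, z) = x^T z$ passes this check on any sample, while $K(x, z) = -x^T z$ fails whenever the linear Gram matrix has a positive eigenvalue. Passing on one sample is of course necessary, not sufficient, for being a Mercer kernel.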
Now here comes the question: let $K_1, K_2$ be kernels over $\mathbb{R}^N \times \mathbb{R}^N$, let $a \in \mathbb{R}^+$ be a positive real number, let $f : \mathbb{R}^N \to \mathbb{R}$ be a real-valued function, let $\phi : \mathbb{R}^N \to \mathbb{R}^d$ be a function mapping from $\mathbb{R}^N$ to $\mathbb{R}^d$, let $K_3$ be a kernel over $\mathbb{R}^d \times \mathbb{R}^d$, and let $p : \mathbb{R} \to \mathbb{R}$ be a polynomial with positive coefficients.

For each of the functions $K$ below, state whether it is necessarily a kernel. If you think it is, prove it; if you think it isn't, give a counter-example.
(a) $K(x, z) = K_1(x, z) + K_2(x, z)$

(b) $K(x, z) = K_1(x, z) - K_2(x, z)$

(c) $K(x, z) = aK_1(x, z)$

(d) $K(x, z) = -aK_1(x, z)$

(e) $K(x, z) = K_1(x, z)K_2(x, z)$

(f) $K(x, z) = f(x)f(z)$

(g) $K(x, z) = K_3(\phi(x), \phi(z))$

(h) $K(x, z) = p(K_1(x, z))$
[Hint: For part (e), the answer is that the K there is indeed a kernel. You still have to prove it, though.
(This one may be harder than the rest.) This result may also be useful for another part of the problem.]
(i) [4 points extra credit] Prove that the Gaussian kernel, $K(x, z) = \exp\left(-\|x - z\|^2 / 2\sigma^2\right)$, can be expressed as $\phi(x)^T \phi(z)$, where $\phi(\cdot)$ is an infinite-dimensional vector.
[Hint: $\|x - z\|^2 = x^T x + z^T z - 2x^T z$. Consider using a power series.]
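The hint can be checked numerically: using $\|x - z\|^2 = x^T x + z^T z - 2x^T z$, the kernel factors as $e^{-\|x\|^2/2\sigma^2}\, e^{-\|z\|^2/2\sigma^2}\, e^{x^T z/\sigma^2}$, and truncating the power series of the last factor gives a finite-sum approximation that converges to the kernel. A NumPy sketch (the truncation order 30 is an arbitrary choice of ours):

```python
import math
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return float(np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2)))

def truncated_series_kernel(x, z, sigma=1.0, order=30):
    # exp(-||x-z||^2 / 2s^2) = exp(-x.x/2s^2) exp(-z.z/2s^2) exp(x.z/s^2),
    # and exp(x.z/s^2) = sum_k (x.z / s^2)^k / k!  (a power series).
    s2 = sigma ** 2
    series = sum((float(np.dot(x, z)) / s2) ** k / math.factorial(k)
                 for k in range(order))
    return float(np.exp(-np.dot(x, x) / (2 * s2))
                 * np.exp(-np.dot(z, z) / (2 * s2)) * series)
```

Each term of the series is itself an inner product of (polynomial) feature maps, which is what makes the full $\phi$ infinite-dimensional.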
## 2 [25 points] Kernelizing the Perceptron

Let there be a binary classification problem with $t \in \{0, 1\}$. The perceptron uses hypotheses of the form $y(x; w) = f(w^T x)$, where $f(z) = \mathbb{I}[z \ge 0]$. In this problem we will consider a stochastic gradient descent-like implementation of the perceptron algorithm, where each update to the parameters $w$ is made using only one training example. However, unlike stochastic gradient descent, the perceptron algorithm will only make one pass through the entire training set. The update rule for this version of the perceptron algorithm is given by

$$w^{(i+1)} := w^{(i)} + \alpha\left[t^{(i+1)} - y(x^{(i+1)}; w^{(i)})\right]x^{(i+1)}$$

where $w^{(i)}$ is the value of the parameters after the algorithm has seen the first $i$ training examples. Prior to seeing any training examples, $w^{(0)}$ is initialized to $0$.

Let $K$ be a Mercer kernel corresponding to some very high-dimensional feature mapping $\phi$. Suppose $\phi$ is so high-dimensional (say, infinite-dimensional) that it's infeasible to ever represent $\phi(x)$ explicitly. Describe how you would apply the "kernel trick" to the perceptron to make it work in the high-dimensional feature space $\phi$, but without ever explicitly computing $\phi(x)$. [Note: you don't have to worry about the intercept term. If you like, think of $\phi$ as having the property that $\phi_0(x) = 1$, so that this is taken care of.] Your description should specify:
(a) How you will (implicitly) represent the high-dimensional parameter vector $w^{(i)}$, including how the initial value $w^{(0)} = 0$ is represented (note that $w^{(i)}$ is now a vector whose dimension is the same as that of the feature vectors $\phi(x)$);

(b) How you will efficiently make a prediction on a new input $x^{(i+1)}$, i.e., how you will compute $y(x^{(i+1)}; w^{(i)}) = f(w^{(i)T} \phi(x^{(i+1)}))$ using your representation of $w^{(i)}$; and

(c) How you will modify the update rule given above to perform an update to $w$ on a new training example $(x^{(i+1)}, t^{(i+1)})$; i.e., using the update rule corresponding to the feature mapping $\phi$:

$$w^{(i+1)} := w^{(i)} + \alpha\left[t^{(i+1)} - y(\phi(x^{(i+1)}); w^{(i)})\right]\phi(x^{(i+1)})$$

[Note: if you prefer, you are also welcome to do this problem using the convention of labels $t \in \{-1, 1\}$ and $f(z) = \mathrm{sign}(z)$, defined as $1$ if $z \ge 0$ and $-1$ otherwise.]
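One standard way to realize the kernel trick here can be sketched in NumPy (this is our own illustration using the $t \in \{0, 1\}$ convention, not the required answer format): $w^{(i)}$ is never formed explicitly but kept as a weighted combination of the feature vectors of examples seen so far, so predictions need only kernel evaluations.

```python
import numpy as np

def kernel_perceptron(X, t, K, alpha=1.0):
    """One pass of the kernelized perceptron over (X, t), t in {0, 1}.
    w^(i) is represented implicitly as w^(i) = sum_j beta_j phi(x^(j))
    over examples seen so far; w^(0) = 0 corresponds to all beta_j = 0."""
    N = len(X)
    beta = np.zeros(N)
    for i in range(N):
        # prediction f(w . phi(x^(i))) = f(sum_j beta_j K(x^(j), x^(i)))
        y = 1 if sum(beta[j] * K(X[j], X[i]) for j in range(i)) >= 0 else 0
        # the update rule contributes a new coefficient alpha * (t^(i) - y)
        beta[i] = alpha * (t[i] - y)
    return beta

def kernel_predict(X, beta, K, x_new):
    """Predict on a new input using only kernel evaluations."""
    return 1 if sum(b * K(xj, x_new) for b, xj in zip(beta, X)) >= 0 else 0
```

With the linear kernel $K(x, z) = x^T z$ this reduces to the ordinary perceptron, which is a quick way to sanity-check the representation.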
## 3 [32 points] Implementing Soft Margin SVM by Optimizing the Primal Objective

The Support Vector Machine (SVM) is a discriminative model for classification. Although it is possible to develop SVMs that do $K$-class classification, we are going to restrict ourselves to binary classification in this question, where the class label is either $+1$ (positive) or $-1$ (negative). SVM is not a probabilistic algorithm; in other words, in its usual construction, it does not optimize a probability measure as a likelihood. SVM tries to find the "best" hyperplane that maximally¹ separates the positive class from the negative class.

¹ "Maximally" means maximizing the geometric margin. Review the lecture slides if you need a refresher. In addition, the construction of SVMs will be reviewed in this Friday's discussion section.
Recall that the objective function for maximizing the soft margin is equivalent to the following minimization problem:

$$\min_{w,b} \;\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \xi_i$$

$$\text{subject to} \quad t^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \quad i = 1, \ldots, N$$

$$\xi_i \ge 0, \quad i = 1, \ldots, N$$

The above is known as the Primal Objective of the SVM². Notice the two constraints on $\xi_i$: each $\xi_i$ must satisfy both conditions while the sum of the $\xi_i$'s times $C$ is minimized. The constrained minimization is equivalent to the following unconstrained minimization:

$$\min_{w,b} \;\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \max\left(0,\; 1 - t^{(i)}(w^T x^{(i)} + b)\right)$$
You will be working with the above minimization in this problem. However, think to yourself about why combining the minimization of the $\xi_i$'s with the two constraints produced the max shown above. To shorten notation, let's define our loss function $E(w, b)$ as:

$$E(w, b) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \max\left(0,\; 1 - t^{(i)}(w^T x^{(i)} + b)\right)$$
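For concreteness, $E(w, b)$ can be computed directly from this definition; a small NumPy sketch (the function name and interface are ours):

```python
import numpy as np

def svm_loss(w, b, X, t, C):
    """E(w, b) = 0.5 * ||w||^2 + C * sum_i max(0, 1 - t_i (w . x_i + b))."""
    margins = t * (X @ w + b)
    return 0.5 * float(w @ w) + C * float(np.maximum(0.0, 1.0 - margins).sum())
```

For example, with $w = (1, 0)$, $b = 0$, and $C = 1$, the point $x = (2, 0)$ with $t = +1$ has margin $2$ and contributes no hinge penalty, while $x = (0, 1)$ with $t = -1$ sits on the decision boundary and contributes a hinge term of $1$, so the loss is $0.5 + 1 = 1.5$.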
(a) [6 points] Find the derivatives of the loss function with respect to our parameters. Show that:

$$\nabla_w E(w, b) = w - C\sum_{i=1}^{N} \mathbb{I}\left[t^{(i)}(w^T x^{(i)} + b) < 1\right] t^{(i)} x^{(i)}$$

$$\frac{\partial}{\partial b} E(w, b) = -C\sum_{i=1}^{N} \mathbb{I}\left[t^{(i)}(w^T x^{(i)} + b) < 1\right] t^{(i)}$$

where $\mathbb{I}[\cdot]$ is the indicator function.
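These expressions can be sanity-checked numerically against finite differences of $E(w, b)$. A NumPy sketch (names ours; the check assumes no example sits exactly on the margin, where the hinge is non-differentiable):

```python
import numpy as np

def svm_grads(w, b, X, t, C):
    """Analytic gradients matching the expressions above."""
    active = (t * (X @ w + b) < 1).astype(float)   # I[t_i (w.x_i + b) < 1]
    grad_w = w - C * (X.T @ (active * t))
    grad_b = -C * float(np.sum(active * t))
    return grad_w, grad_b
```

A typical usage is to compare `grad_w` and `grad_b` against central differences $(E(\theta + \epsilon) - E(\theta - \epsilon)) / 2\epsilon$ at a point whose margins are all safely away from $1$.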
(b) [10 points] Implement the SVM algorithm using batch gradient descent. In previous assignments, you have implemented gradient descent while minimizing over one variable. Minimizing over two variables $(w, b)$ is not different: both gradients are computed from the current parameter values, and then the parameters are simultaneously updated. Note: it is a common mistake to implement gradient descent over multiple parameters by updating the first parameter, then computing the derivative with respect to the second parameter using the updated value of the first parameter³. The pseudo-code for optimizing $w$ and $b$ is given by:

Algorithm 1: SVM Batch Gradient Descent

    w* ← 0
    b* ← 0
    for j = 1 to NumIterations do
        wgrad ← ∇w E(w*, b*)
        bgrad ← ∂b E(w*, b*)
        w* ← w* − α(j) · wgrad
        b* ← b* − 0.01 · α(j) · bgrad
    end
    return w*, b*
² In EECS445, we don't dive deep into the Dual Objective of the SVM. The Dual formulation is covered in depth in EECS545.

³ In fact, updating one parameter and then computing the gradient of the second parameter using the updated value of the first parameter is a different optimization algorithm, known as Coordinate Descent.
The learning rate $\alpha(\cdot)$ is now a function of time (iteration number) rather than a constant. This allows us to define a decaying learning rate as:

$$\alpha(j) = \frac{\eta_0}{1 + j\,\eta_0}$$

Throughout this question, set $\eta_0 = 0.5$ and the slack cost $C = 5$. Loading the file q3.mat loads the variables q3x_train, q3t_train, q3x_test, and q3t_test. Run your implementation of SVM Batch Gradient Descent over the training data 6 times, once for each NumIterations ∈ {5, 50, 100, 1000, 5000, 6000}. For each run, report your trained parameters $(w, b)$ and the test classification accuracy.
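The training loop above can be sketched in NumPy (a sketch under our own interface, with the gradients written inline; the assignment itself expects a MATLAB implementation over q3.mat, which is omitted here):

```python
import numpy as np

def svm_batch_gd(X, t, C=5.0, eta0=0.5, num_iters=1000):
    """Batch gradient descent on E(w, b) with the decaying learning rate
    alpha(j) = eta0 / (1 + j * eta0) and the extra 0.01 factor on the
    b-update from the pseudo-code above."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for j in range(1, num_iters + 1):
        lr = eta0 / (1 + j * eta0)
        active = (t * (X @ w + b) < 1).astype(float)
        wgrad = w - C * (X.T @ (active * t))
        bgrad = -C * float(np.sum(active * t))
        # simultaneous update: both gradients use the same (w, b)
        w = w - lr * wgrad
        b = b - 0.01 * lr * bgrad
    return w, b
```

Note that both gradients are evaluated before either parameter is touched, which is exactly the simultaneity the text warns about.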
(c) [4 points] In Stochastic Gradient Descent, we compute the gradients and update our parameters for every training example (rather than batch updating). To do stochastic gradient descent properly, we need to define a loss function per example (the loss function $E(w, b)$ above is defined over the entire training set). We can rewrite the loss function above as:

$$E(w, b) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \max\left(0,\; 1 - t^{(i)}(w^T x^{(i)} + b)\right) = \sum_{i=1}^{N}\left[\frac{1}{2N}\|w\|^2 + C\max\left(0,\; 1 - t^{(i)}(w^T x^{(i)} + b)\right)\right]$$

Expressing the error function as one summation allows us to define an error term per training example:

$$E^{(i)}(w, b) = \frac{1}{2N}\|w\|^2 + C\max\left(0,\; 1 - t^{(i)}(w^T x^{(i)} + b)\right), \qquad \text{so that} \quad E(w, b) = \sum_{i=1}^{N} E^{(i)}(w, b)$$

Find the expressions for $\nabla_w E^{(i)}(w, b)$ and $\frac{\partial}{\partial b} E^{(i)}(w, b)$.
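As a sketch of the shape the answer should take (differentiating the $\frac{1}{2N}\|w\|^2$ term and the hinge term of $E^{(i)}$ separately, mirroring part (a)):

$$\nabla_w E^{(i)}(w, b) = \frac{w}{N} - C\,\mathbb{I}\left[t^{(i)}(w^T x^{(i)} + b) < 1\right] t^{(i)} x^{(i)}$$

$$\frac{\partial}{\partial b} E^{(i)}(w, b) = -C\,\mathbb{I}\left[t^{(i)}(w^T x^{(i)} + b) < 1\right] t^{(i)}$$

Summing these over $i$ recovers the full-batch gradients from part (a).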
(d) [8 points] Implement the SVM algorithm using stochastic gradient descent (SGD). The pseudo-code looks like:

Algorithm 2: SVM Stochastic Gradient Descent

    w* ← 0
    b* ← 0
    for j = 1 to NumIterations do
        for i = 1 to N do
            wgrad ← ∇w E^(i)(w*, b*)
            bgrad ← ∂b E^(i)(w*, b*)
            w* ← w* − α(j) · wgrad
            b* ← b* − 0.01 · α(j) · bgrad
        end
    end
    return w*, b*

Use the same $\alpha(\cdot)$, $\eta_0$, and $C$ as in part (b). Run your implementation of SVM Stochastic Gradient Descent over the training data 6 times, once for each NumIterations ∈ {5, 50, 100, 1000, 5000, 6000}. For each run, report your trained parameters $(w, b)$ and the test classification accuracy.
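A NumPy sketch of this inner-loop variant, under the same assumed interface as the batch sketch (again, the assignment itself expects MATLAB):

```python
import numpy as np

def svm_sgd(X, t, C=5.0, eta0=0.5, num_iters=100):
    """SGD on the per-example losses E^(i)(w, b): the i-th step uses
    only example i's contribution to the gradient, with the 1/(2N)
    regularizer share and the same decaying alpha(j) as part (b)."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for j in range(1, num_iters + 1):
        lr = eta0 / (1 + j * eta0)
        for i in range(N):
            active = 1.0 if t[i] * (X[i] @ w + b) < 1 else 0.0
            wgrad = w / N - C * active * t[i] * X[i]
            bgrad = -C * active * t[i]
            w = w - lr * wgrad
            b = b - 0.01 * lr * bgrad
    return w, b
```

Note the contrast with Algorithm 1: here the parameters change after every example, so each of the $N$ inner steps sees slightly different $(w, b)$.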
(e) [4 points] What can you conclude about the convergence rate of Stochastic Gradient Descent (SGD) versus batch gradient descent? How did you reach this conclusion?
## 4 [18 points] Using LIBLINEAR for SVM Classification

In this problem, we will use an SVM to build a SPAM classifier. You have seen this dataset in the previous homework! Rather than building your implementation from scratch, you will use LIBLINEAR⁴.
In recent years, spam on electronic newsgroups has been an increasing problem. Here, we'll build a classifier to distinguish between "real" newsgroup messages and spam messages. For this experiment, we obtained a set of spam emails and a set of genuine newsgroup messages. Using only the subject line and body of each message, we'll learn to distinguish between spam and non-spam.

The problem files live inside the folder q4 in the data for this assignment. You can look at two sample spam emails in the files spam_sample_original*, and their preprocessed forms in the files spam_sample_preprocessed*. The first line in the preprocessed format is just the label and is not part of the message. The preprocessing ensures that only the message body and subject remain in the dataset; email addresses (EMAILADDR), web addresses (HTTPADDR), currency (DOLLAR), and numbers (NUMBER) were also replaced by special tokens so that they can be considered properly in the classification process. (In this problem, we're going to call the features "tokens" rather than "terms," since some of the features will correspond to special values like EMAILADDR. You don't have to worry about the distinction.) The files news_sample_original and news_sample_preprocessed also give an example of a non-spam email.
The work to extract feature vectors out of the documents has also been done for you, so you can just load in the design matrices (called document-term matrices in text classification) containing all the data. In a document-term matrix, the i-th row represents the i-th document/email, and the j-th column represents the j-th distinct token. Thus, the (i, j)-entry of this matrix represents the number of occurrences of the j-th token in the i-th document.
For this problem, we've chosen as our set of tokens considered (that is, as our vocabulary) only the medium-frequency tokens. The intuition is that tokens that occur too often or too rarely do not have much classification value. (Example tokens that occur very often are terms like "the," "and," and "of," which occur in so many emails and are sufficiently content-free that they aren't worth modeling.) Also, terms were stemmed using a standard stemming algorithm; basically, this means that "price," "prices," and "priced" have all been replaced with "price," so that they can be treated as the same term. For a list of the tokens used, see the file TOKENS_LIST.

Since the document-term matrix is extremely sparse (has lots of zero entries), we have stored it in our own efficient format to save space. Similar to the previous homework, the function full() is used to convert the sparse matrix into a normal MATLAB matrix. This is already done for you! Modify the indicated sections of svm_train.m and svm_test.m.
(a) [15 points] Train an SVM on this dataset using the LIBLINEAR SVM library, available for download from http://www.csie.ntu.edu.tw/~cjlin/liblinear/. This implements an SVM using a linear kernel. Like the Naive Bayes implementation, an outline for your code is provided in svm_train.m and svm_test.m.

See README.txt for instructions on downloading and installing LIBLINEAR. Train an SVM with training set sizes 50, 100, 200, ..., 1400 by using the file MATRIX.TRAIN. E.g., to use a training set of size 50, use the first 50 points from MATRIX.TRAIN. Plot the test error each time, using MATRIX.TEST as the test data. Use the LIBLINEAR default options when training and testing. You don't need to try different parameter values.

HINT: Running LIBLINEAR in MATLAB on Windows can be buggy, depending on which version of Windows you run. However, there are command-line programs (train and predict) that you can run without using MATLAB, located in liblinear-1.94/windows for Windows and liblinear-1.94/ for Linux/Unix. If you do it this way, please include the commands that you run from the command line in your solution. The manual for usage of the command-line programs is available from http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf

Note that you must still convert the data into SVM form (using −1, 1 instead of 0, 1).

⁴ Credit: question adapted from http://www.stanford.edu/class/cs229/ps/ps2.pdf
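The label conversion mentioned above can be scripted. Here is a small Python sketch (the function name and the tiny dense matrix are ours, for illustration) that maps 0/1 labels to −1/+1 and writes a dense document-term matrix in the sparse `label index:value` text format that the LIBLINEAR command-line tools read (1-based feature indices, zeros omitted):

```python
def to_liblinear_lines(X, labels01):
    """Convert a dense document-term matrix and 0/1 labels into lines of
    LIBLINEAR's sparse 'label index:value' text format."""
    lines = []
    for row, y in zip(X, labels01):
        label = 1 if y == 1 else -1                       # 0/1 -> -1/+1
        feats = " ".join(f"{j + 1}:{v}"                   # 1-based indices
                         for j, v in enumerate(row) if v != 0)
        lines.append(f"{label} {feats}".strip())
    return lines
```

Writing these lines to a file then gives input suitable for the train and predict programs.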
(b) [3 points] How do Naive Bayes (from the previous assignment) and Support Vector Machines compare (in terms of generalization error) as a function of the training set size?
## 5 [Extra Credit 8 points, Optional] Naive Bayes Classifier Using NLTK

In this problem you will be using the Naive Bayes classifier from the NLTK package. You can find detailed information about NLTK at http://www.nltk.org/
### 1 Installing NLTK

a. You can find instructions on how to install NLTK on different operating systems at http://www.nltk.org/install.html

b. Notice that if you are using Mac OS, you may run into a situation where your numpy and pandas packages conflict. In this case you will need to uninstall pandas by typing (in a terminal) pip uninstall pandas and then reinstall a different version by typing (also in a terminal) pip install pandas==0.13.1

c. The commands in step 2 may need administrator authorization to execute; you may need to use the sudo command.
### 2 Run the code

a. Switch directory to the folder NLTKNaiveBayes, which is included in the data package on CTools.

b. A template is named NLTKNB_template.py. There are three lines of code that need to be added near the end of the file. Your first task will be to fill in the missing code. After that, please rename the template file following the naming convention we use for homework programs.

c. After adding in the code, you can either run the file in any IDE, or type python on the command line to start Python first and then type execfile(<filename>) to run your code, where <filename> is the name of the file with your code added in.

d. After your code executes successfully, you will see the resulting accuracies on testing spam emails and ham (non-spam) emails, and a list of the 20 most informative words (this can be changed to any other number less than the total number of features, 1448).
### 3 What this code does

This code makes use of NLTK functions to assemble a Naive Bayes classifier. It uses the same dataset we used for problem 3 in homework 2 and problem 4 in homework 3. However, rather than binary matrices, the data used in this problem is a collection of text files which contain only the feature words from the token list. The text files are generated according to the feature matrices you downloaded from CTools, MATRIX.TRAIN and MATRIX.TEST.
### 4 What to turn in

a. The three lines of code you added and a screenshot of your program's OUTPUT. (3 points)

b. Comment on the performance of the code using NLTK. (You may compare it to your own code from problem 3 in homework 2.) (2 points)

c. Explain the term "most informative words" based on your output. (2 points)