EECS445: Introduction to Machine Learning Homework #2 solution


1 [30 points] Gaussian Discriminant Analysis
Suppose we are given a dataset $\{(x^{(i)}, t^{(i)})\,;\, i = 1, \ldots, N\}$ consisting of $N$ independent examples, where the $x^{(i)} \in \mathbb{R}^M$ are $M$-dimensional vectors and $t^{(i)} \in \{0, 1\}$. We will model the joint distribution of $(x, t)$ according to:
\[
p(t) = \phi^{t} (1 - \phi)^{1 - t}
\]
\[
p(x \mid t = 0) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)
\]
\[
p(x \mid t = 1) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)
\]
Here, the parameters of our model are $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$. (Note that while there are two different mean vectors $\mu_0$ and $\mu_1$, there is only one covariance matrix $\Sigma$.)
(a) [8 points] Suppose we have already fit $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$, and now want to make a prediction at some new query point $x$. Show that the posterior distribution of the label at $x$ takes the form of a logistic function, and can be written as
\[
p(t = 1 \mid x; \phi, \Sigma, \mu_0, \mu_1) = \frac{1}{1 + \exp(-w^T x)}
\]
where $w$ is some appropriate function of $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$. (Note: To get your answer into the form above, for this part of the problem only, you may have to redefine the $x^{(i)}$'s to be $(M+1)$-dimensional vectors by adding the extra coordinate $x_0 = 1$, like we did in class.)
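For orientation only, here is a sketch of the standard Bayes'-rule calculation; filling in the algebra is exactly what this part asks for, and the explicit split of the linear term into a weight part and a constant part below is an illustration, not part of the problem statement:
% Sketch only: apply Bayes' rule and simplify the exponents of the two Gaussians.
\[
p(t = 1 \mid x)
  = \frac{p(x \mid t = 1)\,\phi}{p(x \mid t = 1)\,\phi + p(x \mid t = 0)(1 - \phi)}
  = \frac{1}{1 + \exp\!\big(-a(x)\big)},
\]
\[
a(x) = (\mu_1 - \mu_0)^T \Sigma^{-1} x
       \;-\; \tfrac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0\right)
       \;+\; \log\frac{\phi}{1 - \phi}.
\]
With the augmented coordinate $x_0 = 1$, $a(x)$ can be written as $w^T x$ for a single vector $w$.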
(b) [17 points] For this part of the problem only, you may assume $M$ (the dimension of $x$) is 1, so that $\Sigma = [\sigma^2]$ is just a real number, and likewise the determinant of $\Sigma$ is given by $|\Sigma| = \sigma^2$. Given the dataset, we claim that the maximum likelihood estimates of the parameters are given by
\[
\phi = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{t^{(i)} = 1\}
\]
\[
\mu_0 = \frac{\sum_{i=1}^{N} \mathbf{1}\{t^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbf{1}\{t^{(i)} = 0\}}
\]
\[
\mu_1 = \frac{\sum_{i=1}^{N} \mathbf{1}\{t^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbf{1}\{t^{(i)} = 1\}}
\]
\[
\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \mu_{t^{(i)}})(x^{(i)} - \mu_{t^{(i)}})^T .
\]
The log-likelihood of the data is
\[
\ell(\phi, \mu_0, \mu_1, \Sigma)
  = \log \prod_{i=1}^{N} p(x^{(i)}, t^{(i)}; \phi, \mu_0, \mu_1, \Sigma)
  = \log \prod_{i=1}^{N} p(x^{(i)} \mid t^{(i)}; \phi, \mu_0, \mu_1, \Sigma)\, p(t^{(i)}; \phi)
\]
By maximizing $\ell$ with respect to the four parameters, prove that the maximum likelihood estimates of $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$ are indeed as given in the formulas above. (You may assume that there is at least one positive and one negative example, so that the denominators in the definitions of $\mu_0$ and $\mu_1$ above are non-zero.)
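As a structural hint only (the full proof is what this part asks for), here is the $\phi$ piece of the maximization sketched; the terms involving $\mu_0$, $\mu_1$, and $\Sigma$ follow the same differentiate-and-set-to-zero pattern:
% Only the p(t; phi) factors of the log-likelihood depend on phi, so the Gaussian terms drop out.
\[
\frac{\partial \ell}{\partial \phi}
  = \frac{\partial}{\partial \phi} \sum_{i=1}^{N} \left[ t^{(i)} \log \phi + (1 - t^{(i)}) \log(1 - \phi) \right]
  = \frac{\sum_{i} t^{(i)}}{\phi} - \frac{N - \sum_{i} t^{(i)}}{1 - \phi} = 0
  \quad\Longrightarrow\quad
  \phi = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{t^{(i)} = 1\}.
\]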
(c) [7 extra credit points] Redo part (b) without assuming M = 1.
(d) [5 points] Using the maximum likelihood estimates of the parameters given in part (b), compute each parameter's value. Load the file q1_data.mat, which contains the variables q1x_train, q1y_train, q1x_test, q1y_test. First, use the training data to estimate the parameter values, and report your estimates. Then, use your estimates and your answer in (a) to classify the test data, and report your test accuracy.
Helpful Matlab commands: sum, mean, repmat.
Matlab tip: given a matrix X of dimension $n \times m$ and a vector Y of dimension $n \times 1$, X(Y==1,:) returns the rows of X whose corresponding entry in Y equals 1.
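A minimal Matlab sketch of the workflow in part (d), assuming q1x_train is N x M, q1y_train is N x 1 with labels in {0, 1}, and the split of the linear term into a weight vector w and a bias w0 as in part (a); it is an illustration, not the required solution:
% Minimal sketch (not the official solution): GDA parameter estimation and classification.
load('q1_data.mat');
N   = size(q1x_train, 1);
phi = mean(q1y_train == 1);
mu0 = mean(q1x_train(q1y_train == 0, :), 1)';   % M x 1 mean of class 0
mu1 = mean(q1x_train(q1y_train == 1, :), 1)';   % M x 1 mean of class 1
% Shared covariance: each example is centered at the mean of its own class.
centered = q1x_train;
centered(q1y_train == 0, :) = centered(q1y_train == 0, :) - repmat(mu0', sum(q1y_train == 0), 1);
centered(q1y_train == 1, :) = centered(q1y_train == 1, :) - repmat(mu1', sum(q1y_train == 1), 1);
Sigma = (centered' * centered) / N;
% Classify test points with the logistic form from part (a): p(t=1|x) = 1 / (1 + exp(-(w'x + w0))).
w  = Sigma \ (mu1 - mu0);
w0 = -0.5 * (mu1' * (Sigma \ mu1) - mu0' * (Sigma \ mu0)) + log(phi / (1 - phi));
p1 = 1 ./ (1 + exp(-(q1x_test * w + w0)));
pred = double(p1 > 0.5);
fprintf('Test accuracy: %.4f\n', mean(pred == q1y_test));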
2 [35 points] Softmax Regression via Gradient Ascent
Gradient Ascent is an algorithm used to find parameters that maximize a given expression (in contrast to Gradient Descent, which is used to minimize one). For some function $f(w)$, Gradient Ascent finds $w^* = \arg\max_w f(w)$ according to the following pseudo-code:
Algorithm 1: Gradient Ascent
$w^* \leftarrow$ random;
repeat
    $w^* \leftarrow w^* + \alpha \nabla_w f(w^*)$;
until convergence;
return $w^*$
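To make the loop concrete, here is a tiny Matlab illustration on a made-up concave function; the function, learning rate, and iteration count are chosen only for this sketch:
% Toy example of the gradient ascent loop: maximize f(w) = -(w - 3)^2,
% whose gradient is -2*(w - 3); the maximizer is w = 3.
alpha = 0.1;               % learning rate
w = randn();               % random initialization
for iter = 1:200           % fixed iteration count standing in for "until convergence"
    grad = -2 * (w - 3);   % gradient of f at the current w
    w = w + alpha * grad;  % ascent step
end
fprintf('w converged to %.4f (optimum is 3)\n', w);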
Further, Softmax regression is a multiclass classification algorithm. Given a dataset $D = \{(x^{(1)}, t^{(1)}), \ldots, (x^{(N)}, t^{(N)})\}$, where $t^{(i)}$ can be one of $\{1, 2, \ldots, K\}$ (a $K$-class classification setting), Softmax regression computes the probability that example $x$ belongs to class $k$ as:
\[
p(t = k \mid x, w) = \frac{\exp\!\left(w_k^T \phi(x)\right)}{\sum_{j=1}^{K} \exp\!\left(w_j^T \phi(x)\right)}
\]
where $w_k$ is a weight vector for class $k$. The above expression is over-parametrized, meaning that there is more than one set of weights $\{w_1, w_2, \ldots, w_K\}$ that gives identical probability measures for $p(t \mid x; w)$. It is possible (and common) to obtain a unique solution by using only $K-1$ class weight vectors $w = \{w_1, w_2, \ldots, w_{K-1}\}$ and fixing $w_K = 0$:
\[
p(t = k \mid x, w) = \frac{\exp\!\left(w_k^T \phi(x)\right)}{1 + \sum_{j=1}^{K-1} \exp\!\left(w_j^T \phi(x)\right)}, \quad \text{for } k = 1, 2, \ldots, K-1
\]
\[
p(t = K \mid x, w) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp\!\left(w_j^T \phi(x)\right)}
\]
We define the likelihood of the $i$-th training example, $p(t^{(i)} \mid x^{(i)}; w)$, as:
\[
p(t^{(i)} \mid x^{(i)}; w) = \prod_{k=1}^{K} \left[ p(t^{(i)} = k \mid x^{(i)}; w) \right]^{I(t^{(i)} = k)}
\]
where $I(\cdot)$ is the indicator function. We define the likelihood (probability over the training data) as:
\[
L(w) = \prod_{i=1}^{N} p(t^{(i)} \mid x^{(i)}; w)
     = \prod_{i=1}^{N} \prod_{k=1}^{K} \left[ p(t^{(i)} = k \mid x^{(i)}; w) \right]^{I(t^{(i)} = k)}
\]
Finally, we define the log-likelihood as:
\[
l(w) = \log L(w)
     = \log \prod_{i=1}^{N} \prod_{k=1}^{K} \left[ p(t^{(i)} = k \mid x^{(i)}; w) \right]^{I(t^{(i)} = k)}
     = \sum_{i=1}^{N} \sum_{k=1}^{K} \log \left[ p(t^{(i)} = k \mid x^{(i)}; w) \right]^{I(t^{(i)} = k)}
\]
(a) [18 points] Derive the gradient ascent update rule over the log-likelihood of the training data. In other words, derive the expression for $\nabla_{w_m} l(w)$ for $m = 1, \ldots, K-1$. Show that:
\[
\nabla_{w_m} l(w)
  = \sum_{i=1}^{N} \phi(x^{(i)})
    \left[ I(t^{(i)} = m) - \frac{\exp\!\left(w_m^T \phi(x^{(i)})\right)}{1 + \sum_{j=1}^{K-1} \exp\!\left(w_j^T \phi(x^{(i)})\right)} \right]
  = \sum_{i=1}^{N} \phi(x^{(i)}) \left[ I(t^{(i)} = m) - p(t^{(i)} = m \mid x^{(i)}) \right]
\]
[Hints: $\log a^b = b \log(a)$. Further, it helps to consider the two cases separately: one case for $t^{(i)} = k = m$, and another for $t^{(i)} = k \neq m$, which is equivalent to using the Kronecker delta $\delta_{km}$ (see https://en.wikipedia.org/wiki/Kronecker_delta).]
(b) [17 points] Using the gradient computed in part (a), implement Gradient Ascent for Softmax Regression in Matlab. Use a learning rate $\alpha = 0.0005$. Load the file q2_data.mat, which contains the variables q2x_train, q2t_train, q2x_test, q2t_test. Train your classifier on the training data and report the accuracy on the test data. Recall that Softmax Regression classifies an example $x$ as:
\[
y = \arg\max_{y'} p(y' \mid x; w)
\]
While you must write your own Softmax Regression implementation, feel free to double-check your accuracy by making use of Matlab's mnrfit(). If your accuracy is much less than the accuracy produced by Matlab's implementation, that means you did something wrong! In theory, you should get accuracy identical to Matlab's implementation (if you have a good stopping criterion, or run a large number of iterations), as this log-likelihood function is concave. Another powerful tool we recommend for checking your answer is LIBLINEAR, which is popular and widely used. You can find instructions and the installation package here: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
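A minimal Matlab sketch of one way to implement this, assuming $\phi(x)$ is simply $x$ with a constant 1 appended, that the labels in q2t_train take values in $\{1, \ldots, K\}$, and that a fixed iteration count stands in for a proper convergence test; it is an illustration, not the reference implementation:
% Sketch only: softmax regression trained by batch gradient ascent.
% Class K is pinned to the zero weight vector, as in the problem statement.
load('q2_data.mat');
alpha = 0.0005;
X = [q2x_train, ones(size(q2x_train, 1), 1)];    % phi(x): append a bias term
t = q2t_train;
K = max(t);
[N, D] = size(X);
W = zeros(D, K - 1);                              % one weight vector per class 1..K-1
for iter = 1:2000                                 % crude stand-in for a convergence test
    scores = [X * W, zeros(N, 1)];                % class K has fixed score 0
    P = exp(scores) ./ repmat(sum(exp(scores), 2), 1, K);    % N x K posteriors
    T = double(repmat(t, 1, K - 1) == repmat(1:K-1, N, 1));  % indicators I(t_i = m)
    W = W + alpha * X' * (T - P(:, 1:K-1));       % gradient from part (a)
end
% Classify the test set: the class with the largest score has the largest probability.
Xtest = [q2x_test, ones(size(q2x_test, 1), 1)];
scoresTest = [Xtest * W, zeros(size(Xtest, 1), 1)];
[~, pred] = max(scoresTest, [], 2);
fprintf('Test accuracy: %.4f\n', mean(pred == q2t_test));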
3 [35 points] Naive Bayes for classifying SPAM
In this problem, we will use the naive Bayes algorithm to build a SPAM classifier. (Credit: this question is adapted from http://www.stanford.edu/class/cs229/ps/ps2.pdf.)
In recent years, spam on electronic newsgroups has been an increasing problem. Here, we'll build a classifier to distinguish between "real" newsgroup messages and spam messages. For this experiment, we obtained a set of spam emails and a set of genuine newsgroup messages. Using only the subject line and body of each message, we'll learn to distinguish between spam and non-spam.
All the files for this problem are available on CTools. The text emails have been preprocessed to get them into a form usable by naive Bayes. You can look at two sample spam emails in the files spam_sample_original*, and their preprocessed forms in the files spam_sample_preprocessed*. The first line in the preprocessed format is just the label and is not part of the message. The preprocessing ensures that only the message body and subject remain in the dataset; email addresses (EMAILADDR), web addresses (HTTPADDR), currency (DOLLAR), and numbers (NUMBER) were replaced by special tokens so that they can be considered properly in the classification process. (In this problem, we're going to call the features "tokens" rather than "terms," since some of the features will correspond to special values like EMAILADDR. You don't have to worry about the distinction.) The files news_sample_original and news_sample_preprocessed also give an example of a non-spam email.
The work to extract feature vectors out of the documents has also been done for you, so you can just load in the design matrices (called document-term matrices in text classification) containing all the data. In a document-term matrix, the $i$-th row represents the $i$-th document/email, and the $j$-th column represents the $j$-th distinct token. Thus, the $(i, j)$-entry of this matrix is the number of occurrences of the $j$-th token in the $i$-th document.
For this problem, we've chosen as our set of tokens considered (that is, as our vocabulary) only the medium-frequency tokens. The intuition is that tokens that occur too often or too rarely do not have much classification value. (Examples of tokens that occur very often are terms like "the," "and," and "of," which occur in so many emails and are sufficiently content-free that they aren't worth modeling.) Also, terms were stemmed using a standard stemming algorithm; basically, this means that "price," "prices" and "priced" have all been replaced with "price," so that they can be treated as the same term. For a list of the tokens used, see the file TOKENS_LIST.
Since the document-term matrix is extremely sparse (has lots of zero entries), we have stored it in our own efficient format to save space. You don't have to worry about this format. The file readMatrix.m
provides the readMatrix function that reads in the document-term matrix and the correct class labels for the various documents. Code in nb_train.m and nb_test.m shows how readMatrix should be called. The documentation at the top of these two files will tell you all you need to know about the setup.
(a) [15 points] Implement a naive Bayes classifier for spam classification, using the multinomial event model and Laplace smoothing. You should use the code outline provided in nb_train.m to train your parameters, and then use these parameters to classify the test set data by filling in the code in nb_test.m. You may assume that any parameters computed in nb_train.m are in memory when nb_test.m is executed, and do not need to be recomputed (i.e., that nb_test.m is executed immediately after nb_train.m). Train your parameters using the document-term matrix in MATRIX.TRAIN, and then report the test set error on MATRIX.TEST.
Remark. If you implement naive Bayes in the straightforward way, you'll note that the computed $p(x \mid t) = \prod_i p(x_i \mid t)$ often equals zero. This is because $p(x \mid t)$, which is the product of many numbers less than one, is a very small number. The standard computer representation of real numbers cannot handle numbers that are too small, and instead rounds them off to zero. (This is called "underflow.") You'll have to find a way to compute naive Bayes' predicted class labels without explicitly representing very small numbers such as $p(x \mid t)$. [Hint: Think about using logarithms.]
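As a rough Matlab sketch of the log-space idea under the multinomial event model with Laplace smoothing; the variable names trainMatrix, trainLabels, and testMatrix are placeholders, not the names used in the provided nb_train.m/nb_test.m skeleton:
% Sketch of the underflow fix: score each class with log-probabilities.
% Assumes trainMatrix (n x V token counts), trainLabels (n x 1, 0/1), testMatrix (m x V).
V = size(trainMatrix, 2);
spamCounts    = sum(trainMatrix(trainLabels == 1, :), 1);   % 1 x V token counts in spam
nonspamCounts = sum(trainMatrix(trainLabels == 0, :), 1);   % 1 x V token counts in non-spam
logPhiSpam    = log((spamCounts    + 1) / (sum(spamCounts)    + V));   % Laplace smoothing
logPhiNonspam = log((nonspamCounts + 1) / (sum(nonspamCounts) + V));
logPriorSpam    = log(mean(trainLabels == 1));
logPriorNonspam = log(mean(trainLabels == 0));
% log p(x, t) = sum_j x_j * log p(token_j | t) + log p(t); compare the two classes.
logPostSpam    = testMatrix * logPhiSpam'    + logPriorSpam;
logPostNonspam = testMatrix * logPhiNonspam' + logPriorNonspam;
predictions = double(logPostSpam > logPostNonspam);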
(b) [10 points] Intuitively, some tokens may be particularly indicative of an email being in a particular class. We can try to get an informal sense of how indicative token $i$ is for the SPAM class by looking at:
\[
\log \frac{p(x_j = i \mid t = 1)}{p(x_j = i \mid t = 0)}
  = \log \left( \frac{p(\text{token}_i \mid \text{email is SPAM})}{p(\text{token}_i \mid \text{email is NOTSPAM})} \right)
\]
Using the parameters fit in part (a), find the 5 tokens that are most indicative of the SPAM class (i.e., have the highest positive value on the measure above). The numbered list of tokens in the file TOKENS_LIST should be useful for identifying the terms/tokens.
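Continuing the placeholder names from the previous sketch, and assuming a cell array tokens holding the vocabulary read from TOKENS_LIST, the ranking itself might look like:
% Sketch: rank tokens by log( p(token|SPAM) / p(token|NOTSPAM) ) and print the top 5.
indicativeness = logPhiSpam - logPhiNonspam;       % 1 x V vector of log-ratios
[~, order] = sort(indicativeness, 'descend');
for k = 1:5
    fprintf('%d: %s\n', k, tokens{order(k)});      % tokens is assumed to hold TOKENS_LIST
end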
(c) [10 points] Repeat part (a), but with training sets of sizes ranging from 50, 100, 200, . . . , up to 1400, by using the files MATRIX.TRAIN.*. Plot the test error each time (use MATRIX.TEST as the test data) to obtain a learning curve (test set error vs. training set size). You may need to change the call to readMatrix in nb_train.m to read the correct file each time. Which training-set size gives the best test set error?
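A sketch of the outer loop only, where trainAndTest is a hypothetical helper wrapping your code from part (a) (train on the given file, return the MATRIX.TEST error), and the list of sizes should be adjusted to match whatever MATRIX.TRAIN.* files are actually provided:
% Sketch of the learning-curve loop; trainAndTest is a hypothetical wrapper around part (a).
sizes = [50 100 200 400 800 1400];                 % adjust to the MATRIX.TRAIN.* files provided
testError = zeros(size(sizes));
for s = 1:length(sizes)
    trainFile = sprintf('MATRIX.TRAIN.%d', sizes(s));
    testError(s) = trainAndTest(trainFile);        % hypothetical helper: train, then test
end
plot(sizes, testError, '-o');
xlabel('Training set size'); ylabel('Test set error');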