## Description

1 [30 points] Gaussian Discriminant Analysis

Suppose we are given a dataset $\{(x^{(i)}, t^{(i)});\ i = 1, \ldots, N\}$ consisting of $N$ independent examples, where the $x^{(i)} \in \mathbb{R}^M$ are $M$-dimensional vectors and $t^{(i)} \in \{0, 1\}$. We will model the joint distribution of $(x, t)$ according to:

$$p(t) = \phi^t (1 - \phi)^{1 - t}$$

$$p(x \mid t = 0) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)$$

$$p(x \mid t = 1) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)$$

Here, the parameters of our model are $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$. (Note that while there are two different mean vectors $\mu_0$ and $\mu_1$, there is only one covariance matrix $\Sigma$.)

(a) [8 points] Suppose we have already fit $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$, and now want to make a prediction at some new query point $x$. Show that the posterior distribution of the label at $x$ takes the form of a logistic function, and can be written as

$$p(t = 1 \mid x;\ \phi, \Sigma, \mu_0, \mu_1) = \frac{1}{1 + \exp(-w^T x)}$$

where $w$ is some appropriate function of $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$. (Note: To get your answer into the form above, for this part of the problem only, you may have to redefine the $x^{(i)}$'s to be $(M+1)$-dimensional vectors by adding the extra coordinate $x_0 = 1$, like we did in class.)


(b) [17 points] For this part of the problem only, you may assume $M$ (the dimension of $x$) is 1, so that $\Sigma = [\sigma^2]$ is just a real number, and likewise the determinant of $\Sigma$ is given by $|\Sigma| = \sigma^2$. Given the dataset, we claim that the maximum likelihood estimates of the parameters are given by

$$\phi = \frac{1}{N} \sum_{i=1}^{N} 1\{t^{(i)} = 1\}$$

$$\mu_0 = \frac{\sum_{i=1}^{N} 1\{t^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{N} 1\{t^{(i)} = 0\}}$$

$$\mu_1 = \frac{\sum_{i=1}^{N} 1\{t^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{N} 1\{t^{(i)} = 1\}}$$

$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \mu_{t^{(i)}})(x^{(i)} - \mu_{t^{(i)}})^T.$$

The log-likelihood of the data is

$$\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)}, t^{(i)};\ \phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)} \mid t^{(i)};\ \mu_0, \mu_1, \Sigma)\, p(t^{(i)};\ \phi)$$

By maximizing $\ell$ with respect to the four parameters, prove that the maximum likelihood estimates of $\phi$, $\Sigma$, $\mu_0$, and $\mu_1$ are indeed as given in the formulas above. (You may assume that there is at least one positive and one negative example, so that the denominators in the definitions of $\mu_0$ and $\mu_1$ above are non-zero.)

(c) [7 extra credit points] Redo part (b) without assuming $M = 1$.

(d) [5 points] Using the MLEs of the parameters given in part (b), compute each parameter's value. Load the file q1_data.mat, which contains the variables q1x_train, q1y_train, q1x_test, q1y_test. First, use the training data to estimate the parameter values, and report your estimates. Then, use your result and your answer in (a) to classify the test data, and report your test accuracy.

Helpful Matlab commands: sum, mean, repmat.

Matlab tip: given two matrices, X with dimensions $n \times m$ and Y with dimensions $n \times 1$, X(Y==1,:) will return the rows of X with a corresponding y value of 1.
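Although your submission for (d) must be in Matlab, the computation can be sketched in Python/NumPy for reference. This is a minimal, illustrative sketch: it applies the closed-form MLEs claimed in part (b) and then classifies via the logistic posterior from part (a). The function names and the separate intercept `b` (instead of the $x_0 = 1$ augmentation suggested in (a)) are our own choices, not part of the assignment.

```python
import numpy as np

def fit_gda(X, t):
    """Closed-form MLEs from part (b): phi, mu0, mu1, and shared Sigma."""
    N = X.shape[0]
    phi = np.mean(t == 1)
    mu0 = X[t == 0].mean(axis=0)
    mu1 = X[t == 1].mean(axis=0)
    # Shared covariance: average outer products of examples centered
    # at the mean of their own class.
    centered = X - np.where((t == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / N
    return phi, mu0, mu1, Sigma

def predict(X, phi, mu0, mu1, Sigma):
    """Classify via the logistic posterior p(t=1|x) from part (a)."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu0)
    b = (-0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0
         + np.log(phi / (1 - phi)))
    p1 = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return (p1 > 0.5).astype(int)
```

On the real q1 data you would call `fit_gda` on the training matrices and `predict` on the test matrices, then compare against the test labels to get the accuracy.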

2 [35 points] Softmax Regression via Gradient Ascent

Gradient Ascent is an algorithm used to find parameters that maximize a certain expression (in contrast to Gradient Descent, which is used to minimize an expression). For some function $f(w)$, Gradient Ascent finds $w^* = \arg\max_w f(w)$ according to the following pseudo-code:


Algorithm 1: Gradient Ascent

```
w* ← random
repeat
    w* ← w* + α ∇_w f(w*)
until convergence
return w*
```
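A minimal Python sketch of the pseudo-code above, applied to the toy concave objective $f(w) = -(w-3)^2$, whose gradient is $-2(w-3)$; the objective, step size, and fixed iteration count (standing in for a convergence test) are illustrative choices:

```python
def gradient_ascent(grad, w0, alpha=0.1, iters=200):
    """Repeatedly step uphill along the gradient, as in Algorithm 1."""
    w = w0
    for _ in range(iters):
        w = w + alpha * grad(w)  # w* <- w* + alpha * grad f(w*)
    return w

# Maximizer of f(w) = -(w - 3)^2 is w* = 3.
w_star = gradient_ascent(lambda w: -2.0 * (w - 3.0), w0=0.0)
```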

Further, Softmax regression is a multiclass classification algorithm. Given a dataset $D = \{(x^{(1)}, t^{(1)}), \ldots, (x^{(N)}, t^{(N)})\}$, where $t^{(i)}$ can be one of $\{1, 2, \ldots, K\}$ (a $K$-class classification setting), Softmax regression computes the probability that example $x$ belongs to class $k$ as:

$$p(t = k \mid x, w) = \frac{\exp\left(w_k^T \phi(x)\right)}{\sum_{j=1}^{K} \exp\left(w_j^T \phi(x)\right)}$$

where $w_k$ is a weight vector for class $k$. The above expression is over-parametrized, meaning that there is more than one unique $\{w_1, w_2, \ldots, w_K\}$ that gives identical probability measures for $p(t \mid x; w)$. It is possible (and common) to obtain a unique solution by using only $K - 1$ class weight vectors $w = \{w_1, w_2, \ldots, w_{K-1}\}$ and fixing $w_K = 0$:

$$p(t = k \mid x, w) = \frac{\exp\left(w_k^T \phi(x)\right)}{1 + \sum_{j=1}^{K-1} \exp\left(w_j^T \phi(x)\right)}, \quad \text{for } k = 1, 2, \ldots, K - 1$$

$$p(t = K \mid x, w) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp\left(w_j^T \phi(x)\right)}$$

We define the likelihood of the $i$-th training example $p(t^{(i)} \mid x^{(i)}; w)$ as:

$$p(t^{(i)} \mid x^{(i)}; w) = \prod_{k=1}^{K} \left[ p(t^{(i)} = k \mid x^{(i)}; w) \right]^{I(t^{(i)} = k)}$$

where $I(\cdot)$ is the indicator function. We define the likelihood (probability over the training data) as:

$$L(w) = \prod_{i=1}^{N} p(t^{(i)} \mid x^{(i)}; w) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left[ p(t^{(i)} = k \mid x^{(i)}; w) \right]^{I(t^{(i)} = k)}$$

Finally, we define the log-likelihood as:

$$l(w) = \log L(w) = \log \prod_{i=1}^{N} \prod_{k=1}^{K} \left[ p(t^{(i)} = k \mid x^{(i)}; w) \right]^{I(t^{(i)} = k)} = \sum_{i=1}^{N} \sum_{k=1}^{K} \log \left[ p(t^{(i)} = k \mid x^{(i)}; w) \right]^{I(t^{(i)} = k)}$$

(a) [18 points] Derive the gradient ascent update rule over the log-likelihood of the training data. In other words, derive the expression for $\nabla_{w_m} l(w)$ for $m = 1, \ldots, K - 1$. Show that:

$$\nabla_{w_m} l(w) = \sum_{i=1}^{N} \phi(x^{(i)}) \left[ I(t^{(i)} = m) - \frac{\exp\left(w_m^T \phi(x^{(i)})\right)}{1 + \sum_{j=1}^{K-1} \exp\left(w_j^T \phi(x^{(i)})\right)} \right] = \sum_{i=1}^{N} \phi(x^{(i)}) \left[ I(t^{(i)} = m) - p(t^{(i)} = m \mid x^{(i)}) \right]$$


[Hints: $\log a^b = b \log(a)$. Further, it helps to consider the two cases separately: one case for $t^{(i)} = k = m$, and another for $t^{(i)} = k \neq m$, which is equivalent to using the Kronecker delta¹ $\delta_{km}$.]

(b) [17 points] Using the gradient computed in part (a), implement Gradient Ascent for Softmax Regression in Matlab. Use a learning rate $\alpha = 0.0005$. Load the file q2_data.mat, which contains the variables q2x_train, q2t_train, q2x_test, q2t_test. Train your classifier on the training data and report the accuracy on the test data. Recall that Softmax Regression classifies an example $x$ as:

$$y = \arg\max_{y'} p(y' \mid x; w)$$

While you must write your own Softmax Regression implementation, feel free to double-check your accuracy by making use of Matlab's mnrfit(). If your accuracy is much less than the accuracy produced by Matlab's implementation, that means you did something wrong! In theory, you should get accuracy identical to Matlab's implementation (if you have a good stopping criterion, or run a large number of iterations), since this log-likelihood function is concave. Another powerful tool we recommend for checking your answer is LIBLINEAR, which is popular and widely used. You can find instructions and the installation package here: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
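Your submission must be in Matlab, but as a reference, batch gradient ascent with the part (a) gradient can be sketched in NumPy as below. The vectorization over all $m = 1, \ldots, K-1$ at once, the fixed iteration count, and the stability shift inside the softmax are our own implementation choices, not requirements of the assignment:

```python
import numpy as np

def train_softmax(Phi, t, K, alpha=0.0005, iters=2000):
    """Batch gradient ascent on l(w). Phi is (N, d); t holds labels 1..K.
    Keeps K-1 weight vectors, with w_K fixed to 0."""
    N, d = Phi.shape
    W = np.zeros((K - 1, d))
    for _ in range(iters):
        scores = Phi @ W.T                                  # (N, K-1)
        scores = np.hstack([scores, np.zeros((N, 1))])      # w_K = 0 column
        scores -= scores.max(axis=1, keepdims=True)         # stability shift
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)                   # p(t=k | x^(i))
        Ind = np.zeros((N, K))
        Ind[np.arange(N), t - 1] = 1.0                      # I(t^(i) = k)
        # Part (a) gradient, stacked for every m = 1..K-1 at once.
        grad = (Ind[:, :K - 1] - P[:, :K - 1]).T @ Phi
        W += alpha * grad                                   # ascent step
    return W

def classify(Phi, W):
    """y = argmax_{y'} p(y' | x; w), returned as labels 1..K."""
    scores = Phi @ W.T
    scores = np.hstack([scores, np.zeros((Phi.shape[0], 1))])
    return scores.argmax(axis=1) + 1
```

On the real q2 data you would train on the training matrices and report the fraction of test labels matched by `classify`.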

3 [35 points] Naive Bayes for classifying SPAM

In this problem, we will use the naive Bayes algorithm to build a SPAM classifier.²

In recent years, spam on electronic newsgroups has been an increasing problem. Here, we'll build a classifier to distinguish between "real" newsgroup messages and spam messages. For this experiment, we obtained a set of spam emails and a set of genuine newsgroup messages. Using only the subject line and body of each message, we'll learn to distinguish between spam and non-spam.

All the files for the problem are available on CTools. The text emails have been preprocessed to get them into a form usable by naive Bayes. You can look at two sample spam emails in the files spam_sample_original*, and their preprocessed forms in the files spam_sample_preprocessed*. The first line in the preprocessed format is just the label and is not part of the message. The preprocessing ensures that only the message body and subject remain in the dataset; email addresses (EMAILADDR), web addresses (HTTPADDR), currency (DOLLAR) and numbers (NUMBER) were also replaced by special tokens to allow them to be considered properly in the classification process. (In this problem, we're going to call the features "tokens" rather than "terms," since some of the features will correspond to special values like EMAILADDR. You don't have to worry about the distinction.) The files news_sample_original and news_sample_preprocessed also give an example of a non-spam email.

The work to extract feature vectors out of the documents has also been done for you, so you can just load in the design matrices (called document-term matrices in text classification) containing all the data. In a document-term matrix, the $i$-th row represents the $i$-th document/email, and the $j$-th column represents the $j$-th distinct token. Thus, the $(i, j)$-entry of this matrix represents the number of occurrences of the $j$-th token in the $i$-th document.

For this problem, we've chosen as our set of tokens considered (that is, as our vocabulary) only the medium-frequency tokens. The intuition is that tokens that occur too often or too rarely do not have much classification value. (Examples of tokens that occur very often are terms like "the," "and," and "of," which occur in so many emails and are sufficiently content-free that they aren't worth modeling.) Also, terms were stemmed using a standard stemming algorithm; basically, this means that "price," "prices" and "priced" have all been replaced with "price," so that they can be treated as the same term. For a list of the tokens used, see the file TOKENS_LIST.

Since the document-term matrix is extremely sparse (has lots of zero entries), we have stored it in our own efficient format to save space. You don't have to worry about this format. The file readMatrix.m provides the readMatrix function that reads in the document-term matrix and the correct class labels for the various documents. Code in nb_train.m and nb_test.m shows how readMatrix should be called. The documentation at the top of these two files will tell you all you need to know about the setup.

¹ https://en.wikipedia.org/wiki/Kronecker_delta
² Credit: question adapted from http://www.stanford.edu/class/cs229/ps/ps2.pdf

(a) [15 points] Implement a naive Bayes classifier for spam classification, using the multinomial event model and Laplace smoothing. You should use the code outline provided in nb_train.m to train your parameters, and then use these parameters to classify the test set data by filling in the code in nb_test.m. You may assume that any parameters computed in nb_train.m are in memory when nb_test.m is executed, and do not need to be recomputed (i.e., that nb_test.m is executed immediately after nb_train.m). Train your parameters using the document-term matrix in MATRIX.TRAIN, and then report the test set error on MATRIX.TEST.

Remark. If you implement naive Bayes in the straightforward way, you'll note that the computed $p(x \mid t) = \prod_i p(x_i \mid t)$ often equals zero. This is because $p(x \mid t)$, which is the product of many numbers less than one, is a very small number. The standard computer representation of real numbers cannot handle numbers that are too small, and instead rounds them off to zero. (This is called "underflow.") You'll have to find a way to compute naive Bayes' predicted class labels without explicitly representing very small numbers such as $p(x \mid t)$. [Hint: Think about using logarithms.]
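The logarithm hint amounts to comparing log-posteriors, since the products of conditional probabilities become sums of logs. A minimal sketch in Python/NumPy (the parameter names are illustrative; in your Matlab code they would come from the Laplace-smoothed estimates computed in nb_train.m):

```python
import numpy as np

def nb_predict_log(x, log_phi_spam, log_phi_ham, log_prior_spam, log_prior_ham):
    """Compare log-posteriors instead of multiplying tiny probabilities.

    x          : (V,) vector of token counts for one document.
    log_phi_*  : (V,) vectors of log p(token j | class).
    Returns 1 for SPAM, 0 otherwise.
    """
    # log p(t) + sum_j x_j * log p(token j | t); no underflow in log space.
    log_p_spam = log_prior_spam + x @ log_phi_spam
    log_p_ham = log_prior_ham + x @ log_phi_ham
    return int(log_p_spam > log_p_ham)
```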

(b) [10 points] Intuitively, some tokens may be particularly indicative of an email being in a particular class. We can try to get an informal sense of how indicative token $i$ is for the SPAM class by looking at:

$$\log \frac{p(x_j = i \mid t = 1)}{p(x_j = i \mid t = 0)} = \log \left( \frac{p(\text{token}_i \mid \text{email is SPAM})}{p(\text{token}_i \mid \text{email is NOTSPAM})} \right)$$

Using the parameters fit in part (a), find the 5 tokens that are most indicative of the SPAM class (i.e., that have the highest positive values on the measure above). The numbered list of tokens in the file TOKENS_LIST should be useful for identifying the terms/tokens.
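The ranking step can be sketched as follows (Python/NumPy for illustration; `log_phi_spam`, `log_phi_ham`, and `tokens` are placeholder names for the fitted log-probability vectors and the TOKENS_LIST vocabulary):

```python
import numpy as np

def most_indicative_tokens(log_phi_spam, log_phi_ham, tokens, n=5):
    """Rank tokens by log p(token|SPAM) - log p(token|NOTSPAM), largest first.

    log_phi_* : (V,) log-probability vectors over the vocabulary.
    tokens    : list of the V token strings, in vocabulary order.
    """
    ratio = log_phi_spam - log_phi_ham          # the measure from part (b)
    top = np.argsort(ratio)[::-1][:n]           # indices of largest ratios
    return [tokens[j] for j in top]
```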

(c) [10 points] Repeat part (a), but with training sets of sizes ranging from 50, 100, 200, . . . , up to 1400, by using the files MATRIX.TRAIN.*. Plot the test error each time (use MATRIX.TEST as the test data) to obtain a learning curve (test set error vs. training set size). You may need to change the call to readMatrix in nb_train.m to read the correct file each time. Which training-set size gives the best test set error?
