## Description

1 EM for Mixture of Gaussians

Mixture of K Gaussians is represented as

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \quad (1)$$

where $\pi_k$ represents the probability that a data point belongs to the $k$th component. As it is a probability, it satisfies $0 \le \pi_k \le 1$ and $\sum_k \pi_k = 1$. In this problem, we are going to represent this in a slightly different

manner with explicit latent variables. Specifically, we introduce a 1-of-K coding representation for latent variables $z^{(k)} \in \mathbb{R}^K$ for $k = 1, \dots, K$. Each $z^{(k)}$ is a binary vector of size $K$, with 1 only in the $k$th element and 0 in all others. That is,

$$z^{(1)} = [1; 0; \dots; 0]$$
$$z^{(2)} = [0; 1; \dots; 0]$$
$$\vdots$$
$$z^{(K)} = [0; 0; \dots; 1].$$

For example, if the second component generated data point $x^i$, its latent variable $z^i$ is given by $[0; 1; \dots; 0] = z^{(2)}$. With this representation, we can express $p(z)$ as

$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k},$$

1Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006, Springer.


where $z_k$ indicates the $k$th element of vector $z$.

Also, $p(x \mid z)$ can be represented similarly as

$$p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}.$$

By the sum rule of probability, (1) can be represented by

$$p(x) = \sum_{z \in Z} p(z) \, p(x \mid z), \quad (2)$$

where $Z = \{z^{(1)}, z^{(2)}, \dots, z^{(K)}\}$.

(a) Show that (2) is equivalent to (1). [5 pts]

(b) In reality, we do not know which component each data point came from. Thus, we estimate the responsibility (the expectation of $z_k^i$) in the E-step of EM. Since $z_k^i$ is either 1 or 0, its expectation is the probability that the point $x_i$ belongs to the $k$th component. In other words, we estimate $p(z_k^i = 1 \mid x_i)$. [5 pts]

Note that, in the E-step, we assume all other parameters, i.e. $\pi_k$, $\mu_k$, and $\Sigma_k$, are fixed, and we want to express $p(z_k^i \mid x_i)$ as a function of these fixed parameters.

Hint: Derive the formula for this estimation by using Bayes rule.
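As a numerical sanity check for the Bayes-rule formula (not a substitute for the derivation), the responsibility computation can be sketched as follows; the 1-D components, weights, and evaluation point are made up for illustration:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # 1-D Gaussian density N(x | mu, sigma^2)
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def responsibilities(x, pi, mu, sigma2):
    # Bayes rule: p(z_k = 1 | x) = pi_k N(x|mu_k, sigma2_k) / sum_j pi_j N(x|mu_j, sigma2_j)
    weighted = pi * gaussian_pdf(x, mu, sigma2)  # unnormalized posterior, shape (K,)
    return weighted / weighted.sum()

# Two symmetric components and a point halfway between them
pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 2.0])
sigma2 = np.array([1.0, 1.0])
gamma = responsibilities(0.0, pi, mu, sigma2)
# By symmetry, the point is equally responsible to both components
```

The responsibilities always sum to 1 over the components, since they form a posterior distribution over $z$.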

(c) In the M-step, we re-estimate the parameters $\pi_k$, $\mu_k$, and $\Sigma_k$ by maximizing the log-likelihood. Given $N$ i.i.d. (independent and identically distributed) data samples $x_1, \dots, x_N$, write down the log-likelihood function and derive the update formula for each parameter. [15 pts]

Note that in order to obtain an update rule for the M-step, we fix the responsibilities, i.e. $p(z_k^i \mid x_i)$, which we have already calculated in the E-step.

Hint: Use a Lagrange multiplier for $\pi_k$ to enforce its constraint.

(d) Establish the convergence of the EM algorithm by showing that the log likelihood function

will be nondecreasing in each iteration of this algorithm. [10 pts]

Hint: Use Jensen's inequality: for a concave function $f(x)$, we have $f\left(\sum_i \alpha_i x_i\right) \ge \sum_i \alpha_i f(x_i)$, where $\sum_i \alpha_i = 1$ and $\alpha_i \ge 0$.

(e) EM and K-Means [5 pts]

K-means can be viewed as a particular limit of EM for Gaussian mixture. Briefly describe the expectation

and maximization steps in the K-Means algorithm.
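The two K-means steps (hard assignment, then mean update) can be sketched minimally as below; the toy data and initial centers are made up for illustration:

```python
import numpy as np

def kmeans_step(X, centers):
    # "E-step" analogue: hard-assign each point to its nearest center
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, K)
    labels = dists.argmin(axis=1)
    # "M-step" analogue: each center becomes the mean of its assigned points
    new_centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels, new_centers

X = np.array([[0.0], [0.2], [5.0], [5.2]])  # two well-separated 1-D clusters
labels, centers = kmeans_step(X, np.array([[1.0], [4.0]]))
# points near 0 go to the first center, points near 5 to the second
```

Compared to EM for a Gaussian mixture, the soft responsibilities $\gamma$ collapse to 0/1 indicators, which is the limiting behavior as the shared covariance shrinks to zero.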

2 Density Estimation

(a) Given a sequence of $m$ independently and identically distributed (iid) data points $D = \{x^1, x^2, x^3, x^4, \dots, x^m\}$ which represent a coin-flip experiment with a biased coin, where $x^i \in \{0, 1\}$, 0 indicating tails and 1 indicating heads, find the maximum likelihood estimate (MLE) of the biased coin landing heads, using the Bernoulli distribution. The Bernoulli distribution is given by:

$$P(x \mid \theta) = \theta^x (1 - \theta)^{1 - x} \quad (3)$$


where $\theta$ is the probability of landing heads when flipping the biased coin. [6 pts]
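A quick numerical check of the Bernoulli MLE (the flips below are made up; the derivation itself is the exercise):

```python
import numpy as np

# Setting d/dtheta [ sum_i x_i log(theta) + (m - sum_i x_i) log(1 - theta) ] = 0
# gives theta_hat = (number of heads) / m, i.e. the sample mean.
D = np.array([1, 0, 1, 1, 0, 1])  # made-up flips: 4 heads out of 6
theta_hat = D.mean()

def log_likelihood(theta, D):
    return float(np.sum(D * np.log(theta) + (1 - D) * np.log(1 - theta)))

# theta_hat should score at least as well as any other candidate value
```

Evaluating `log_likelihood` at `theta_hat` versus, say, 0.5 confirms the sample mean maximizes the likelihood for this data.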

(b) Given $D = \{x^1, x^2, x^3, x^4, \dots, x^m\}$, a set of $m$ iid data points with $x^i \in \mathbb{R}$, find the MLE for the Gaussian distribution and in turn find the values of the mean ($\mu$) and variance ($\sigma^2$). The Gaussian distribution is given by:

$$p(x \mid \mu, \sigma) = \frac{1}{(2\pi)^{1/2} \sigma} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right) \quad (4)$$

[8 pts]
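As a numerical check of the Gaussian MLE (the samples below are made up; deriving the estimators is the exercise):

```python
import numpy as np

# The MLE solutions are the sample mean and the biased (divide-by-m) sample variance:
#   mu_hat = (1/m) sum_i x^i,   sigma2_hat = (1/m) sum_i (x^i - mu_hat)^2
D = np.array([1.0, 2.0, 3.0, 4.0])       # made-up samples
mu_hat = D.mean()
sigma2_hat = np.mean((D - mu_hat) ** 2)  # note: divides by m, not m - 1
```

Note the MLE variance divides by $m$, not $m - 1$; the latter is the unbiased estimator, not the maximizer of the likelihood.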

(c) Given $D = \{x^1, x^2, x^3, x^4, \dots, x^m\}$, a set of $m$ iid data points with $x^i \in [0, 1)$, split the samples into $n$ bins $B_1$ to $B_n$ with $C_1$ to $C_n$ data points in each bin, respectively. Find the 1-D histogram density estimation function $p(x)$ in terms of $n$, $m$, $C_j$, and $B_j$. What conditions must the density estimation function satisfy in order to be valid? Prove that the histogram density estimation function found in the previous step is valid. [6 pts]
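A sketch of the histogram estimator under the stated setup (equal-width bins on $[0, 1)$); the sample data is made up for illustration:

```python
import numpy as np

def histogram_density(D, n):
    # n equal-width bins on [0, 1), width h = 1/n, with C_j points in bin B_j:
    #   p(x) = C_j / (m * h) = n * C_j / m   for x in B_j
    m = len(D)
    C = np.bincount((D * n).astype(int), minlength=n)  # C_j = count in bin B_j
    return n * C / m  # density value on each bin

D = np.array([0.05, 0.15, 0.18, 0.55, 0.95])  # made-up samples in [0, 1)
p = histogram_density(D, n=10)
# validity: p >= 0 everywhere, and sum_j p_j * (1/n) = 1
```

The validity check in the last comment mirrors the proof asked for above: non-negativity is immediate, and the piecewise-constant density integrates to $\sum_j \frac{n C_j}{m} \cdot \frac{1}{n} = \frac{1}{m}\sum_j C_j = 1$.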

(d) Answer the following questions. [10 pts]

• What are parametric and non-parametric models? Give 2 examples each of parametric and non-parametric models.

• What are the 4 properties that smoothing kernel functions should follow?

• What are the two main drawbacks of histograms? How is kernel density estimation better than histograms?

• Give 4 examples of smoothing kernel functions.

• What are the differences between parametric and non-parametric models?

3 Information Theory

In the lecture you became familiar with the concept of entropy for one random variable and with mutual information. One property of mutual information is $I(X; Z) \ge 0$, where a larger value indicates a stronger dependence between the two variables.

Let X and Y take on values x1, x2, …, xr and y1, y2, …, ys respectively. Let Z also be a discrete random

variable and Z = X + Y .

(a) Show that $H(Z \mid X) = H(Y \mid X)$. Argue that if $X$ and $Y$ are independent, then $H(Y) \le H(Z)$ and $H(X) \le H(Z)$. Thus the addition of independent random variables adds uncertainty. [4 pts]

(b) Give an example of (necessarily dependent) random variables X, Y in which H(X) > H(Z) and

H(Y ) > H(Z). [2 pts]

(c) Use the following two properties of entropy:

(1) Chain rule: $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$

(2) The entropy of a function of a random variable is no larger than that of the random variable itself: $H(g(V)) \le H(V)$.

Show under what conditions $H(Z) = H(X) + H(Y)$ holds. [4 pts]

4 Programming: Text Clustering

In this problem, we will explore the use of the EM algorithm for text clustering. Text clustering is a technique for unsupervised document organization and information retrieval. We want to find how to group a set of different text documents based on their topics. First we will analyze a model to represent the data.

Bag of Words

The simplest model for text documents is to treat them as a collection of words. To keep the model simple, we keep the collection unordered, disregarding grammar and word order. We count how often each word appears in each document and store the word counts in a matrix, where each row of the matrix represents one document and each column represents a specific word from the document dictionary. Suppose we represent the set of $n_d$ documents using a matrix of word counts like this:

$$D_{1:n_d} = \begin{bmatrix} 2 & 6 & \cdots & 4 \\ 2 & 4 & \cdots & 0 \\ \vdots & \vdots & & \vdots \end{bmatrix} = T$$

This means that word $W_1$ occurs twice in document $D_1$. Word $W_{n_w}$ occurs 4 times in document $D_1$ and not at all in document $D_2$.

Multinomial Distribution

The simplest distribution representing a text document is the multinomial distribution (Bishop Chapter 2.2). The probability of a document $D_i$ is:

$$p(D_i) = \prod_{j=1}^{n_w} \mu_j^{T_{ij}}$$

Here, $\mu_j$ denotes the probability of a particular word in the text being equal to $w_j$, and $T_{ij}$ is the count of that word in document $D_i$. So the probability of document $D_1$ would be $p(D_1) = \mu_1^2 \cdot \mu_2^6 \cdots \mu_{n_w}^4$.

Mixture of Multinomial Distributions

In order to do text clustering, we want to use a mixture of multinomial distributions, so that each topic has

a particular multinomial distribution associated with it, and each document is a mixture of different topics.

We define $p(c) = \pi_c$ as the mixture coefficient of a document containing topic $c$, and each topic is modeled by a multinomial distribution $p(D_i \mid c)$ with parameters $\mu_{jc}$. Then we can write each document as a mixture over topics as

$$p(D_i) = \sum_{c=1}^{n_c} p(D_i \mid c) \, p(c) = \sum_{c=1}^{n_c} \pi_c \prod_{j=1}^{n_w} \mu_{jc}^{T_{ij}}$$


EM for Mixture of Multinomials

In order to cluster a set of documents, we need to fit this mixture model to data. In this problem, the EM algorithm can be used for fitting the mixture model. This will be a simple topic model for documents: each topic is a multinomial distribution over words (a mixture component). The EM algorithm for such a topic model consists of iterating the following steps:

1. Expectation

Compute the expectation of document $D_i$ belonging to cluster $c$:

$$\gamma_{ic} = \frac{\pi_c \prod_{j=1}^{n_w} \mu_{jc}^{T_{ij}}}{\sum_{c'=1}^{n_c} \pi_{c'} \prod_{j=1}^{n_w} \mu_{jc'}^{T_{ij}}}$$

2. Maximization

Update the mixture parameters, i.e. the probability of a word being $W_j$ in cluster (topic) $c$, as well as the prior probability of each cluster:

$$\mu_{jc} = \frac{\sum_{i=1}^{n_d} \gamma_{ic} T_{ij}}{\sum_{i=1}^{n_d} \sum_{l=1}^{n_w} \gamma_{ic} T_{il}}$$

$$\pi_c = \frac{1}{n_d} \sum_{i=1}^{n_d} \gamma_{ic}$$
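One EM iteration for this model can be sketched as below. The toy count matrix and initialization are made up; computing $\gamma_{ic}$ in log space is an implementation choice (not part of the derivation) that avoids underflow when the products of many small $\mu_{jc}^{T_{ij}}$ factors are formed directly:

```python
import numpy as np

def em_step(T, pi, mu):
    # T: (nd, nw) count matrix; pi: (nc,) cluster priors; mu: (nw, nc) word probs.
    # E-step in log space: log gamma_ic ∝ log pi_c + sum_j T_ij log mu_jc
    log_w = np.log(pi)[None, :] + T @ np.log(mu)  # (nd, nc)
    log_w -= log_w.max(axis=1, keepdims=True)     # stabilize before exponentiating
    gamma = np.exp(log_w)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: mu_jc ∝ sum_i gamma_ic T_ij;  pi_c = (1/nd) sum_i gamma_ic
    mu_new = T.T @ gamma                          # (nw, nc)
    mu_new /= mu_new.sum(axis=0, keepdims=True)
    pi_new = gamma.mean(axis=0)
    return gamma, pi_new, mu_new

rng = np.random.default_rng(0)
T = rng.integers(1, 5, size=(6, 8)).astype(float)  # toy count matrix
mu0 = rng.random((8, 2)); mu0 /= mu0.sum(axis=0)   # random normalized init
gamma, pi1, mu1 = em_step(T, np.array([0.5, 0.5]), mu0)
```

After each iteration, every row of $\gamma$ sums to 1, each column of $\mu$ sums to 1, and $\pi$ sums to 1, matching the constraints in the update formulas above.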

Task [20 pts]

Implement the algorithm and run it on the toy dataset data.mat. You can find a detailed description of the data in the homework2.m file. Observe the results and compare them with the provided true cluster each document belongs to. Report an evaluation (e.g. accuracy) of your implementation.

Hint: We already did the word counting for you, so the data file only contains a count matrix like the one shown above. For the toy dataset, set the number of clusters $n_c = 4$. You will need to initialize the parameters. Try several different random initial values for $\mu_{jc}$, the probability of a word being $W_j$ in topic $c$. Make sure it is normalized. Make sure you do not use the true cluster information during your learning phase.

Extra Credit: Realistic Topic Models [20 pts]

The above model assumes that all the words in a document belong to the same topic. However, in real-world datasets, it is more likely that some words in a document belong to one topic while other words belong to other topics. For example, in a news report, some words may talk about "Ebola" and "health", while others may mention "administration" and "congress". In order to model this phenomenon, we should model each word as a mixture of possible topics.

Specifically, consider the log-likelihood of the joint distribution of documents and words

$$L = \sum_{d \in D} \sum_{w \in W} T_{dw} \log P(d, w), \quad (5)$$

where $T_{dw}$ is the count of word $w$ in document $d$. This count matrix is provided as input.

The joint distribution of a specific document and a specific word is modeled as a mixture

$$P(d, w) = \sum_{z \in Z} P(z) P(w \mid z) P(d \mid z), \quad (6)$$


where $P(z)$ is the mixture proportion, $P(w \mid z)$ is the distribution over the vocabulary for the $z$-th topic, and $P(d \mid z)$ is the probability of the document for the $z$-th topic. These are the parameters of the model.

The E-step calculates the posterior distribution of the latent variable conditioned on all other variables:

$$P(z \mid d, w) = \frac{P(z) P(w \mid z) P(d \mid z)}{\sum_{z'} P(z') P(w \mid z') P(d \mid z')}. \quad (7)$$

In the M-step, we maximize the expected complete log-likelihood with respect to the parameters and get the following update rules:

$$P(w \mid z) = \frac{\sum_d T_{dw} P(z \mid d, w)}{\sum_{w'} \sum_d T_{dw'} P(z \mid d, w')} \quad (8)$$

$$P(d \mid z) = \frac{\sum_w T_{dw} P(z \mid d, w)}{\sum_{d'} \sum_w T_{d'w} P(z \mid d', w)} \quad (9)$$

$$P(z) = \frac{\sum_d \sum_w T_{dw} P(z \mid d, w)}{\sum_{z'} \sum_{d'} \sum_{w'} T_{d'w'} P(z' \mid d', w')}. \quad (10)$$
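One E/M iteration of equations (7)-(10) can be sketched as below; the toy counts, initialization, and array layout (documents on axis 0, words on axis 1, topics on axis 2) are illustration choices, and a real implementation would avoid materializing the full $(n_d, n_w, n_z)$ posterior for large corpora:

```python
import numpy as np

def plsa_step(T, Pz, Pwz, Pdz):
    # T: (nd, nw) counts; Pz: (nz,); Pwz[w, z] = P(w|z); Pdz[d, z] = P(d|z)
    # E-step (eq. 7): posterior P(z | d, w) for every (d, w) pair
    post = Pz[None, None, :] * Pwz[None, :, :] * Pdz[:, None, :]  # (nd, nw, nz)
    post /= post.sum(axis=2, keepdims=True)
    # M-step (eqs. 8-10): re-estimate parameters from T_dw-weighted posteriors
    TP = T[:, :, None] * post
    Pwz_new = TP.sum(axis=0); Pwz_new /= Pwz_new.sum(axis=0, keepdims=True)
    Pdz_new = TP.sum(axis=1); Pdz_new /= Pdz_new.sum(axis=0, keepdims=True)
    Pz_new = TP.sum(axis=(0, 1)); Pz_new /= Pz_new.sum()
    return Pz_new, Pwz_new, Pdz_new

rng = np.random.default_rng(1)
T = rng.integers(1, 4, size=(5, 7)).astype(float)  # toy count matrix
Pwz = rng.random((7, 3)); Pwz /= Pwz.sum(axis=0)   # random normalized init
Pdz = rng.random((5, 3)); Pdz /= Pdz.sum(axis=0)
Pz, Pwz, Pdz = plsa_step(T, np.full(3, 1 / 3), Pwz, Pdz)
```

After each iteration, $P(w \mid z)$ and $P(d \mid z)$ remain normalized over $w$ and $d$ respectively, and $P(z)$ sums to 1.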

Task

Implement EM for maximum likelihood estimation and cluster the text data provided in the nips.mat file you downloaded. You can print out the top key words for the topics/clusters by using the show_topics.m utility. It takes two parameters: 1) your learned conditional distribution matrix, i.e., $P(w \mid z)$, and 2) a cell array of words that corresponds to the vocabulary. You can find the cell array wl in the nips.mat file. Try different values of $k$ and see which values produce sensible topics. In assessing your code, we will use another dataset and observe the produced topics.
