CSE/ISYE 6740 Homework 2 solution

1 EM for Mixture of Gaussians
A mixture of K Gaussians is represented as
p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k),    (1)
where π_k represents the probability that a data point belongs to the kth component. As it is a probability, it satisfies 0 ≤ π_k ≤ 1 and Σ_k π_k = 1. In this problem, we are going to represent this in a slightly different manner with explicit latent variables. Specifically, we introduce a 1-of-K coding representation for latent variables z^(k) ∈ R^K for k = 1, …, K. Each z^(k) is a binary vector of size K, with 1 only in the kth element and 0 in all others. That is,
z^(1) = [1; 0; …; 0]
z^(2) = [0; 1; …; 0]
⋮
z^(K) = [0; 0; …; 1].
Footnotes:
1. https://matlab.mathworks.com/
2. You can also use the environment for developing by registering online with your university email.
3. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
For example, if the second component generated data point x_n, its latent variable z_n is given by [0; 1; …; 0] = z^(2). With this representation, we can express p(z) as
p(z) = Π_{k=1}^{K} π_k^{z_k},
where z_k indicates the kth element of vector z. Also, p(x|z) can be represented similarly as
p(x|z) = Π_{k=1}^{K} N(x | µ_k, Σ_k)^{z_k}.
By the sum rule of probability, (1) can be represented by
p(x) = Σ_{z∈Z} p(z) p(x|z),    (2)
where Z = {z^(1), z^(2), …, z^(K)}.
(a) Show that (2) is equivalent to (1). [5 pts]
(b) In reality, we do not know which component each data point came from. Thus, we estimate the responsibility (the expectation of z_k^n) in the E-step of EM. Since z_k^n is either 1 or 0, its expectation is the probability that the point x_n belongs to component k. In other words, we estimate p(z_k^n | x_n). Derive the formula for this estimate using Bayes' rule. Note that, in the E-step, we assume all other parameters, i.e. π_k, µ_k, and Σ_k, are fixed, and we want to express p(z_k^n | x_n) as a function of these fixed parameters. [10 pts]
(c) In the M-step, we re-estimate the parameters π_k, µ_k, and Σ_k by maximizing the log-likelihood. Given N i.i.d. (independent, identically distributed) data samples, derive the update formula for each parameter. Note that in order to obtain an update rule for the M-step, we fix the responsibilities, i.e. p(z_k^n | x_n), which we have already calculated in the E-step. [15 pts]
Hint: Use a Lagrange multiplier for π_k to apply the constraint on it.
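For reference when checking your derivation, the standard E- and M-step updates for a Gaussian mixture can be sketched in Python/NumPy as follows. This is an illustrative sketch, not the required derivation itself; X is an N×d data array, and pi, mu, Sigma hold the current estimates of π_k, µ_k, Σ_k.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    # gamma[n, k] = p(z_k^n = 1 | x_n): responsibility of component k for x_n,
    # obtained by Bayes' rule with pi, mu, Sigma held fixed.
    N, K = X.shape[0], len(pi)
    gamma = np.empty((N, K))
    for k in range(K):
        gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
    return gamma / gamma.sum(axis=1, keepdims=True)  # normalize over components

def m_step(X, gamma):
    # Re-estimate pi_k, mu_k, Sigma_k with the responsibilities held fixed.
    Nk = gamma.sum(axis=0)                  # effective number of points per component
    pi = Nk / X.shape[0]                    # follows from the Lagrange-multiplier constraint
    mu = (gamma.T @ X) / Nk[:, None]        # responsibility-weighted means
    Sigma = []
    for k in range(gamma.shape[1]):
        d = X - mu[k]
        Sigma.append((gamma[:, k, None] * d).T @ d / Nk[k])  # weighted covariances
    return pi, mu, np.array(Sigma)
```

Iterating e_step and m_step until the log-likelihood stops improving gives the full EM algorithm.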
(d) EM and K-Means [10 pts]
K-means can be viewed as a particular limit of EM for a Gaussian mixture. Considering a mixture model in which all components have covariance εI, show that in the limit ε → 0, maximizing the expected complete-data log-likelihood for this model is equivalent to minimizing the objective function of K-means:
J = Σ_{n=1}^{N} Σ_{k=1}^{K} γ_nk ‖x_n − µ_k‖²,
where γ_nk = 1 if x_n belongs to the kth cluster and γ_nk = 0 otherwise.
2 Density Estimation
Consider a histogram-like density model in which the space x is divided into fixed regions, the density p(x) takes a constant value h_i over the ith region, and the volume of region i is denoted ∆_i. Suppose we have a set of N observations of x such that n_i of these observations fall in region i.
(a) What is the log-likelihood function? [8 pts]
(b) Derive an expression for the maximum likelihood estimator for h_i. [10 pts]
Hint: This is a constrained optimization problem. Remember that p(x) must integrate to unity. Since p(x) has the constant value h_i over region i, which has volume ∆_i, the normalization constraint is Σ_i h_i ∆_i = 1. Use a Lagrange multiplier by adding λ (Σ_i h_i ∆_i − 1) to your objective function.
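The estimator that results from this constrained maximization is h_i = n_i / (N ∆_i). As a quick numeric sanity check (the region volumes and counts below are made-up illustration values):

```python
# Numeric check of the histogram ML estimate h_i = n_i / (N * Delta_i).
# Region volumes and counts are arbitrary illustration values.
delta = [0.5, 1.0, 2.0]        # volume of each region
n = [10, 25, 15]               # observations falling in each region
N = sum(n)                     # total number of observations

h = [n_i / (N * d_i) for n_i, d_i in zip(n, delta)]

# The estimate automatically satisfies the constraint sum_i h_i * Delta_i = 1.
assert abs(sum(h_i * d_i for h_i, d_i in zip(h, delta)) - 1.0) < 1e-12
```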
(c) Mark T if it is always true, and F otherwise. Briefly explain why. [12 pts]
• Non-parametric density estimation usually does not have parameters.
• The Epanechnikov kernel is the optimal kernel function for all data.
• Histogram is an efficient way to estimate density for high-dimensional data.
• Parametric density estimation assumes the shape of probability density.
3 Information Theory
In the lecture you became familiar with the concept of entropy for one random variable and mutual information. For a pair of discrete random variables X and Y with the joint distribution p(x, y), the joint entropy
H(X, Y ) is defined as
H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y) (3)
which can also be expressed as
H(X, Y ) = −E[log p(X, Y )] (4)
Let X and Y take on values x1, x2, …, xr and y1, y2, …, ys respectively. Let Z also be a discrete random
variable and Z = X + Y .
(a) Prove that H(X, Y ) ≤ H(X) + H(Y ) [4 pts]
(b) Show that I(X; Y ) = H(X) + H(Y ) − H(X, Y ). [2 pts]
(c) Under what conditions does H(Z) = H(X) + H(Y)? [4 pts]
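The identities in parts (a) and (b) can be sanity-checked numerically. A small sketch with an arbitrary joint distribution (entropies in bits):

```python
import numpy as np

def H(p):
    # Entropy in bits; by convention 0 * log 0 = 0, so zero entries are dropped.
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Arbitrary joint distribution p(x, y) over a 2x3 alphabet (sums to 1).
pxy = np.array([[0.10, 0.20, 0.05],
                [0.25, 0.30, 0.10]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginals

# (a) subadditivity: H(X, Y) <= H(X) + H(Y)
assert H(pxy) <= H(px) + H(py) + 1e-12

# (b) mutual information I(X; Y) = H(X) + H(Y) - H(X, Y), which is nonnegative
I = H(px) + H(py) - H(pxy)
assert I >= -1e-12
```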
4 Programming: Text Clustering
In this problem, we will explore the use of the EM algorithm for text clustering. Text clustering is a technique for unsupervised document organization and information retrieval. We want to find how to group a set of different text documents based on their topics. First we will analyze a model to represent the data.
Bag of Words
The simplest model for text documents is to understand them as a collection of words. To keep the model simple, we keep the collection unordered, disregarding grammar and word order. We count how often each word appears in each document and store the word counts in a matrix, where each row of the matrix represents one document and each column of the matrix represents a specific word from the document dictionary. Suppose we represent the set of n_d documents using a matrix of word counts like this:
D_{1:n_d} = [ 2 6 … 4
              2 4 … 0
              ⋮         ] = T
This means that word W_1 occurs twice in document D_1. Word W_{n_w} occurs 4 times in document D_1 and not at all in document D_2.
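As a sketch of how such a count matrix is built from raw text (the two toy documents and their vocabulary here are made up for illustration):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]   # toy documents
vocab = sorted(set(w for d in docs for w in d.split()))

# T[i][j] = number of times vocab[j] occurs in document i (one row per document)
T = [[Counter(d.split())[w] for w in vocab] for d in docs]
```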
Multinomial Distribution
The simplest distribution representing a text document is the multinomial distribution (Bishop Chapter 2.2). The probability of a document D_i is:
p(D_i) = Π_{j=1}^{n_w} µ_j^{T_ij}
Here, µ_j denotes the probability of a particular word in the text being equal to w_j, and T_ij is the count of that word in document D_i. So the probability of document D_1 would be p(D_1) = µ_1^2 · µ_2^6 · … · µ_{n_w}^4.
Mixture of Multinomial Distributions
In order to do text clustering, we want to use a mixture of multinomial distributions, so that each topic has a particular multinomial distribution associated with it, and each document is a mixture of different topics. We define p(c) = π_c as the mixture coefficient of a document containing topic c, and each topic is modeled by a multinomial distribution p(D_i | c) with parameters µ_jc. Then we can write each document as a mixture over topics as
p(D_i) = Σ_{c=1}^{n_c} p(D_i | c) p(c) = Σ_{c=1}^{n_c} π_c Π_{j=1}^{n_w} µ_jc^{T_ij}
EM for Mixture of Multinomials
In order to cluster a set of documents, we need to fit this mixture model to data. In this problem, the EM algorithm can be used for fitting the mixture model, which gives a simple topic model for documents: each topic is a multinomial distribution over words (a mixture component). The EM algorithm for this topic model consists of iterating the following steps:
1. Expectation
Compute the expectation of document D_i belonging to cluster c:
γ_ic = π_c Π_{j=1}^{n_w} µ_jc^{T_ij} / ( Σ_{c'=1}^{n_c} π_{c'} Π_{j=1}^{n_w} µ_jc'^{T_ij} )
2. Maximization
Update the mixture parameters, i.e. the probability of a word being W_j in cluster (topic) c, as well as the prior probability of each cluster:
µ_jc = Σ_{i=1}^{n_d} γ_ic T_ij / ( Σ_{i=1}^{n_d} Σ_{l=1}^{n_w} γ_ic T_il )
π_c = (1/n_d) Σ_{i=1}^{n_d} γ_ic
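The two steps above can be sketched in Python/NumPy as follows, working in log-space so the product of many word probabilities does not underflow. This is a sketch under the notation above, not a reference implementation; T is the n_d × n_w count matrix, mu is n_w × n_c with columns summing to 1, and pi holds the π_c.

```python
import numpy as np
from scipy.special import logsumexp

def e_step(T, pi, mu):
    # log gamma_ic is proportional to log pi_c + sum_j T_ij log mu_jc
    log_post = np.log(pi) + T @ np.log(mu)          # shape (n_d, n_c)
    # Normalize over clusters in log-space (the denominator of the E-step formula)
    return np.exp(log_post - logsumexp(log_post, axis=1, keepdims=True))

def m_step(T, gamma):
    mu = T.T @ gamma                                # numerator: sum_i gamma_ic T_ij
    mu /= mu.sum(axis=0, keepdims=True)             # denominator: normalize over words j
    pi = gamma.mean(axis=0)                         # pi_c = (1/n_d) sum_i gamma_ic
    return pi, mu
```

In practice mu should be initialized with random positive values normalized per column, and a small smoothing constant can be added before taking logs to avoid log(0).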
Task [20 pts]
Implement the algorithm and run it on the toy dataset data.mat. You can find a detailed description of the data in the homework2.m file. Observe the results and compare them with the provided true clusters each document belongs to. Report the evaluation (e.g. accuracy) of your implementation.
Hint: We already did the word counting for you, so the data file only contains a count matrix like the one shown above. For the toy dataset, set the number of clusters n_c = 4. You will need to initialize the parameters. Try several different random initial values for the probability of a word being W_j in topic c, µ_jc. Make sure you normalize it. Make sure that you do not use the true cluster information during your learning phase.
Extra Credit: Realistic Topic Models [20 pts]
The above model assumes that all the words in a document belong to the same topic. However, in real-world datasets, it is more likely that some words in a document belong to one topic while other words belong to other topics. For example, in a news report, some words may talk about “Ebola” and “health”, while others may mention “administration” and “congress”. In order to model this phenomenon, we should model each word as a mixture of possible topics.
Specifically, consider the log-likelihood of the joint distribution of documents and words
L = Σ_{d∈D} Σ_{w∈W} T_dw log P(d, w),    (5)
where T_dw is the count of word w in document d. This count matrix is provided as input.
The joint distribution of a specific document and a specific word is modeled as a mixture
P(d, w) = Σ_{z∈Z} P(z) P(w|z) P(d|z),    (6)
where P(z) is the mixture proportion, P(w|z) is the distribution over the vocabulary for the zth topic, and P(d|z) is the probability of the document for the zth topic. These are the parameters of the model.
The E-step calculates the posterior distribution of the latent variable conditioned on all other variables:
P(z|d, w) = P(z) P(w|z) P(d|z) / ( Σ_{z'} P(z') P(w|z') P(d|z') ).    (7)
In the M-step, we maximize the expected complete log-likelihood with respect to the parameters, and get the following update rules:
P(w|z) = Σ_d T_dw P(z|d, w) / ( Σ_{w'} Σ_d T_dw' P(z|d, w') )    (8)
P(d|z) = Σ_w T_dw P(z|d, w) / ( Σ_{d'} Σ_w T_d'w P(z|d', w) )    (9)
P(z) = Σ_d Σ_w T_dw P(z|d, w) / ( Σ_{z'} Σ_{d'} Σ_{w'} T_d'w' P(z'|d', w') )    (10)
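One joint E/M iteration of updates (7)–(10) can be sketched in NumPy with dense arrays. This is illustration only; a real run on the nips.mat vocabulary would likely need sparse matrices. Here Pwz is an n_w × n_z matrix whose columns are P(w|z), Pdz is n_d × n_z with columns P(d|z), and Pz is the length-n_z vector P(z).

```python
import numpy as np

def plsa_em_step(T, Pz, Pwz, Pdz):
    # E-step (7): posterior P(z|d,w), stored with shape (n_z, n_d, n_w)
    joint = Pz[:, None, None] * Pdz.T[:, :, None] * Pwz.T[:, None, :]
    post = joint / joint.sum(axis=0, keepdims=True)
    # M-step (8)-(10): accumulate T_dw * P(z|d,w), then normalize each factor
    weighted = T[None, :, :] * post
    Pwz = weighted.sum(axis=1).T                    # sum over documents d
    Pwz /= Pwz.sum(axis=0, keepdims=True)           # (8): normalize over words
    Pdz = weighted.sum(axis=2).T                    # sum over words w
    Pdz /= Pdz.sum(axis=0, keepdims=True)           # (9): normalize over documents
    Pz = weighted.sum(axis=(1, 2))
    Pz /= Pz.sum()                                  # (10): normalize over topics
    return Pz, Pwz, Pdz
```

Iterating this step until (5) converges fits the model; the top entries of each column of Pwz then give the key words per topic.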
Task
Implement EM for maximum likelihood estimation and cluster the text data provided in the nips.mat file you downloaded. You can print out the top key words for the topics/clusters by using the show_topics.m utility. It takes two parameters: 1) your learned conditional distribution matrix, i.e., P(w|z), and 2) a cell array of words that corresponds to the vocabulary. You can find the cell array wl in the nips.mat file. Try different values of k and see which values produce sensible topics. In assessing your code, we will use another dataset and observe the produced topics.