## Description

This homework contains 3 questions. The last question requires programming. The maximum number

of points is 100 plus 20 bonus points.

1 Manual calculation of one round of EM for a GMM [30 points]

(Extended version of: Murphy Exercise 11.7) In this question we consider clustering 1D data with a mixture

of 2 Gaussians using the EM algorithm. You are given the 1-D data points x = [1 10 20].

M step

Suppose the output of the E step is the following matrix:

R =

1 0

0.4 0.6

0 1

where entry Ri,c is the probability of observation xi belonging to cluster c (the responsibility of cluster c

for data point i). You just have to compute the M step. You may state the equations for maximum likelihood

estimates of these quantities (which you should know) without proof; you just have to apply the equations to

this data set. You may leave your answer in fractional form. Show your work.

1. [3 points] Write down the likelihood function you are trying to optimize.

2. [6 points] After performing the M step for the mixing weights π1, π2, what are the new values?

3. [6 points] After performing the M step for the means µ1 and µ2, what are the new values?

4. [6 points] After performing the M step for the standard deviations σ1 and σ2, what are the new values?

E step

Now suppose the output of the M step is the answer to the previous section. You will compute the subsequent

E step.

1. [3 points] Write down the formula for the probability of observation xi belonging to cluster c.

2. [6 points] After performing the E step, what is the new value of R?

2 HMM with tied mixtures (25 points)

(Adapted from Murphy Exercise 13.4) Consider an HMM where the observation model has the form:

p(Ot

|Xt = j, θ) = X

K

k=1

wjkN (Ot

|µk, Σk) ∀j ∈ {1, · · · , M}. (1)

In this model, we assume there are M types of hidden states. The observation for each hidden state is given

above; it is a mixture of K Gaussians. However, the Gaussians are shared between the states, and the state

influences the mixing weights but not the means and covariances. This is called semi-continuous HMM or

tied-mixture HMM.

1

1. (5 points) List all the parameters of this HMM model

2. (10 points) Derive the E step. What do we need to estimate in the E step?

3. (10 points) Derive the M step. How do we update the parameters of the model?

3 (Implementation) Linear-Chain Hidden CRF for gesture recognition (45

points)

In this question, you will implement Hidden CRF [1] and use it for gesture recognition.

3.1 General Hidden CRF

Consider the general form of of Hidden CRF, which assumes the following probability model:

P(y, s, X|w) = 1

Z

exp(w · φ(y, s, X)) (2)

where X is the observed variable or set of variables, y is the target, and s is hidden variable(s). The parameter

vectors of the CRF is w, which is what we need to learn. Define:

Z(y, X) = X

s

exp(w · φ(y, s, X)) (3)

Z(X) = X

y

Z(y, X) (4)

A Hidden CRF models the conditional probability of a class label y given a set of observations X by:

P(y|X, w) = X

s

P(y, s|X, w) = X

s

P(y, s, X|w)

P(X|w)

=

Z(y, X)

Z(X)

(5)

Given a training set {(Xi

, yi

)}, we train the parameters by minimizing the regularized negative log likelihood:

L(w) = λ

2

||w||2 −

1

n

Xn

i=1

log P(y

i

|Xi

, w). (6)

To minimize this objective function, we can use a gradient-based optimization method. This requires computing

the gradient for a single training example:

∂ log P(y|X, w)

∂w

=

∂ log Z(y, X)

∂w

−

∂ log Z(X)

∂w

(7)

We have:

∂ log Z(y, X)

∂w

=

1

Z(y, X)

X

s

exp(w · φ(y, s, X))φ(y, s, X) = X

s

P(s|y, X, w)φ(y, s, X) (8)

We can also show

∂ log Z(X)

∂w

=

X

y

X

s

P(s|y, X, w)φ(y, s, X)

!

P(y|X, w) (9)

Thus to use gradient-based optimization technique, we need an efficient way to compute:

P

s P(s|y, X, w)φ(y, s, X) and P(y|X, w) for all y.

2

3.2 Linear-chain Hidden CRF

Consider the sequence classification problem where the target label is one of m classes, y ∈ {1, · · · , m}.

The observed variables X is a sequence of T frames X = X1:T . Each frame Xt corresponds to a hidden

state st

, which belong to one of k states. For convenience of notation, we introduce a special state s0 which

always assumes value k + 1. For the linear-chain CRF, the feature vector is defined as:

φ(y, s, X) = X

T

t=1

φt(y, st

, st−1, X) (10)

Then:

X

s

P(s|y, X, w)φ(y, s, X) = X

s

P(s|y, X, w)

X

t

φt(y, st

, st−1, X) (11)

=

X

t

X

s

P(s|y, X, w)φt(y, st

, st−1, X) (12)

=

X

t

X

st,st−1

P(st

, st−1|y, X, w)φt(y, st

, st−1, X) (13)

The forward-backward algorithm:

α1(y, s1) = exp(w · φ1(y, s1, s0, X)) (14)

αi(y, si) = X

si−1

exp(w · φi(y, si

, si−1, X))αi−1(y, si−1) (15)

βT (y, sT ) = 1 (16)

βi(y, si) = X

si+1

exp(w · φi+1(y, si+1, si

, X))βi+1(si+1) (17)

And

P(st

, st−1|y, X, w) ∝ αt−1(y, st−1) exp(w · φt(y, st

, st−1, X))βt(y, st) (18)

P(y|X, w) ∝

X

st

αt(y, st)βt(y, st) (19)

Note that the above equation are proportional. You will need to compute the exact values for optimization.

In general, the above formula works for any feature function φ(y, st

, st−1, X). One possibility is to use a

feature vector that resembles HMM. Suppose the dimension of feature vector for each time step of is d, the

number of states is k, and the number of classes is m. HMM-like feature is as follows:

φt(y, st

, st−1, X) = [Φs

(:); Φl

(:); Φt

(:)] (20)

Φ

s = zeros(d, k); Φs

(:, st) = X(:, t) (21)

Φ

l = zeros(k, m); Φl

(st

, y) = 1 % Observing st

in Class y (22)

Φ

t = zeros(k, k + 1, m); Φt

(st

, st−1, y) = 1 % Transition from st−1 to st

in Class y (23)

Note that s0 is a special state, it is always k + 1. The feature vector is concatenation of three vectors. Φ

s

encodes the fact that X(:, t) is associated with state st

. Φ

l

encodes that st belongs to class y. and Φ

t

encodes

State st−1 transition to State st

in class y.

One way to understand this feature vector is to look at the weight vector:

w = [Ws

d×k

(:);Wl

k×m(:);Wt

k×(k+1)×m(:)] (24)

The value Ws

(:, i)

T x is how likely an observation x belongs to State i. The value Wl

(i, y) is how likely

we observe States i in Class y. The value Wt

(i, j, y) is how likely we observe the transition from State j to

State i in Class y.

3

3.3 What are provided

To help you start, the following functions are provided:

1. Data and starter code:

https://dl.dropboxusercontent.com/u/3172165/mlCourse/hw4data.zip

2. Feature computation function class: HW4 HcrfFeat

3. Forward-backward algorithm: HW4 Hcrf.forwardBackward()

4. Function to compute the ∂ log Z(y,X)

∂w

: HW4 Hcrf.cmpDerOfLogZ()

5. Utility functions for log-sum-exponential trick and converting from unnormalized log probabilities to

normalized probabilities are given in HW4 Utils.m.

6. Run HW4 Hcrf.test cmpDerOfLogZ() for a demo of how to use some other functions.

7. MHW ChaLearnData.m contains functions to load and display data. Run showRandom to display a

random sequence of data; it’s the 3D coordinates of human joints over time. Use load3ClassData()

to load data for three classes 5,6,7. We will refer to this data as 3Classes dataset. Use

load20ClassData() to load data for all 20 classes, which will be referred to as 20Classes

dataset.

3.4 What to implement

1. Implement a function that returns the objective value and derivative with respect to w of the function

defined in Eq. 6.

2. Write code to learn the parameters of HCRF by optimizing the objective function (Eq. 6). To optimize

the function, you can use fminunc of Matlab. Set the ’Algorithm’ option to ‘quasi-newton’, and this

corresponds to using BFGS algorithm. To avoid your program to run for a long time, you might want

to limit the number of iterations by setting ‘MaxIterations’ (e.g., setting to 50 or 100). You might want

to set ‘Display’ to ‘iter-detailed’ to track the progress. A good explanation of BFGS can be found here

http://aria42.com/blog/2014/12/understanding-lbfgs. If you want, you can also

use SGD for optimization.

3. Write code for prediction and evaluation

3.5 What to report

1. (5 points) Derive the formula to compute the gradient of the training objective (Eq. 6) w.r.t. w. Your

formula must be based on the output of the forward-backward algorithm (i.e., α(·, ·), β(·, ·)) and the

derivative log Z(y,Xi

)

∂w

2. (10 points) Derive the formula to compute the training objective (Eq. 6) based on α(·, ·), β(·, ·).

3. (10 points) Use λ = 0.001 and learn a HCRF (i.e., w). Report the objective value (Eq. 6) both training

and validation data for 3Classes dataset.

4. (10 points) Report: (a) the training accuracy, (b) the validation accuracy, and (c) the confusion matrix

for validation data for 3Classes dataset

5. (10 points) Use your model to predict the label for the test set of the 3Classes dataset. Save the

predicted labels on a CSV file predTestLabels 3classes.csv. The order of predicted labels

should correspond to the order of the reviews in tstData 3classes.mat. Submit this

predTestLabels 3classes.csv file on Kaggle and report classification accuracy.

4

6. (10 bonus points) Submit predTestLabels 3classes.csv and enter our Kaggle competition

for fame. You must use your Stony Brook email to register on Kaggle. We will maintain a leader

board, and the top three entries at the end of the competition (due date) will receive 10 bonus points.

The ranking is based on classification accuracy in your PDF submission file. You must use HCRF.

You can use different feature types or different number of hidden states. You can use all training data

or combine train and validation data for training. You can also increase the number of iterations of

BFGS or use different optimization algorithms.

7. (10 bonus points) Can you handle all 20 classes? Submit predTestLabels 20classes.csv

and enter our Kaggle competition for fame and bonus point. The rules and restrictions for item 4

above also apply here.

4 What to submit?

4.1 Blackboard submission

You will need to submit both your code and your answers to questions on Blackboard. Do not submit the

provided data. Put the answer file and your code in a folder named: SUBID FirstName LastName (e.g.,

10947XXXX heeyoung kwon). Zip this folder and submit the zip file on Blackboard. Your submission must

be a zip file, i.e, SUBID FirstName LastName.zip. The answer file should be named: answers.pdf, and it

should contain:

1. Answers to Question 1 and 2

2. Answers to Question 3.5

4.2 Kaggle submission

For Questions 3.5.5, 3.5.6, and 3.4.7, you must submit a CSV file to get the classification accuracy from the

competition site inclass.kaggle.com/c/hw4-gesture-recognition-3classes and inclass.

kaggle.com/c/hw4-gesture-recognition-20classes. A submission file should contain two

columns: ImgId and Prediction. The file should contain a header and have the following format.

ID, Class

1, 3

2, 1

… …

(25)

A sample submission file is available from the competition site and our handout.

5 Cheating warnings

Don’t cheat. You must do the homework yourself, otherwise you won’t learn. You must use your SBU ID

as your file name for the competition. Do not fake your Stony Brook ID to bypass the submission limitation

per 24 hours. Doing so will be considered cheating.

References Cited

[1] S. B. Wang, A. Quattoni, L. P. Morency, D. Demirdjian, and T. Darrell. Hidden conditional random

fields for gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition, 2006.