# Homework 4 solution

\$29.99

Original Work
Category:

## Description

This homework contains 3 questions. The last question requires programming. The maximum number
of points is 100 plus 20 bonus points.
1 Manual calculation of one round of EM for a GMM [30 points]
(Extended version of: Murphy Exercise 11.7) In this question we consider clustering 1D data with a mixture
of 2 Gaussians using the EM algorithm. You are given the 1-D data points x = [1 10 20].
M step
Suppose the output of the E step is the following matrix:
R =

1 0
0.4 0.6
0 1

where entry Ri,c is the probability of observation xi belonging to cluster c (the responsibility of cluster c
for data point i). You just have to compute the M step. You may state the equations for maximum likelihood
estimates of these quantities (which you should know) without proof; you just have to apply the equations to
1. [3 points] Write down the likelihood function you are trying to optimize.
2. [6 points] After performing the M step for the mixing weights π1, π2, what are the new values?
3. [6 points] After performing the M step for the means µ1 and µ2, what are the new values?
4. [6 points] After performing the M step for the standard deviations σ1 and σ2, what are the new values?
E step
Now suppose the output of the M step is the answer to the previous section. You will compute the subsequent
E step.
1. [3 points] Write down the formula for the probability of observation xi belonging to cluster c.
2. [6 points] After performing the E step, what is the new value of R?
2 HMM with tied mixtures (25 points)
(Adapted from Murphy Exercise 13.4) Consider an HMM where the observation model has the form:
p(Ot
|Xt = j, θ) = X
K
k=1
wjkN (Ot
|µk, Σk) ∀j ∈ {1, · · · , M}. (1)
In this model, we assume there are M types of hidden states. The observation for each hidden state is given
above; it is a mixture of K Gaussians. However, the Gaussians are shared between the states, and the state
influences the mixing weights but not the means and covariances. This is called semi-continuous HMM or
tied-mixture HMM.
1
1. (5 points) List all the parameters of this HMM model
2. (10 points) Derive the E step. What do we need to estimate in the E step?
3. (10 points) Derive the M step. How do we update the parameters of the model?
3 (Implementation) Linear-Chain Hidden CRF for gesture recognition (45
points)
In this question, you will implement Hidden CRF  and use it for gesture recognition.
3.1 General Hidden CRF
Consider the general form of of Hidden CRF, which assumes the following probability model:
P(y, s, X|w) = 1
Z
exp(w · φ(y, s, X)) (2)
where X is the observed variable or set of variables, y is the target, and s is hidden variable(s). The parameter
vectors of the CRF is w, which is what we need to learn. Define:
Z(y, X) = X
s
exp(w · φ(y, s, X)) (3)
Z(X) = X
y
Z(y, X) (4)
A Hidden CRF models the conditional probability of a class label y given a set of observations X by:
P(y|X, w) = X
s
P(y, s|X, w) = X
s
P(y, s, X|w)
P(X|w)
=
Z(y, X)
Z(X)
(5)
Given a training set {(Xi
, yi
)}, we train the parameters by minimizing the regularized negative log likelihood:
L(w) = λ
2
||w||2 −
1
n
Xn
i=1
log P(y
i
|Xi
, w). (6)
To minimize this objective function, we can use a gradient-based optimization method. This requires computing
the gradient for a single training example:
∂ log P(y|X, w)
∂w
=
∂ log Z(y, X)
∂w

∂ log Z(X)
∂w
(7)
We have:
∂ log Z(y, X)
∂w
=
1
Z(y, X)
X
s
exp(w · φ(y, s, X))φ(y, s, X) = X
s
P(s|y, X, w)φ(y, s, X) (8)
We can also show
∂ log Z(X)
∂w
=
X
y
X
s
P(s|y, X, w)φ(y, s, X)
!
P(y|X, w) (9)
Thus to use gradient-based optimization technique, we need an efficient way to compute:
P
s P(s|y, X, w)φ(y, s, X) and P(y|X, w) for all y.
2
3.2 Linear-chain Hidden CRF
Consider the sequence classification problem where the target label is one of m classes, y ∈ {1, · · · , m}.
The observed variables X is a sequence of T frames X = X1:T . Each frame Xt corresponds to a hidden
state st
, which belong to one of k states. For convenience of notation, we introduce a special state s0 which
always assumes value k + 1. For the linear-chain CRF, the feature vector is defined as:
φ(y, s, X) = X
T
t=1
φt(y, st
, st−1, X) (10)
Then:
X
s
P(s|y, X, w)φ(y, s, X) = X
s
P(s|y, X, w)
X
t
φt(y, st
, st−1, X) (11)
=
X
t
X
s
P(s|y, X, w)φt(y, st
, st−1, X) (12)
=
X
t
X
st,st−1
P(st
, st−1|y, X, w)φt(y, st
, st−1, X) (13)
The forward-backward algorithm:
α1(y, s1) = exp(w · φ1(y, s1, s0, X)) (14)
αi(y, si) = X
si−1
exp(w · φi(y, si
, si−1, X))αi−1(y, si−1) (15)
βT (y, sT ) = 1 (16)
βi(y, si) = X
si+1
exp(w · φi+1(y, si+1, si
, X))βi+1(si+1) (17)
And
P(st
, st−1|y, X, w) ∝ αt−1(y, st−1) exp(w · φt(y, st
, st−1, X))βt(y, st) (18)
P(y|X, w) ∝
X
st
αt(y, st)βt(y, st) (19)
Note that the above equation are proportional. You will need to compute the exact values for optimization.
In general, the above formula works for any feature function φ(y, st
, st−1, X). One possibility is to use a
feature vector that resembles HMM. Suppose the dimension of feature vector for each time step of is d, the
number of states is k, and the number of classes is m. HMM-like feature is as follows:
φt(y, st
, st−1, X) = [Φs
(:); Φl
(:); Φt
(:)] (20)
Φ
s = zeros(d, k); Φs
(:, st) = X(:, t) (21)
Φ
l = zeros(k, m); Φl
(st
, y) = 1 % Observing st
in Class y (22)
Φ
t = zeros(k, k + 1, m); Φt
(st
, st−1, y) = 1 % Transition from st−1 to st
in Class y (23)
Note that s0 is a special state, it is always k + 1. The feature vector is concatenation of three vectors. Φ
s
encodes the fact that X(:, t) is associated with state st
. Φ
l
encodes that st belongs to class y. and Φ
t
encodes
State st−1 transition to State st
in class y.
One way to understand this feature vector is to look at the weight vector:
w = [Ws
d×k
(:);Wl
k×m(:);Wt
k×(k+1)×m(:)] (24)
The value Ws
(:, i)
T x is how likely an observation x belongs to State i. The value Wl
(i, y) is how likely
we observe States i in Class y. The value Wt
(i, j, y) is how likely we observe the transition from State j to
State i in Class y.
3
3.3 What are provided
1. Data and starter code:
https://dl.dropboxusercontent.com/u/3172165/mlCourse/hw4data.zip
2. Feature computation function class: HW4 HcrfFeat
3. Forward-backward algorithm: HW4 Hcrf.forwardBackward()
4. Function to compute the ∂ log Z(y,X)
∂w
: HW4 Hcrf.cmpDerOfLogZ()
5. Utility functions for log-sum-exponential trick and converting from unnormalized log probabilities to
normalized probabilities are given in HW4 Utils.m.
6. Run HW4 Hcrf.test cmpDerOfLogZ() for a demo of how to use some other functions.
7. MHW ChaLearnData.m contains functions to load and display data. Run showRandom to display a
random sequence of data; it’s the 3D coordinates of human joints over time. Use load3ClassData()
to load data for three classes 5,6,7. We will refer to this data as 3Classes dataset. Use
load20ClassData() to load data for all 20 classes, which will be referred to as 20Classes
dataset.
3.4 What to implement
1. Implement a function that returns the objective value and derivative with respect to w of the function
defined in Eq. 6.
2. Write code to learn the parameters of HCRF by optimizing the objective function (Eq. 6). To optimize
the function, you can use fminunc of Matlab. Set the ’Algorithm’ option to ‘quasi-newton’, and this
corresponds to using BFGS algorithm. To avoid your program to run for a long time, you might want
to limit the number of iterations by setting ‘MaxIterations’ (e.g., setting to 50 or 100). You might want
to set ‘Display’ to ‘iter-detailed’ to track the progress. A good explanation of BFGS can be found here
http://aria42.com/blog/2014/12/understanding-lbfgs. If you want, you can also
use SGD for optimization.
3. Write code for prediction and evaluation
3.5 What to report
1. (5 points) Derive the formula to compute the gradient of the training objective (Eq. 6) w.r.t. w. Your
formula must be based on the output of the forward-backward algorithm (i.e., α(·, ·), β(·, ·)) and the
derivative log Z(y,Xi
)
∂w
2. (10 points) Derive the formula to compute the training objective (Eq. 6) based on α(·, ·), β(·, ·).
3. (10 points) Use λ = 0.001 and learn a HCRF (i.e., w). Report the objective value (Eq. 6) both training
and validation data for 3Classes dataset.
4. (10 points) Report: (a) the training accuracy, (b) the validation accuracy, and (c) the confusion matrix
for validation data for 3Classes dataset
5. (10 points) Use your model to predict the label for the test set of the 3Classes dataset. Save the
predicted labels on a CSV file predTestLabels 3classes.csv. The order of predicted labels
should correspond to the order of the reviews in tstData 3classes.mat. Submit this
predTestLabels 3classes.csv file on Kaggle and report classification accuracy.
4
6. (10 bonus points) Submit predTestLabels 3classes.csv and enter our Kaggle competition
for fame. You must use your Stony Brook email to register on Kaggle. We will maintain a leader
board, and the top three entries at the end of the competition (due date) will receive 10 bonus points.
The ranking is based on classification accuracy in your PDF submission file. You must use HCRF.
You can use different feature types or different number of hidden states. You can use all training data
or combine train and validation data for training. You can also increase the number of iterations of
BFGS or use different optimization algorithms.
7. (10 bonus points) Can you handle all 20 classes? Submit predTestLabels 20classes.csv
and enter our Kaggle competition for fame and bonus point. The rules and restrictions for item 4
above also apply here.
4 What to submit?
4.1 Blackboard submission
You will need to submit both your code and your answers to questions on Blackboard. Do not submit the
provided data. Put the answer file and your code in a folder named: SUBID FirstName LastName (e.g.,
10947XXXX heeyoung kwon). Zip this folder and submit the zip file on Blackboard. Your submission must
be a zip file, i.e, SUBID FirstName LastName.zip. The answer file should be named: answers.pdf, and it
should contain:
1. Answers to Question 1 and 2
4.2 Kaggle submission
For Questions 3.5.5, 3.5.6, and 3.4.7, you must submit a CSV file to get the classification accuracy from the
competition site inclass.kaggle.com/c/hw4-gesture-recognition-3classes and inclass.
kaggle.com/c/hw4-gesture-recognition-20classes. A submission file should contain two
columns: ImgId and Prediction. The file should contain a header and have the following format.
ID, Class
1, 3
2, 1
… …
(25)
A sample submission file is available from the competition site and our handout.
5 Cheating warnings
Don’t cheat. You must do the homework yourself, otherwise you won’t learn. You must use your SBU ID
as your file name for the competition. Do not fake your Stony Brook ID to bypass the submission limitation
per 24 hours. Doing so will be considered cheating.
References Cited
 S. B. Wang, A. Quattoni, L. P. Morency, D. Demirdjian, and T. Darrell. Hidden conditional random
fields for gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2006.