Description
1 Objective
The objective is to implement the algorithms for evaluation and decoding of Hidden Markov
Models (HMMs) combined with Gaussian emission probability distributions. The lab is designed
in Python, but the same functions can be obtained in Matlab/Octave or using the Hidden Markov
Toolkit (HTK).
2 Task
The overall task is to implement and test methods for isolated word recognition:
• implement the forward algorithm,
• use it compute the log likelihood of spoken utterances given a Gaussian HMM
• perform isolated word recognition
• compare the Gaussian HMM to a GMM (Gaussian Mixture Model)
• implement the Viterbi algorithm, and use it to compute Viterbi path and likelihood
• compare and comment Viterbi and Forward likelihoods
• Optional: implement the Baum-Welch algorithm and use it to update the model parameters
In order to pass the lab, you will need to follow the steps described in this document, and
produce a report where you describe your work and answer the questions asked here. The
report should be submitted in the Assignment section in the course Web on KTH Social https:
//www.kth.se/social/course/DT2118/. One submission should be done for each group, clearly
stating all the members of that group.
3 Data
The speech data used in this lab is similar but not the same as in Lab 1. You can load the array
containing speech utterances with the commands:
import numpy as np
tidigits = np.load(‘lab2_tidigits.npz’)[‘tidigits’]
The data contains also MFCC features (mfcc key), but you are welcome to test how the algorithms perform on the MFCCs computed with your own code from Lab 1. Refer to the
instructions to Lab 1 for more information about the data structures.
Additionally, the lab2_models.npz file contains the parameters of the models you are going
to use to test your functions and the lab2_example.npz file contains an example that can be
used for debugging.
3.1 The models
Load the model file with:
models = np.load(‘lab2_models.npz’)[‘models’]
The models array has length 11 each element corresponding to one of the following digits:
[‘o’,’z’,’1′,’2′,’3′,’4′,’5′,’6′,’7′,’8′,’9′]. Each element contains the following keys:
models[0].keys()
[‘digit’, ‘pron’, ‘hmm’, ‘gmm’]
key description
digit: the digit the model refers to
pron: pronunciation of the digit in terms of phonemes (not needed in this exercise)
hmm: model parameters of a Gaussian HMM trained with Baum-Welch
gmm: model parameters of a GMM trained with EM
Both the HMMs and the GMMs were trained on the same data (different from the data in
the tidigits array) and were initialized the same way. Both use diagonal covariance matrices
for the Gaussian distributions. The size of the model (number of states in HMM and number of
Gaussians in GMM) depends on the digit and was obtained using three states for each phoneme
in models[i][‘pron’] plus three states for the initial and final silence in the utterance. For
example, the digit ‘6’ is pronounced as [‘s’, ‘ih’, ‘k’, ‘s’] and will therefore have (4 +
2) × 3 = 18 states.
The HMM model contains the following fields:
models[0][‘hmm’].keys()
[‘startprob’, ‘transmat’, ‘means’, ‘covars’]
key symbol description
startprob πi = P(z0 = si) probability to start in state i
transmat: aij = P(zn = sj |zn−1 = si) transition probability from state i to j
means: µid array of mean vectors (rows correspond to different states)
covars: σ
id array of variance vectors (rows correspond to different states)
The GMM models contain the following fields:
models[0][‘gmm’].keys()
[‘weights’, ‘means’, ‘covars’]
key symbol description
weights: wi weight of each Gaussian in the mixture
means: µid array of mean vectors (rows correspond to different Gaussians)
covars: σ
id array of variance vectors (rows correspond to different Gaussians)
Check the transition matrix and the start probability for the HMM models. What can you
say about their topology (transition structure)?
3.2 The example
Load the example file with:
example = np.load(‘lab2_example.npz’)[‘example’].item()
This is a dictionary containing all the fields as in the tidigits array, plus the following additional
fields:
example.keys()
[…, ‘gmm_obsloglik’, ‘gmm_loglik’, ‘hmm_obsloglik’, ‘hmm_logalpha’,
‘hmm_loglik’, ‘hmm_vloglik’]
Here is a description of each field. You will see how to use this information in the reminder
of these instructions. All the probabilities described below were obtained using the HMM model
in models[0][‘hmm’] and the GMM model in models[0][‘gmm’] (that is, the models for digit
‘o’) on the sequence of MFCC vectors in example[‘mfcc’]:
key idxs symbol description
gmm_obsloglik [i,j] log φj (xi) observation log likelihood for each Gaussian
in models[0][gmm], shape: (n_timesteps,
n_gaussians)
gmm_loglik – log P(X|θGMM) log likelihood of the observations sequence X
given the full GMM model, scalar
hmm_obsloglik [i,j] log φj (xi) observation log likelihood for each Gaussians
in models[0][hmm], shape: (n_timesteps,
n_states)
hmm_logalpha [i,j] log αj (xi) alpha log probabilities, see definition later,
shape: (n_timesteps, n_states)
hmm_logbeta [i,j] log βj (xi) beta log probabilities, see definition later, shape:
(n_timesteps, n_states)
hmm_loglik – log P(X|θHMM) log likelihood of the observations sequence X
given the HMM model, scalar
hmm_vloglik – log P(X|θHMM, Sopt) Viterbi log likelihood of the observations sequence X given the HMM model and the best
path, scalar
Figure 1 shows some of the relevant fields in example.
4 Multivariate Gaussian Density
The function log_multivariate_normal_density(…, ’diag’) from sklearn.mixture can
be used to compute
log_emlik[i, j] = log φj (xi) = log N(xi
, µj , Σj ) = log P(xi
|µj , Σj ),
that is the log likelihood for each observation xi and each term in a multivariate Gaussian density
function with means µj and diagonal covariance matrices Σj . In the case of Gaussian HMMs,
φj corresponds to the emission probability model for a single state j. In case of a GMM, φj
corresponds to a single Gaussian in the model, without considering the weights.
0
2
4
6
8
10
12
mfcc: MFCC
0
2
4
6
8
gmm_obsloglik: GMM component/observation log likelihood
0
2
4
6
8
hmm_obsloglik: HMM component/observation log likelihood
0
2
4
6
8
hmm_logalpha: log alpha
0
2
4
6
8
hmm_logbeta: log beta
Figure 1.
Verify that you get the same results as in example[‘hmm_obsloglik’] and
example[‘gmm_obsloglik’] when you apply the log_multivariate_normal_density function
to the example[‘mfcc’] data using the Gaussian distributions defined respectively in models[0][‘hmm’]
and models[0][‘gmm’]. Note that in these cases we are using neither the weights of the GMM,
nor the time dependency structure of the HMM, but only the Gaussian distributions.
Plot the log likelihood for Gaussians from HMMs and GMMs models corresponding to the
same digit on a test utterance of your choice. What can you say about the order of the Gaussians,
and the time evolution of the likelihood along the utterance? Remember that each utterance
starts and ends with silence.
5 GMM Likelihood and Recognition
Based on the output of the function in the previous section, write the gmmloglik function
(proto2.py) that computes the log likelihood of an observation sequence X = {x0, . . . , xN−1}
given the full GMM model, that is, given the weights of each Gaussian density in the GMM model.
Remember that the assumption in the GMM model is that the xn vectors are independent for
each n = [0, N)
1
. Whenever there is an expression involving the log of a sum of exponents
(log Pexp(.)), make use of the logsumexp function from tools2.py.
The output of the gmmloglik function, using the model parameters in models[0][‘gmm’] and
the observation sequence from example[‘mfcc’], should correspond to example[‘gmm_loglik’].
Use this function to score each of the 44 utterances in the tidigits array with each of the 11
GMM models in models[i][‘gmm’]. If we select for each utterance the model with the highest
log likelihood, how many digits are misrecognized?
6 HMM Likelihood and Recognition
6.1 Forward Algorithm
Write the function forward following the prototypes in proto2.py. The function should take as
input a lattice of emission probabilities as in the previous section, that is an array containing:
log_emlik[i, j] = log φj (xi)
The output is the array:
logalpha[i, j] = log αi(j)
where i corresponds to time steps and j corresponds to states in the model.
The recursion
formulae for the forward probabilities in log domain are given here, where we have used Python
convention with array indices going from 0 to len-1:
log α0(j) = log πj + log φj (x0)
log αn(j) = log M
X−1
i=0
exp
log αn−1(i) + log aij
!
+ log φj (xn)
In all the cases where there is an expression involving the log of a sum of exponents (log Pexp(.)),
make use of the logsumexp function from tools2.py.
1We are using the convention to express intervals with square brackets “[” if the extreme is included and
parentheses “)” if it is excluded. This corresponds to the range function in Python that includes the start value
but not the end value.
Apply your function to the example[‘mfcc’] utterance using the model parameters in
models[0][‘hmm’] and verify that you obtain the same log α array as in example[‘hmm_logalpha’].
The definitions of αn(j) in terms of probabilities of events are defined below (where θ =
{Π, A, Φ} are the model parameters):
αn(j) = P(x0, . . . , xn, zn = sj |θ)
Given the above definition, derive the formula to compute the likelihood P(X|θ) of the whole
sequence X = {x0, . . . , xN−1}, given the model parameters θ in terms of αn(j).
Hint: if the events Bj , j ∈ [0, M) form a disjoint partition of the sure event, that is:
P(Bi
, Bj ) = 0, ∀i, j, i 6= j, and
M
X−1
j=0
P(Bj ) = 1,
then it is true that P(A) = PM−1
j=0 P(A, Bj ), that is, we can marginalize out the variable
Bj .
Convert the formula you have derived into log domain. Verify that the log likelihood obtained
this way using the model parameters in models[0][‘hmm’] and the observation sequence in
example[‘mfcc’] is the same as example[‘hmm_loglik’].
Using your formula, score all the 44 utterances in tidigits with each of the 11 HMM
models in models. How many mistakes can you count if you take the maximum likelihood model
as winner? Compare these results with the GMM results obtained in the previous session.
Repeat the recognition using the Gaussian distributions in the HMM models as if they were
a GMM model with equal weights for each Gaussian. What can you say about the results you
obtain this way? Can you say something about the influence of the HMM transition model in
this particular task?
Plot the α lattice (that is the array of αs for each time step and state). What can you say
about the influence of the HMM topology on the first time steps? How about the last ones?
6.2 Viterbi Approximation
Here you will compute the log likelihood of the observation X given a HMM model and the best
sequence of states. This is called the Viterbi approximation. Implement the function viterbi
in proto2.py. The Viterbi recursion formulas are as follows:
log V0(j) = log πj + log φj (x0)
log Vn(j) = M−1
max
i=0
log Vn−1(i) + log aij
+ log φj (xn)
In order to recover the best path, you will also need an array storing the best previous path
for each time step and state (this needs to be defined only for n ∈ [1, N), that is not for the first
time step):
Bn(j) = arg M−1
max
i=0
log Vn−1(i) + log aij
Consult the course book [1], Section 8.2.3, to see the details of the implementation (note that
the book uses indices from 1, instead of 0).
Compute the Viterbi log likelihood for models[0][‘hmm’] and example[‘mfcc’], and verify
that you obtain the same result as in example[‘hmm_vloglik’].
Plot the alpha array that you obtained in the previous Section and overlay the best path
obtained by Viterbi decoding. Can you explain the reason why the path looks this way?
Using the Viterbi algorithm, score all the 44 utterances in tidigits with each of the 11
HMM models in models. How many mistakes can you count if you take the maximum likelihood
model as winner? Compare these results with the results obtained in previous sections.
6.3 Optional: Backward Algorithm
Write the function backward following the prototypes in proto2.py. Similarly to the function
forward in the previous section, the function should take as input a lattice of emission probabilities as in the previous section, that is an array containing:
log_emlik[i, j] = log φj (xi)
The output is the array:
logbeta[i, j] = log βi(j),
where i corresponds to time steps and j corresponds to states in the model. The recursion
formulae for the backward probabilities in log domain are given here, where we have used Python
convention with array indices going from 0 to len-1:
log βN−1(i) = 0
log βn(i) = log
M
X−1
j=0
exp
log aij + log φj (xn+1) + log βn+1(j)
Note also that the initialization of the βN−1(i) is different from the one defined in the course
book [1] (Section 8.2.4) but corresponds to the one used in [2].
In all the cases where there is an expression involving the log of a sum of exponents (log Pexp(.)),
make use of the logsumexp function in tools2.py.
Apply your function to the example[‘mfcc’] utterance using the model parameters in
models[0][‘hmm’] and verify that you obtain the log β arrays as in example[‘hmm_logbeta’].
The definitions of βn(j) in terms of probabilities of events are defined below (where θ =
{Π, A, Φ} are the model parameters):
βn(i) = P(xn+1, . . . , xN−1|zn = si
, θ)
Advanced: Derive the formula that computes P(X|θ) using the betas βn(i) instead of the
alphas.
Hint 1: only the β0(i) are needed for the calculation.
Hint 2: note that the definition of αn(j) is a joint probability of observations and state at
time step n whereas βn(i) is a conditional probability of the observations given the
state at time step n.
Hint 3: the calculation will not only involve the betas, but also some of the observation
likelihoods φj (xn) and some of the model parameters, πi or aij .
Verify that also this method gives you the same log likelihood and that they both areequal to example[‘hmm_loglik’].
References
[1] Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon. Spoken Language Processing: A Guide
to Theory, Algorithm and System Development. Prentice Hall PTR, 2001.
[2] L. R. Rabiner. “A tutorial on hidden Markov models and selected applications in speech
recognition”. In: Proceedings of the IEEE 77.2 (Feb. 1989), pp. 257–286.