## Description

## 1 Objective

The objective is to experiment with different features commonly used for speech analysis and recognition. The lab is designed in Python, but the same functions can be obtained in Matlab/Octave or using the Hidden Markov Toolkit (HTK). In Appendix A, a reference table is given indicating the correspondence between the different systems.

## 2 Task

• compute MFCC features step-by-step

• examine features

• evaluate correlation between features

• compare utterances with Dynamic Time Warping

• illustrate the discriminative power of the features with respect to words

• perform hierarchical clustering of utterances

• train and analyze a Gaussian Mixture Model of the feature vectors.

In order to pass the lab, you will need to follow the steps described in this document, and present your results to a teaching assistant. Use Canvas to book a time slot for the presentation. Remember that the goal is not to show your code, but rather to show that you have understood all the steps.

## 3 Data

The files tidigits.npz and example.npz contain the data to be used for this exercise.[1] The files contain two arrays: tidigits and example.[2]

[1] Note on Python 3: the file formats between Python 2 and 3 are not compatible. If you use version 3, use the files tidigits_python3.npz and example_python3.npz instead.

[2] If you wish to use Matlab/Octave instead of Python, use the provided py2mat.py script to convert to Matlab format. Load the file with load tidigits or load example. You will load two cell arrays with the corresponding data stored in structures.

### 3.1 example

The array example can be used for debugging because it contains calculations of all the steps in Section 4 for one utterance. It can be loaded with:

```python
import numpy as np

example = np.load('example.npz')['example'].item()
```

The element example is a dictionary with the following keys:

samples: speech samples for one utterance

samplingrate: sampling rate

frames: speech samples organized in overlapping frames

preemph: pre-emphasized speech samples

windowed: Hamming-windowed speech samples

spec: squared absolute value of Fast Fourier Transform

mspec: natural log of spec multiplied by Mel filterbank

mfcc: Mel Frequency Cepstrum Coefficients

lmfcc: Liftered Mel Frequency Cepstrum Coefficients

Figure 1 shows the content of the elements in example.

### 3.2 tidigits

The array tidigits contains a small subset of the TIDIGITS database (https://catalog.ldc.upenn.edu/LDC93S10) consisting of a total of 44 spoken utterances from one male and one female speaker.[3] The file was generated with the script tidigits.py.[4] For each speaker, 22 speech files are included containing two repetitions of isolated digits (eleven words: “oh”, “zero”, “one”, “two”, “three”, “four”, “five”, “six”, “seven”, “eight”, “nine”). You can read the file from Python with:

```python
tidigits = np.load('tidigits.npz')['tidigits']
```

The variable tidigits is an array of dictionaries. Each element contains the following keys:

filename: filename of the wave file in the database

samplingrate: sampling rate of the speech signal (20kHz in all examples)

gender: gender of the speaker for the current utterance (man, woman)

speaker: speaker ID for the current utterance (ae, ac)

digit: digit contained in the current utterance (o, z, 1, …, 9)

repetition: whether this was the first (a) or second (b) repetition

samples: array of speech samples

## 4 Mel Frequency Cepstrum Coefficients step-by-step

Follow the steps below to compute MFCCs. Use the example array to double-check that your calculations are right.

You need to implement the functions specified by the headers in proto.py. Once you have done this, you can use the function mfcc in tools.py to compute MFCC coefficients in one go.

[3] The complete database contains recordings from 225 speakers.

[4] The script is included only for reference in case you need to use the full database in the future. In that case, you will need access to the KTH AFS file system.

Figure 1. Evaluation of MFCCs step-by-step

### 4.1 Enframe

Implement the enframe function in proto.py. It takes as input the speech samples, the frame length in samples, and the number of samples of overlap between consecutive frames, and outputs a two-dimensional array where each row is a frame of samples. Consider only the frames that fit entirely into the original signal, disregarding extra samples.

Apply the enframe function to the utterance example['samples'] with a window length of 20 milliseconds and a shift of 10 ms (figure out the length and shift in samples from the sampling rate, and write it in the lab report). Use the pcolormesh function from matplotlib.pyplot to plot the resulting array. Verify that your result corresponds to the array in example['frames'].
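The framing logic can be sketched with plain numpy. This is only an illustration, not the required proto.py implementation; the frame length and shift below are arbitrary small numbers, and for real data they must be derived from the sampling rate as the exercise asks:

```python
import numpy as np

def enframe(samples, winlen, winshift):
    """Slice a 1-D signal into overlapping frames of winlen samples,
    starting every winshift samples. Frames that do not fit entirely
    inside the signal are dropped."""
    nframes = 1 + (len(samples) - winlen) // winshift
    return np.array([samples[i * winshift:i * winshift + winlen]
                     for i in range(nframes)])

signal = np.arange(10)
frames = enframe(signal, winlen=4, winshift=2)
# frames[0] is samples 0..3, frames[1] is samples 2..5, and so on
```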

### 4.2 Pre-emphasis

Implement the preemp function in proto.py. To do this, define a pre-emphasis filter with pre-emphasis coefficient 0.97 using the lfilter function from scipy.signal. Explain how you defined the filter coefficients. Apply the filter to each frame in the output from the enframe function. This should correspond to the example['preemph'] array.
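As a sketch of what the filter does, the same first-order FIR operation can be written directly in numpy; it is equivalent to scipy.signal.lfilter([1, -p], [1], frames) applied along each row (the coefficient choice should be verified against example['preemph']):

```python
import numpy as np

def preemp(frames, p=0.97):
    """First-order FIR pre-emphasis, y[n] = x[n] - p * x[n-1],
    applied independently to each frame (row)."""
    x = np.asarray(frames, dtype=float)
    y = x.copy()
    y[:, 1:] -= p * x[:, :-1]
    return y

frames = np.array([[1.0, 1.0, 1.0, 1.0]])
emphasized = preemp(frames)   # [[1.0, 0.03, 0.03, 0.03]]
```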

### 4.3 Hamming Window

Implement the windowing function in proto.py. To do this, define a Hamming window of the correct size using the hamming function from scipy.signal with the extra option sym=False.[5] Plot the window shape and explain why this windowing should be applied to the frames of the speech signal. Apply the Hamming window to the pre-emphasized frames of the previous step. This should correspond to the example['windowed'] array.
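The periodic Hamming window can also be generated directly from its definition; the formula below matches scipy.signal.hamming(N, sym=False) and is given here only as an illustrative sketch:

```python
import numpy as np

def windowing(frames):
    """Multiply each frame by a periodic (sym=False) Hamming window:
    w[n] = 0.54 - 0.46 * cos(2 * pi * n / N), for n = 0..N-1."""
    N = frames.shape[1]
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / N)
    return frames * w

frames = np.ones((2, 8))
windowed = windowing(frames)  # first sample of each frame becomes 0.08
```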

### 4.4 Fast Fourier Transform

Implement the powerSpectrum function in proto.py. To do this, compute the Fast Fourier Transform (FFT) of the input using the fft function from scipy.fftpack, and then the squared modulus of the result. Apply your function to the windowed speech frames, with an FFT length of 512 samples. Plot the resulting power spectrogram with pcolormesh. Beware of the fact that the FFT bins correspond to frequencies that go from 0 to fmax and back to 0. What is fmax in this case according to the Sampling Theorem? The array should correspond to example['spec'].
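A sketch using numpy's FFT, which for this purpose behaves like scipy.fftpack.fft (both zero-pad each frame to the requested transform length):

```python
import numpy as np

def powerSpectrum(frames, nfft=512):
    """Squared modulus of the nfft-point FFT of each frame.
    Frames shorter than nfft are zero-padded."""
    return np.abs(np.fft.fft(frames, nfft)) ** 2

frames = np.ones((1, 4))
pspec = powerSpectrum(frames, nfft=8)
# Note the symmetry: bin k and bin nfft-k carry the same power,
# which is why frequencies run from 0 to fmax and back to 0
```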

### 4.5 Mel filterbank log spectrum

Implement the logMelSpectrum function in proto.py. Use the trfbank function, provided in the tools.py file, to create a bank of triangular filters linearly spaced on the Mel frequency scale. Plot the filters on a linear frequency scale. Describe the distribution of the filters along the frequency axis. Apply the filters to the output of the power spectrum from the previous step for each frame and take the natural log of the result. Plot the resulting filterbank outputs with pcolormesh. This should correspond to the example['mspec'] array.
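The actual filterbank is given by trfbank in tools.py; the sketch below only shows the general idea, using one common Hz-to-Mel convention (an assumption — the provided trfbank may use different break frequencies and normalization, so check against example['mspec']):

```python
import numpy as np

def hz2mel(f):
    # one common Mel formula; trfbank may use a different one
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel2hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def trfbank_sketch(samplingrate, nfft, nfilters=40):
    """Triangular filters with edges equally spaced on the Mel scale,
    returned as an [nfilters x nfft//2+1] matrix."""
    edges_hz = mel2hz(np.linspace(hz2mel(0.0), hz2mel(samplingrate / 2.0),
                                  nfilters + 2))
    bins = np.floor((nfft + 1) * edges_hz / samplingrate).astype(int)
    fbank = np.zeros((nfilters, nfft // 2 + 1))
    for i in range(nfilters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):          # rising slope
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling slope
            fbank[i, k] = (r - k) / max(r - c, 1)
    return fbank

def logMelSpectrum(pspec, fbank):
    """Natural log of the filterbank outputs; only the first
    nfft//2+1 FFT bins (0..fmax) are used."""
    half = pspec[:, :fbank.shape[1]]
    return np.log(np.maximum(half @ fbank.T, np.finfo(float).tiny))

fbank = trfbank_sketch(samplingrate=20000, nfft=512, nfilters=40)
mspec = logMelSpectrum(np.ones((3, 512)), fbank)
```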

### 4.6 Cosine Transform and Liftering

Implement the cepstrum function in proto.py. To do this, apply the Discrete Cosine Transform (dct function from scipy.fftpack.realtransforms) to the outputs of the filterbank. Use coefficients from 0 to 12 (13 coefficients). Then apply liftering using the function lifter in tools.py. This last step is used to correct the range of the coefficients. Plot the resulting coefficients with pcolormesh. These should correspond to example['mfcc'] and example['lmfcc'] respectively.

[5] The meaning of this option is beyond the scope of this course, but you should use it if you want to get the same results as in the example.
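The DCT step can also be written out explicitly; the matrix below implements DCT-II with the norm='ortho' scaling of scipy.fftpack.realtransforms.dct. Whether this is the exact normalization the lab expects should be checked against example['mfcc']; liftering is left to the provided lifter function:

```python
import numpy as np

def cepstrum(mspec, ncep=13):
    """DCT-II (orthonormal scaling) along each row, truncated to the
    first ncep coefficients. Liftering is done separately with
    tools.lifter."""
    M = mspec.shape[1]
    n = np.arange(M)
    k = np.arange(M)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * M))
    c = 2.0 * mspec @ basis.T
    c[:, 0] *= np.sqrt(1.0 / (4 * M))   # norm='ortho' scaling
    c[:, 1:] *= np.sqrt(1.0 / (2 * M))
    return c[:, :ncep]

# The orthonormal DCT of a constant vector puts all energy in c0
coeffs = cepstrum(np.ones((1, 8)), ncep=8)
```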

Once you are sure all the above steps are correct, use the mfcc function (tools.py) to compute the liftered MFCCs for all the utterances in the tidigits array. Observe the differences between utterances.

## 5 Feature Correlation

Concatenate all the MFCC frames from all utterances in the tidigits array into a big feature [N × M] array, where N is the total number of frames in the data set and M is the number of coefficients. Then compute the correlation coefficients between features and display the result with pcolormesh. Are the features correlated? Is the assumption of diagonal covariance matrices for Gaussian modelling justified? Compare the results you obtain for the MFCC features with those obtained with the Mel filterbank features ('mspec' features).
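The stacking and correlation step might look like this; mfccs here is a hypothetical list of per-utterance [frames × 13] arrays (random stand-ins for the real tidigits features):

```python
import numpy as np

# hypothetical per-utterance MFCC arrays (random stand-ins)
rng = np.random.default_rng(0)
mfccs = [rng.normal(size=(50, 13)) for _ in range(4)]

data = np.vstack(mfccs)                 # [N x M]: N frames, M coefficients
corr = np.corrcoef(data, rowvar=False)  # [M x M] feature correlation matrix
# matplotlib.pyplot.pcolormesh(corr) would display it
```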

## 6 Comparing Utterances

Given two utterances of length N and M respectively, compute an [N × M] matrix of local Euclidean distances between each MFCC vector in the first utterance and each MFCC vector in the second utterance.

Write a function called dtw (proto.py) that takes as input this matrix of local distances and outputs the result of the Dynamic Time Warping algorithm. The main output is the global distance between the two sequences (utterances), but you may also want to output the best path for debugging reasons.
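One possible shape for such a function, using standard dynamic programming over an accumulated-distance matrix, is sketched below (the required signature is in proto.py, and the path backtracking is omitted here):

```python
import numpy as np

def dtw(local_dist):
    """Global DTW distance for an [N x M] matrix of local distances,
    allowing steps (i-1,j), (i,j-1) and (i-1,j-1)."""
    N, M = local_dist.shape
    acc = np.full((N + 1, M + 1), np.inf)  # accumulated distances
    acc[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            acc[i, j] = local_dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[N, M]
```

The global distance is often normalized by N + M so that utterances of different lengths can be compared fairly.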

For each pair of utterances in the tidigits array:

1. compute the local Euclidean distances between MFCC vectors in the first and second utterance
2. compute the global distance between the utterances with the dtw function you have written

Store the global pairwise distances in a matrix D (44 × 44). Display the matrix with pcolormesh. Compare distances within the same digit and across different digits. Does the distance separate the digits well, even between different speakers?

Run hierarchical clustering on the distance matrix D using the linkage function from scipy.cluster.hierarchy. Use the "complete" linkage method. Display the results with the dendrogram function from the same library, and comment on them. Use the tidigit2labels function (tools.py) to create labels to add to the dendrogram, to simplify the interpretation of the results.
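A minimal sketch of the clustering call on a toy 4 × 4 distance matrix standing in for D (note that linkage expects a condensed distance vector, hence squareform; drawing the dendrogram additionally requires matplotlib):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# toy symmetric distance matrix: items 0,1 are close, items 2,3 are close
D = np.array([[0.0, 1.0, 8.0, 9.0],
              [1.0, 0.0, 8.5, 9.5],
              [8.0, 8.5, 0.0, 1.5],
              [9.0, 9.5, 1.5, 0.0]])

# linkage works on the condensed (upper-triangular) form of D
Z = linkage(squareform(D), method='complete')
labels = fcluster(Z, t=2, criterion='maxclust')  # flatten into 2 clusters
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree
```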

## 7 Explore Speech Segments with Clustering

Train a Gaussian mixture model with sklearn.mixture.GMM. Vary the number of components, for example: 4, 8, 16, 32. Consider utterances containing the same words and observe the evolution of the GMM posteriors. Can you say something about the classes discovered by the unsupervised learning method? Do the classes roughly correspond to the phonemes you expect to compose each word? Are those classes a stable representation of the word if you compare utterances from different speakers? As an example, plot and discuss the GMM posteriors for the model with 32 components for the four occurrences of the word “seven” (utterances 16, 17, 38, and 39).
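Note that sklearn.mixture.GMM was renamed to GaussianMixture in later scikit-learn versions. A sketch of training and posterior extraction on synthetic two-cluster data (a stand-in for real MFCC frames):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# two well-separated synthetic clusters standing in for MFCC frames
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
                    rng.normal(5.0, 0.5, size=(100, 2))])

# diagonal covariances, as discussed in the feature-correlation section
gmm = GaussianMixture(n_components=2, covariance_type='diag',
                      random_state=0).fit(frames)
posteriors = gmm.predict_proba(frames)   # [n_frames x n_components]
```

Plotting posteriors over time (one curve per component) shows which component is "responsible" for each frame of an utterance.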

## A Alternative Software Implementations

Although this lab has been designed to be carried out in Python, several implementations of speech-related functions are available.

### A.1 Matlab/Octave Instructions

The Matlab signal processing toolbox is one of the most complete pieces of signal processing software available. Many speech-related functions are, however, implemented in third-party toolboxes. The most complete are Voicebox,[6] which is more oriented towards speech technology, and the Auditory Toolbox,[7] which is more focused on human auditory models.

If you use Octave instead of Matlab, make sure you have the following extra packages (in parentheses are the names of the corresponding apt-get packages for Debian-based GNU/Linux distributions; all packages are already installed on CSC Ubuntu machines):

• signal (octave-signal)

### A.2 Hidden Markov Models Toolkit (HTK)

HTK is a powerful toolkit developed by Cambridge University for performing HMM-based speech recognition experiments. The HTK package is available at all CSC Ubuntu stations, or can be downloaded for free at http://htk.eng.cam.ac.uk/ after registration on the site. Its manual, the HTK Book, can be downloaded separately. In spite of being open source and free of charge, HTK is unfortunately not free software in the Free Software Foundation sense, because neither its original form nor its modifications can be freely distributed. Please refer to the license agreement for more information.

The HTK commands that are relevant to this exercise are the following:

HCopy: feature extraction tool. Can read audio files or feature files in HTK format and outputs HTK format files.

HList: terminal-based visualization of features. Reads HTK format feature files and displays information about them.

General options are:

• -C config: reads the configuration file config

• -S filelist: reads list of files to process from filelist

For a complete list of options and usage information, run the commands without arguments.

Hint: HList -r …: the -r option of HList will output the feature data in raw (ASCII) format. This makes it easy to import the features into other programs such as Python, Matlab or R.

Table 2 lists a number of possible spectral features and the corresponding HTK codes to be

used in HCopy or HList.

[6] http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html

[7] http://amtoolbox.sourceforge.net/

| Feature name | Matlab | Python |
|---|---|---|
| Linear filter | filter | scipy.signal.lfilter |
| Hamming window | hamming | scipy.signal.hamming |
| Fast Fourier Transform | fft | scipy.fftpack.fft |
| Discrete Cosine Transform | dct | scipy.fftpack.realtransforms.dct |
| Gaussian Mixture Model | gmdistribution | sklearn.mixture.GMM |
| Hierarchical clustering | linkage | scipy.cluster.hierarchy.linkage |
| Dendrogram | dendrogram | scipy.cluster.hierarchy.dendrogram |
| Plot lines | plot | matplotlib.pyplot.plot |
| Plot arrays | image, imagesc | matplotlib.pyplot.pcolormesh |

Table 1. Mapping between Matlab and Python functions used in this exercise

| Feature name | HTK code |
|---|---|
| linear filter-bank parameters | MELSPEC |
| log filter-bank parameters | FBANK |
| Mel-frequency cepstral coefficients | MFCC |
| linear prediction coefficients | LPC |

Table 2. Feature extraction in HTK. The HCopy executable can be used to generate features from a wave file to a feature file. HList can be used to output the features in text format to stdout, for easy import into other systems.