## Description

### P1: Kernel PCA [5 points]

1. concetric.mat contains 152 data points, each of which is a 2-dimensional column vector. If you scatter-plot them in 2D, you'll see that they form the concentric circles you saw in class.

2. Do kernel PCA on this data set to transform it into a new 3-dimensional feature space, i.e., $X \in \mathbb{R}^{2\times 152} \Rightarrow \text{Kernel PCA} \Rightarrow Z \in \mathbb{R}^{3\times 152}$.

3. In theory, you have to do the centering and normalization procedures in the feature space, but for this particular data set you can skip them.
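As a concrete sketch of the transformation in step 2 (skipping the feature-space centering that step 3 allows), kernel PCA amounts to an eigendecomposition of the Gram matrix. The RBF kernel and its width `gamma` are my assumptions, not specified by the assignment, and synthetic concentric circles stand in for concetric.mat:

```python
import numpy as np

def kernel_pca(X, n_components=3, gamma=1.0):
    """Kernel PCA on a d x N data matrix X; returns Z, n_components x N."""
    # Pairwise squared distances between the columns of X
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2 * X.T @ X
    K = np.exp(-gamma * D2)                        # N x N RBF Gram matrix
    # (Feature-space centering is skipped, as the assignment allows.)
    vals, vecs = np.linalg.eigh(K)                 # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]    # pick the top eigenpairs
    alphas = vecs[:, idx] / np.sqrt(vals[idx])     # scale so projections are K @ alphas
    return (K @ alphas).T                          # n_components x N

# Synthetic stand-in for concetric.mat: 2 x 152, inner 51 points then outer 101
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 152)
r = np.where(np.arange(152) < 51, 1.0, 3.0)
X = np.vstack([r * np.cos(theta), r * np.sin(theta)])

Z = kernel_pca(X, n_components=3, gamma=0.5)
print(Z.shape)  # (3, 152)
```

With the real data, you would load `X` from the .mat file instead and tune `gamma` until the 3D scatter plot looks linearly separable.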

4. You remember how the after-transformation 3D features look, right? Now your data set is linearly separable. Train a perceptron that does this linear classification. In other words, your perceptron will take the transformed version of your data (a 3-dimensional vector) and do a linear combination with three weights $w = [w_1, w_2, w_3]^\top$ plus the bias term $b$:

   $$y_i = \sigma\!\left(w^\top Z_{:,i} + b\right), \tag{1}$$

   where $y_i$ is the scalar output of your perceptron and $\sigma$ is the activation function you define. If you choose the logistic function as your activation, then you'd better label your samples with 0 or 1. You know, the first 51 samples are for the inner circle, so their label will be 0, while I'd label the other, outer-ring samples with 1.

5. You'll need to write a backpropagation algorithm to estimate your parameters $w$ and $b$, which minimize the error between $y_i$ and $t_i$. There are a lot of choices, but you can use the sum of the squared errors: $\sum_i \frac{1}{2}(y_i - t_i)^2$. Note that this dummy perceptron doesn't have a hidden layer, because the nonlinearity is taken care of by the kernel method. Also note that you may want to initialize your parameters with small (uniform or Gaussian) random numbers centered around zero.
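For this single perceptron, backpropagation reduces to one hand-derived gradient step per epoch: with the logistic activation and SSE loss, $\partial E/\partial w = \sum_i (y_i - t_i)\, y_i (1 - y_i)\, Z_{:,i}$. A minimal sketch, where the learning rate, epoch count, and initialization scale are arbitrary choices, and a toy separable 3D data set stands in for the kernel PCA output:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_perceptron(Z, t, lr=0.1, epochs=2000, seed=0):
    """Fit w (3,) and b on Z (3 x N), targets t (N,), SSE loss, by hand."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0, 0.1, size=Z.shape[0])   # small init around zero (scale assumed)
    b = rng.normal(0, 0.1)
    for _ in range(epochs):
        y = sigmoid(w @ Z + b)                # forward pass, eq. (1), shape (N,)
        delta = (y - t) * y * (1 - y)         # dE/dz for SSE + logistic
        w -= lr * Z @ delta                   # dE/dw = sum_i delta_i * Z[:, i]
        b -= lr * delta.sum()                 # dE/db = sum_i delta_i
    return w, b

# Toy linearly separable 3D data standing in for the kernel PCA output
rng = np.random.default_rng(1)
t = np.array([0] * 51 + [1] * 101)            # labels per the assignment
Z = rng.normal(0, 0.3, (3, 152))
Z[0] += t * 2.0                               # first coordinate separates the classes

w, b = train_perceptron(Z, t)
acc = np.mean((sigmoid(w @ Z + b) > 0.5) == t)
print(acc)
```

On the real kernel PCA output, you would feed your $Z$ matrix and the 0/1 labels in directly.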

6. Your training procedure will be sensitive to the choice of the learning rate and the initialization scheme (the range of your random numbers), so you may want to try out a few different choices. Good luck!

7. Do NOT use TensorFlow's or PyTorch's automatic gradient computation feature for this. You have to write your own backpropagation routine in NumPy.

### P2: Neural Networks [5 points]

1. Build a neural network that has a single hidden layer for the concentric-circles problem. Instead of taking the kernel PCA results as the input, your neural network will take the raw $X$ matrix as the input. Therefore, instead of the perceptron with 3 input units from the previous problem, your neural network will have 2 input units, each of which takes one of the coordinates of a sample.

2. As we skip kernel PCA, we need to implement the nonlinear part within the network. To this end, we will convert each 2D data point into a 3D feature vector first, a procedure that corresponds to the first layer of your neural network:

   $$X^{(2)}_{:,i} = \sigma\!\left(W^{(1)} X^{(1)}_{:,i} + b^{(1)}\right), \tag{2}$$

   where $X^{(1)}_{:,i} \in \mathbb{R}^{2\times 1}$ is the input to the network (the $i$-th column vector of $X$), while $X^{(2)}_{:,i} \in \mathbb{R}^{3\times 1}$ is the output of the hidden units. That means the weight matrix $W^{(1)} \in \mathbb{R}^{3\times 2}$ and the bias vector $b^{(1)} \in \mathbb{R}^{3\times 1}$.

3. In the next (i.e., final) layer, you'll do the usual linear classification on $X^{(2)}_{:,i}$, which will be similar to the perceptron you developed in Problem 1, although this time the input to the perceptron is $X^{(2)}_{:,i}$, not the kernel PCA results:

   $$y_i = \sigma\!\left(w^{(2)\top} X^{(2)}_{:,i} + b^{(2)}\right), \tag{3}$$

   where $w^{(2)}$ and $b^{(2)}$ correspond to $w$ and $b$ in Problem 1, respectively.

4. Note that you train these two layers altogether. Your backpropagation should work from the final layer back to the first layer. Eventually, you want to estimate these parameters with your backpropagation: $W^{(1)} \in \mathbb{R}^{3\times 2}$, $w^{(2)} \in \mathbb{R}^{3\times 1}$, $b^{(1)} \in \mathbb{R}^{3\times 1}$, and $b^{(2)} \in \mathbb{R}$.
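The two-layer training loop above can be sketched as follows. The learning rate, epoch count, and initialization scale are guesses you would tune, and synthetic concentric circles again stand in for the real data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_net(X, t, lr=0.05, epochs=3000, seed=0):
    """Train a 2-3-1 network on X (2 x N), targets t (1 x N), SSE loss."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (3, 2))              # W^(1): 3 x 2 (init scale assumed)
    b1 = rng.normal(0, 0.5, (3, 1))              # b^(1): 3 x 1
    w2 = rng.normal(0, 0.5, (3, 1))              # w^(2): 3 x 1
    b2 = rng.normal(0, 0.5)                      # b^(2): scalar
    losses = []
    for _ in range(epochs):
        H = sigmoid(W1 @ X + b1)                 # hidden layer, eq. (2): 3 x N
        y = sigmoid(w2.T @ H + b2)               # output layer, eq. (3): 1 x N
        losses.append(0.5 * np.sum((y - t) ** 2))
        d2 = (y - t) * y * (1 - y)               # error at the output, 1 x N
        d1 = (w2 @ d2) * H * (1 - H)             # error backpropagated to hidden, 3 x N
        w2 -= lr * H @ d2.T                      # gradient updates, final layer first
        b2 -= lr * d2.sum()
        W1 -= lr * d1 @ X.T
        b1 -= lr * d1.sum(axis=1, keepdims=True)
    return (W1, b1, w2, b2), losses

# Concentric-circles stand-in: inner 51 points labeled 0, outer 101 labeled 1
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 152)
r = np.where(np.arange(152) < 51, 1.0, 3.0)
X = np.vstack([r * np.cos(theta), r * np.sin(theta)])
t = (np.arange(152) >= 51).astype(float)[None, :]

params, losses = train_net(X, t)
```

Watching `losses` over the epochs is a quick sanity check that the hand-derived gradients are correct: the curve should decrease steadily rather than oscillate.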

5. Again, do NOT use any of the deep learning frameworks, such as TensorFlow or PyTorch, to get around the differentiation. Write up your own backpropagation routine and update algorithms.

### P3: Spoken MNIST [5 points]

1. Download the spoken digit dataset from here: https://zenodo.org/record/1342401#.YjyUfiB0UE.

2. In the recordings subfolder, you will see a bunch of .wav files. The encoding scheme is like this: [digit class]_[subject name]_[example number].wav. For example, 7_jackson_30.wav means the 30th recording of Jackson pronouncing the number 7. There are four subjects, and each speaks each digit 50 times. Today, we will focus on Jackson's recordings. Take a listen to them to hear what they sound like.

3. Divide Jackson's recordings into training and testing sets. Since there are 50 recordings per digit, use 45 of them for training and 5 of them for testing. Then there will be 450 training examples in total across all 10 digits and 50 for testing.
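A sketch of that split, assuming the underscore naming convention above and taking the first 45 examples of each digit for training (the assignment doesn't fix which 5 go to testing):

```python
import os
from collections import defaultdict

def split_jackson(recordings_dir):
    """Split Jackson's recordings into 45 train / 5 test files per digit."""
    by_digit = defaultdict(list)
    for name in sorted(os.listdir(recordings_dir)):
        if not name.endswith(".wav"):
            continue
        digit, subject, example = name[:-4].split("_")   # e.g. 7_jackson_30
        if subject == "jackson":
            by_digit[int(digit)].append(os.path.join(recordings_dir, name))
    train, test = {}, {}
    for digit, files in by_digit.items():
        train[digit] = files[:45]     # first 45 for training (split choice assumed)
        test[digit] = files[45:]      # remaining 5 for testing
    return train, test
```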

4. Train 10 HMM models, one for each digit class. You will use the 45 training recordings of one digit class to train that class's HMM model, and so on.

5. This time, I won't ask you to implement the whole HMM learning algorithm, as it is quite complicated. But I want you to know the basic mechanism of how a small-scale speech recognition system works. So, let's use a Python HMM toolbox, which you can find here: https://hmmlearn.readthedocs.io/en/latest/.

6. There are various HMM versions you can choose from, but let's use GMMHMM, which might be the most popular choice anyway. I briefly mentioned it at the end of the HMM module; it's basically a variant of HMM whose emission probabilities are defined by a GMM. If you think of an HMM as an elaborate GMM with a complicated prior (i.e., the transition probabilities), a GMM-HMM can be seen as a GMM model working on top of another GMM model, necessitating nested EM-like algorithms. Anyway, you don't have to worry about that today.

7. First, for digit class $C$, let's convert each of the 45 utterances into an MFCC spectrogram. You will use a toolbox for this: librosa.feature.mfcc. As a result, you produce $X^{(C)}_i \in \mathbb{R}^{20\times T^{(C)}_i}$, where $T^{(C)}_i$ is the number of MFCC vectors of the $i$-th utterance in digit class $C$. Store both the MFCC matrix $X^{(C)}_i$ and the length information $T^{(C)}_i$. Repeat this for all 45 utterances.

8. In theory, you can feed the 45 utterances one by one to your HMM training algorithm, but the HMM toolbox you use has a more elegant way to deal with the situation. You will concatenate all the MFCC matrices from the 45 utterances into one big matrix and feed it just once: $X^{(C)} = [X^{(C)}_1, X^{(C)}_2, \ldots, X^{(C)}_{45}]$. The only thing is, the algorithm doesn't know the boundaries between different utterances. Once again, the toolbox provides a nice way to inform the algorithm of this: feed a vector of per-utterance lengths, i.e., $T^{(C)} = [T^{(C)}_1, T^{(C)}_2, \ldots, T^{(C)}_{45}]$. Take a look at the manual of the fit function of the GMMHMM class here: https://hmmlearn.readthedocs.io/en/latest/api.html#hmmlearn.hmm.GMMHMM.

9. So, you first need to initialize a GMMHMM instance and call its fit function to train it. When you initialize it, you can choose to change a few hyperparameters, such as the number of iterations, the tolerance threshold (for the stopping criterion), and the numbers of hidden states and GMM latent components (small numbers work fine for these), etc. I adjusted them to let the learning algorithm run long enough to converge.

10. Repeat this process for all digit classes $C \in \{0, 1, \ldots, 9\}$. Now you have 10 HMM models, one trained on each of the digit classes.

11. Convert your test utterances into MFCC matrices in the same way as above. Now you have 5 utterances per class to test on. Feed all five of them (as one big matrix, along with the length information) to each of the 10 HMM instances for testing. For this, you will use the score function, which gives you the log-likelihood score. For each digit-specific test set, you have 10 numbers from the 10 HMM instances; the one that gives you the largest number is the digit class the test set belongs to. Report these numbers.
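The scoring loop can be sketched as below, again transposing and concatenating each digit's 5 test matrices before calling `score` (the dict layout for models and test data is my choice, not prescribed):

```python
import numpy as np

def classify_test_sets(models, test_mfccs):
    """models: digit -> fitted GMMHMM; test_mfccs: digit -> list of 20 x T MFCCs.
    Scores each digit's concatenated test set under every model; the model with
    the largest log-likelihood wins."""
    preds = {}
    for true_digit, mats in test_mfccs.items():
        X = np.concatenate([m.T for m in mats], axis=0)
        lengths = [m.shape[1] for m in mats]
        scores = {d: models[d].score(X, lengths) for d in models}
        preds[true_digit] = max(scores, key=scores.get)
    return preds
```

Printing `scores` inside the loop gives you the 10 numbers per test set that the assignment asks you to report.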

12. Draw a confusion matrix whose x-axis is the ground-truth digit class label and whose y-axis is the prediction. Hence, for a given row of the confusion matrix, the brightest pixel should be on the diagonal element if the classification is accurate. Submit this confusion matrix plot, too.
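The plot can be produced with matplotlib's imshow; the orientation follows the text (ground truth on the x-axis, prediction on the y-axis), and the output file name is arbitrary:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                  # render without a display
import matplotlib.pyplot as plt

def plot_confusion(preds, path="confusion.png"):
    """preds maps each ground-truth digit to its predicted digit."""
    M = np.zeros((10, 10))
    for truth, pred in preds.items():
        M[pred, truth] += 1            # row = prediction, column = ground truth
    plt.imshow(M, cmap="gray")
    plt.xlabel("ground-truth digit class")
    plt.ylabel("predicted digit class")
    plt.colorbar()
    plt.savefig(path)
    plt.close()
```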
