## Description

P1: DCT and PCA [5 points]

1. I like walking in the B-Line trail (although it doesn’t mean that I have time to walk there

frequently). IMG 1878.JPG is the photo I took there. Load it and divide the 3D array

(1024 × 768 × 3) into the three channels. Let’s call them XR, XG, and XB.

2. Randomly choose a block of 8 consecutive (entire) rows from XR, e.g. XR

(113:120,1:768) (See

the red box in Figure 1). This will be a matrix of 8 ×768. Collect another 2 such blocks, each

of which starts from a randomly chosen first row position. Move on to the green channel and

extract another three 8 × 768 blocks. Blue channel, too. You collected 9 blocks from all three

channels. Now, concatenate all of them horizontally. This will be a matrix of 8 × 6912 pixels.

Let’s call it R.

3. Subtract the mean vector of size 8 × 1 from all 6912 vectors.

4. Calculate the covariance matrix, which will be an 8 × 8 matrix.

5. Do eigendecomposition on the covariance matrix (feel free to use a toolbox) and extract 8

eigenvectors, each of which is with 8 dimensions. Yes, you did PCA. Imagine that you convert

the original 8×6912 matrix into the other space using the learned eigenvectors. For example,

if your eigenvector matrix is W, than W⊤R will do it. Plot your W⊤ and compare it to the

DCT matrix shown in M02-S21. Similar? Submit your plot and code.

1

Figure 1: B-Line trail

6. We just saw that PCA might be able to replace DCT. But, it seems to depend on the quality

of PCA. One way to improve the quality is to increase the size of your data set, so that you

can start from a good sample covariance matrix. To do so, go back to the R matrix generation

procedure. But, this time, increase the total number of blocks to 90 (30 blocks per channel).

Note that each block is with 8 × 768 pixels once again. See if the eigenvectors are better

looking (submit the plot).

P2: Instantaneous Source Separation [6 points]

1. From x ica 1.wav to x ica 4.wav are four recordings we observed at an audio scene. In this

audio scene, there are three speakers saying something at the same time plus a motorcycle

passing by. You may want to listen to those recordings to check out who says what, but

I made it very careful so that you guys cannot understand what they are saying. In other

words, I multiplied a 4 × 4 mixing matrix A to the four sources to create the four channel

mixture:

x1(t)

x2(t)

x3(t)

x4(t)

= A

s1(t)

s2(t)

s3(t)

s4(t)

(1)

2. But, as you’ve learned how to do source separation using ICA, you should be able to separate

them out into four sources: three clean speech signals and the motorcycle noise. Listen to

your separated sources and transcribe what they are saying. Submit your separated .wav files

along with your transcription.

2

3. At every iteration of the ICA algorithm, use these as your update rules:

∆W ←

NI − g(Y )f(Y )

′

W (2)

W ← W + ρ∆W (3)

Y ← W Z (4)

where

W : The ICA unmixing matrix you’re estimating (5)

Y : The 4 × N source matrix you’re estimating (6)

Z : Whitened version of your input (using PCA) (7)

g(x) : tanh(x),(works element-wise) (8)

f(x) : x

3

,(works element-wise) (9)

ρ : learning rate (10)

N : number of samples (11)

4. Don’t forget to whiten your data before applying ICA!

5. Implementation notes: Depending on the choice of the learning rate the convergence of the

ICA algorithm varies. But I always see the convergence in from 5 sec to 90 sec in my desktop

computer.

P3: Ideal Masks [4 points]

1. piano.wav and ocean.wav are two sources you’re interested in. Load them separately and

apply STFT with 1024 point frames and 50% overlap. Use Hann windows. Let’s call these

two spectrograms S and N, respectively. Discard the complex conjugate part, so eventually

they will be an 513 × 158 matrix1

. Later on in this problem when you recover the time

domain signal out of this, you can easily recover the discarded half from the existing half so

that you can do inverse-DFT on the column vector of full 1024 points. Hint: Why 513, not

512? Create a very short random signal with 16 samples, and do a DFT transform to convert

it into a spectrum of 16 complex values. Check out their complex coefficients to see why you

need N/2 + 1, not N/2.

I will allow you to use other implementations, such as librosa.stft, but I strongly encourage

you to reuse your code from Homework 2.

2. Now you build a mixture spectrogram by simply adding the two source spectrograms: X =

S + N. Note that all the numbers here are complex values.

3. Since you know the sources, the source separation job is trivial. One way is to calculate the

ideal masks M =

S

S+N

(once again, note that they are all complex valued and the division

is element-wise). By the definition of the mixture spectrogram, S = M ⊙ X, where ⊙ stands

for a Hadamard product. But we won’t use this one today.

1The exact number of columns may be different depending on your STFT setup. If it’s in the same ball park, it’s

okay.

4. Sometimes we can only estimate a nonnegative real-valued masking matrix M¯ especially if

we don’t have an access to the phase of the sources. For example, M¯ =

|S|

2

|S|

2+|N|

2

. Go

ahead and calculate M¯ from your sources, and multiply it to your mixture spectrogram, i.e.

S ≈ M¯ ⊙ X. Convert your estimated piano spectrogram back to the time domain. Submit

the .wav file of your recovered piano source.

5. Listen to the recovered source. Is it too different from the original? One way to objectively

measure the quality of the recovered signal is to compare it to the original signal by using a

metric called Signal-to-Noise Ratio (SNR):

SNR = 10 log10 P

t

{s(t)}

2

P

t

{s(t) − sˆ(t)}

2

, (12)

where s(t) is the t-th sample of the original source and ˆs(t) is that of the recovered one.

Evaluate the SNR between piano.wav and your reconstruction for it. Note: their lengths

could be slightly different. Just ignore the small difference in the end.

6. Yet another masking scheme is something called Ideal Binary Masks (IBM). This time, we

use a binary (0 or 1) masking matrix B, which is definded by

Bf t =

1 if |S|f t > |N|f t

0 otherwise (13)

7. Create your IBM from the sources, and apply it to your mixture spectrogram, S ≈ B ⊙ X.

Do the inverse STFT. How does it sound? What’s its SNR value?

8. Don’t forget to create audio players for the sound examples in your iPython (i.e., jupyter or

Google Colab) notebook. Check if they play sound in the .html version.

4