## Description

P1: Stereo Matching (revisited) [5 points]

1. im0.ppm (left) and im8.ppm (right) are the pictures taken by two different camera positions1

.

If you load the images, they will be a three dimensional array of 381 × 430 × 3, whose third

dimension is for the three color channels (RGB). Let’s call them XL and XR. For the (i,j)-th

pixel in the right image, XR

(i,j,:), which is a 3-d vector of RGB intensities, we can scan and

find the most similar pixel in the left image at i-th row (using a metric of your choice). For

example, I did the search from XL

(i,j,:) to XL

(i,j+39,:), to see which pixel among the 40 are

the closest. I record the index-distance of the closest pixel. Let’s say that XL

(i,j+19,:) is the

most similar one to XR

(i,j,:). Then, the index-distance is 19. I record this index-distance (to

the closest pixel in the left image) for all pixels in my right image to create a matrix called

“disparity map”, D, whose (i, j)-th element says the index-distance between the (i, j)-th pixel

of the right image and its closest pixel in the left image. For an object in the right image, if

its pixels are associated with an object in the left image, but are shifted far away, that means

the object is close to the camera, and vice versa.

2. Calculate the disparity map D from im0.ppm and im8.ppm, which will be a matrix of 381×390

(since we search within only 40 pixels). Vectorize the disparity matrix and draw a histogram.

How many clusters do you see?

3. Write up your own GMM clustering code, and cluster the disparity values in D. Each value

will belong to (only) one of the clusters. The number of clusters says the number of depth

1http://vision.middlebury.edu/stereo/data/

1

levels. If you replace the disparity values with the cluster means, you can recover the depth

map with k levels. Plot your depth map (the disparity map replaced by the mean disparities

as in the image quantization examples) in gray scale–pixels of the frontal objects should be

bright, while the ones in the back get darker.

4. Extend your implementation with the MRF’s smoothing priors using an eight neighborhood

system (e.g. Ni,j =

(i − 1, j − 1),(i − 1, j),(i − 1, j + 1),(i, j − 1),(i, j + 1),(i + 1, j − 1),(i +

1, j),(i+ 1, j + 1)

. Feel free to choose either ICM or Gibbs sampling. Show me the smoothed

results. You can use the Gaussian-kernel-looking prior probability equations discussed in class

(M10 S8).

P2: Probabilistic Latent Semantic Indexing (PLSI) for Speech Denoising

[5 points]

1. Convert the two training signals trs.wav and trn.wav using the earlier STFT setup you used

in the previous homework. Let’s call these complex-valued matrices S and N.

2. Build a speech denoising system by using the PLSI algorithm. The overall procedure is

described in M11-S33. The update rules are on M11-S18 (use the ones on the top of the slide,

not the NMF ones).

3. First, run your PLSI algorithm to learn B

(1) from |S| (the magnitudes).

4. Second, run your PLSI algorithm to learn B

(2) from |N|.

5. Third, load your test mixture tex.wav and turn it into a spectrogram X. Run your third

PLSI routine here to learn Θ(3) by taking the magnitudes |X|. Remember, you don’t want

to update B

(1) and B

(2) during testing. You just initialize them from the ones you trained.

6. You know, the speech source can be first recovered by doing B

(1)Θ

(3)

1:Ks,:

, where Ks is the

number of basis vectors in B

(1). It means that you need to pick your own choice of Ks and

Kn during training.

7. However, B

(1)Θ

(3)

1:Ks,: will only give you the “probability matrix” of the speech source, not

the actual spectrogram. Hence, you need to turn it into some kind of TF-bin-wise posterior

probability (of belonging to the speech source) and use it as a mask:

Sˆ

test =

B

(1)Θ

(3)

1:Ks,:

[B

(1)

, B

(2)]Θ(3) ⊙ X, (1)

where [B

(1)

, B

(2)] must be a F × (Ks + Kn) matrix and ⊙ is a Hadamard product.

8. Transform Sˆ

test back to the time domain.

9. Report the SNR value of the separation result by comparing sˆtest to tes.wav.

10. Plug in the recovered speech to the notebook with a player (so that the AI can listen to it).

2

P3: PLSI for Analyzing Twitter Stream [5 points]

1. twitter.mat holds two Term-Frequency (TF) matrices Xtr and Xte. It also contains YtrM at

and YteM at, the target variables in the one-hot vector format.

2. Each column of the TF matrix Xtr can be either “positive”, “negative”, or “neutral”, which

are represented numerically as 1, 2, and 3 in the YtrM at. They are sentimental classes of the

original twits.

3. Learn 50 PLSI topics B ∈ R

891×50 and their weights Θtr ∈ R

50×773 from the training data

Xtr, using the ordinary PLSI update rules.

4. Reduce the dimension of Xte down to 50, by learning the weight matrix Θte ∈ R

50×193. This

can be done by doing another PLSI on the test data Xte, but this time by reusing the topic

matrix B you learned from the training set. So, you skip the update rule for B. You only

update Θte ∈ R

50×193

.

5. Define a perceptron layer for the softmax classification. This part is similar to the case with

kernel PCA with a perceptron as you did in Homework #5. Instead of the kernel PCA results

as the input to the perceptron, you use Θtr for training, and Θte for testing. This time the

number of output units is 3 as there are three classes, and that’s why the target variable

YtrM at is with three elements. Review M6 S37-39 to review what softmax is.

6. Report your classification accuracy.

7. Note: do NOT use any deep learning framework that supports softmax implementation and

automatic gradient computation. Use your own implementation of softmax and backpropagation.

3