(ENGR-E 511) Homework 6 Stereo Matching solution




5/5 - (4 votes)

P1: Stereo Matching (revisited) [5 points]
1. im0.ppm (left) and im8.ppm (right) are the pictures taken by two different camera positions1
If you load the images, they will be a three dimensional array of 381 × 430 × 3, whose third
dimension is for the three color channels (RGB). Let’s call them XL and XR. For the (i,j)-th
pixel in the right image, XR
(i,j,:), which is a 3-d vector of RGB intensities, we can scan and
find the most similar pixel in the left image at i-th row (using a metric of your choice). For
example, I did the search from XL
(i,j,:) to XL
(i,j+39,:), to see which pixel among the 40 are
the closest. I record the index-distance of the closest pixel. Let’s say that XL
(i,j+19,:) is the
most similar one to XR
(i,j,:). Then, the index-distance is 19. I record this index-distance (to
the closest pixel in the left image) for all pixels in my right image to create a matrix called
“disparity map”, D, whose (i, j)-th element says the index-distance between the (i, j)-th pixel
of the right image and its closest pixel in the left image. For an object in the right image, if
its pixels are associated with an object in the left image, but are shifted far away, that means
the object is close to the camera, and vice versa.
2. Calculate the disparity map D from im0.ppm and im8.ppm, which will be a matrix of 381×390
(since we search within only 40 pixels). Vectorize the disparity matrix and draw a histogram.
How many clusters do you see?
3. Write up your own GMM clustering code, and cluster the disparity values in D. Each value
will belong to (only) one of the clusters. The number of clusters says the number of depth
levels. If you replace the disparity values with the cluster means, you can recover the depth
map with k levels. Plot your depth map (the disparity map replaced by the mean disparities
as in the image quantization examples) in gray scale–pixels of the frontal objects should be
bright, while the ones in the back get darker.
4. Extend your implementation with the MRF’s smoothing priors using an eight neighborhood
system (e.g. Ni,j =

(i − 1, j − 1),(i − 1, j),(i − 1, j + 1),(i, j − 1),(i, j + 1),(i + 1, j − 1),(i +
1, j),(i+ 1, j + 1)
. Feel free to choose either ICM or Gibbs sampling. Show me the smoothed
results. You can use the Gaussian-kernel-looking prior probability equations discussed in class
(M10 S8).
P2: Probabilistic Latent Semantic Indexing (PLSI) for Speech Denoising
[5 points]
1. Convert the two training signals trs.wav and trn.wav using the earlier STFT setup you used
in the previous homework. Let’s call these complex-valued matrices S and N.
2. Build a speech denoising system by using the PLSI algorithm. The overall procedure is
described in M11-S33. The update rules are on M11-S18 (use the ones on the top of the slide,
not the NMF ones).
3. First, run your PLSI algorithm to learn B
(1) from |S| (the magnitudes).
4. Second, run your PLSI algorithm to learn B
(2) from |N|.
5. Third, load your test mixture tex.wav and turn it into a spectrogram X. Run your third
PLSI routine here to learn Θ(3) by taking the magnitudes |X|. Remember, you don’t want
to update B
(1) and B
(2) during testing. You just initialize them from the ones you trained.
6. You know, the speech source can be first recovered by doing B
, where Ks is the
number of basis vectors in B
(1). It means that you need to pick your own choice of Ks and
Kn during training.
7. However, B
1:Ks,: will only give you the “probability matrix” of the speech source, not
the actual spectrogram. Hence, you need to turn it into some kind of TF-bin-wise posterior
probability (of belonging to the speech source) and use it as a mask:

test =
, B
(2)]Θ(3) ⊙ X, (1)
where [B
, B
(2)] must be a F × (Ks + Kn) matrix and ⊙ is a Hadamard product.
8. Transform Sˆ
test back to the time domain.
9. Report the SNR value of the separation result by comparing sˆtest to tes.wav.
10. Plug in the recovered speech to the notebook with a player (so that the AI can listen to it).
P3: PLSI for Analyzing Twitter Stream [5 points]
1. twitter.mat holds two Term-Frequency (TF) matrices Xtr and Xte. It also contains YtrM at
and YteM at, the target variables in the one-hot vector format.
2. Each column of the TF matrix Xtr can be either “positive”, “negative”, or “neutral”, which
are represented numerically as 1, 2, and 3 in the YtrM at. They are sentimental classes of the
original twits.
3. Learn 50 PLSI topics B ∈ R
891×50 and their weights Θtr ∈ R
50×773 from the training data
Xtr, using the ordinary PLSI update rules.
4. Reduce the dimension of Xte down to 50, by learning the weight matrix Θte ∈ R
50×193. This
can be done by doing another PLSI on the test data Xte, but this time by reusing the topic
matrix B you learned from the training set. So, you skip the update rule for B. You only
update Θte ∈ R
5. Define a perceptron layer for the softmax classification. This part is similar to the case with
kernel PCA with a perceptron as you did in Homework #5. Instead of the kernel PCA results
as the input to the perceptron, you use Θtr for training, and Θte for testing. This time the
number of output units is 3 as there are three classes, and that’s why the target variable
YtrM at is with three elements. Review M6 S37-39 to review what softmax is.
6. Report your classification accuracy.
7. Note: do NOT use any deep learning framework that supports softmax implementation and
automatic gradient computation. Use your own implementation of softmax and backpropagation.