Homework Set Four ECE 271A solution


1. BDR and nearest neighbors Consider a classification problem with c classes and uniform class
probabilities, i.e. P_Y(i) = 1/c, ∀i. Assume that the goal is to classify an iid sequence of observations
X = {x_1, …, x_n} as a whole (i.e. the samples are not classified one at a time).
a) Compute the BDR for this problem and show that it converges (in probability) to a nearest neighbor
rule based on the class-conditional distributions and the distribution of the observations. Show that
the distance function is the Kullback-Leibler divergence
D[p(x) \| q(x)] = \int p(x) \log \frac{p(x)}{q(x)} \, dx.
This proves that the BDR for the classification of sequences is really just a nearest neighbor rule.
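One way to see this (a sketch of the key step only, not the full derivation): with uniform class probabilities the BDR for the whole sequence reduces to a maximum-likelihood rule,

i^*(X) = \arg\max_i \prod_{j=1}^{n} P_{X|Y}(x_j|i) = \arg\max_i \frac{1}{n} \sum_{j=1}^{n} \log P_{X|Y}(x_j|i),

and, by the weak law of large numbers, if the x_j are drawn from P_X then

\frac{1}{n} \sum_{j=1}^{n} \log P_{X|Y}(x_j|i) \xrightarrow{P} \int P_X(x) \log P_{X|Y}(x|i) \, dx = -D[P_X(x) \| P_{X|Y}(x|i)] + \text{const},

where the constant (the negative entropy of P_X) does not depend on i.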
b) Assuming that all densities are Gaussian with equal covariance Σ, that the class-conditional densities
have means μi, and that the observation density has mean μ, write down an expression for the decision rule as
a function of the Gaussian parameters. Provide an interpretation of this new decision rule by stating
what items are being compared and what the distance function is.
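For part b), a standard fact may be useful (stated here only as a hint, in the notation of a)): for two Gaussians with a common covariance Σ, the KL divergence collapses to half the Mahalanobis distance between the means,

D[G(x, \mu, \Sigma) \| G(x, \mu_i, \Sigma)] = \frac{1}{2} (\mu - \mu_i)^T \Sigma^{-1} (\mu - \mu_i).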
2. Kernel methods Problem 4.3.3 in DHS
3. Multinomial EM In this problem we consider an example where there is a closed-form solution
to ML estimation from incomplete data. The goal is to compare with the EM solution and get some
insight into how the steps of the latter can be substantially easier to derive than those of the former.
Consider our bridge example and let U be the type of vehicle that crosses the bridge. U can take
4 values (compact, sedan, station wagon, and pick-up truck), which we denote by U ∈ {1, 2, 3, 4}. On a
given day, an operator collects an iid sample of size n from U, and the number of vehicles of each type
is counted and stored in a vector D = (x1, x2, x3, x4). The resulting random variable X (the histogram
of vehicle classes) has a multinomial distribution
P_{X_1,X_2,X_3,X_4}(x_1, x_2, x_3, x_4; \Psi) = \frac{n!}{x_1!\,x_2!\,x_3!\,x_4!} \left(\frac{1}{2} + \frac{1}{4}\Psi\right)^{x_1} \left(\frac{1}{4} - \frac{1}{4}\Psi\right)^{x_2} \left(\frac{1}{4} - \frac{1}{4}\Psi\right)^{x_3} \left(\frac{1}{4}\Psi\right)^{x_4}.
However, it is later realized that the operator included motorcycles in the compact class. It is established
that bikes have probability (1/4)Ψ, which leads to a new model
P_{X_{11},X_{12},X_2,X_3,X_4}(x_{11}, x_{12}, x_2, x_3, x_4; \Psi) = \frac{n!}{x_{11}!\,x_{12}!\,x_2!\,x_3!\,x_4!} \left(\frac{1}{2}\right)^{x_{11}} \left(\frac{1}{4}\Psi\right)^{x_{12}} \left(\frac{1}{4} - \frac{1}{4}\Psi\right)^{x_2} \left(\frac{1}{4} - \frac{1}{4}\Psi\right)^{x_3} \left(\frac{1}{4}\Psi\right)^{x_4}.
Determining the parameter Ψ from the available data is a problem of ML estimation with missing
data, since we only have measurements for
x_1 = x_{11} + x_{12}
but not for x11 and x12 independently.
a) Determine the value of Ψ that maximizes the likelihood of D, i.e.
\Psi^*_i = \arg\max_\Psi P_{X_1,X_2,X_3,X_4}(D; \Psi)
by using standard ML estimation procedures.
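Independently of the analytical solution, the incomplete-data log-likelihood can be checked numerically on a grid of Ψ values. A minimal Python sketch, where the counts in the vector x are made-up placeholders rather than course data:

import numpy as np
from scipy.special import gammaln

# Hypothetical observed counts D = (x1, x2, x3, x4); replace with the actual data.
x = np.array([60, 12, 15, 13])
n = x.sum()

def incomplete_loglik(psi, x):
    """Log of the incomplete-data multinomial likelihood P_{X1,...,X4}(x; psi)."""
    p = np.array([0.5 + 0.25 * psi, 0.25 - 0.25 * psi, 0.25 - 0.25 * psi, 0.25 * psi])
    log_coeff = gammaln(x.sum() + 1) - gammaln(x + 1).sum()
    return log_coeff + np.sum(x * np.log(p))

# Grid search over the open interval (0, 1) as a sanity check of the ML estimate.
psi_grid = np.linspace(1e-3, 1 - 1e-3, 999)
ll = np.array([incomplete_loglik(p, x) for p in psi_grid])
print("numerical maximizer of the likelihood:", psi_grid[np.argmax(ll)])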
b) Assume that we have the complete data, i.e. Dc = (x11, x12, x2, x3, x4). Determine the value of Ψ
that maximizes its likelihood, i.e.
\Psi^*_c = \arg\max_\Psi P_{X_{11},X_{12},X_2,X_3,X_4}(D_c; \Psi),
by using standard ML estimation procedures. Compare the difficulty of obtaining this solution vs. that
of obtaining the solution in a). Does this look like a problem where EM might be helpful?
c) Derive the E and M-steps of the EM algorithm for this problem.
d) Using the equations for the EM steps, determine the fixed point of the algorithm (i.e. the solution)
by setting
\Psi_{k+1} = \Psi_k,
where k is the iteration number. Compare to the solution obtained in a).
4. Mixtures of Gaussians The goal of this problem is to give you some “hands-on” experience with
the very important case of EM as a tool for the estimation of the parameters of a mixture. Consider a
mixture of two Gaussians
P_X(x) = \sum_{c=1}^{2} \pi_c \, G(x, \mu_c, \Sigma_c)
where the covariance matrices are diagonal, i.e. \Sigma_c = \mathrm{diag}(\sigma^2_{c,1}, \sigma^2_{c,2}), and a training sample of five points
D = \{(-2.5, -1), (-2, 0.5), (-1, 0), (2.5, -1), (2, 1)\}.
a) Assume that the following hold:
\mu_1 = -\mu_2,
\Sigma_1 = \Sigma_2 = \sigma^2 I,
\pi_1 = \pi_2 = \frac{1}{2}.
Plot the log-likelihood surface log P_X(D) as a function of the mean parameters (the entries of μ1) for
σ^2 ∈ {0.1, 1, 2}. Let the coordinate axes cover the range [−5, 5]. What can you say about the local
maxima of the likelihood surface, and how do they change with σ^2? How does convergence to the optimum
depend on the location of the initial parameter guess?
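A minimal sketch of how such a surface could be evaluated, assuming the constraints above (μ2 = −μ1, Σ1 = Σ2 = σ^2 I, equal weights); the actual contour or surface plot (e.g. with matplotlib) is omitted:

import numpy as np
from scipy.stats import multivariate_normal

# Training sample from the problem statement.
D = np.array([(-2.5, -1.0), (-2.0, 0.5), (-1.0, 0.0), (2.5, -1.0), (2.0, 1.0)])

def loglik_surface(sigma2, grid_pts=101, lim=5.0):
    """log P_X(D) over a grid of mu1 = (m1, m2) values, with mu2 = -mu1 and Sigma = sigma2 * I."""
    axis = np.linspace(-lim, lim, grid_pts)
    cov = sigma2 * np.eye(2)
    surface = np.zeros((grid_pts, grid_pts))
    for i, m1 in enumerate(axis):
        for j, m2 in enumerate(axis):
            mu1 = np.array([m1, m2])
            # Equal-weight two-component mixture with symmetric means.
            px = 0.5 * multivariate_normal.pdf(D, mean=mu1, cov=cov) \
               + 0.5 * multivariate_normal.pdf(D, mean=-mu1, cov=cov)
            surface[i, j] = np.sum(np.log(px))
    return axis, surface

for s2 in (0.1, 1.0, 2.0):
    axis, S = loglik_surface(s2)
    print("sigma^2 =", s2, " max grid log-likelihood =", S.max())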
b) Starting from the initial parameter estimates
\pi_1^{(0)} = \pi_2^{(0)} = \frac{1}{2},
\mu_1^{(0)} = -\mu_2^{(0)} = (-0.1, 0),
\Sigma_1^{(0)} = \Sigma_2^{(0)} = I,
compute all the quantities involved in the first 3 iterations of the EM algorithm. For each iteration
produce
• plot 1: the posterior surface P_{Z|X}(1|x) for the first class as a function of x,
• plot 2: the mean of each Gaussian, the contour where the associated Mahalanobis distance equals 1,
the points in D, and the means of the solutions obtained in the previous steps.
Let EM run until convergence, storing the mean estimates at each iteration. Produce the two plots
above for the final solution. In plot 2, plot the values of the means as they progress from the initial to
the final estimate.
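For reference, a minimal sketch of the quantities computed in one EM iteration for this two-component diagonal-covariance mixture (the function and variable names are my own, not part of the assignment); looping it and recording mu at each pass gives the mean trajectory requested for plot 2:

import numpy as np
from scipy.stats import multivariate_normal

D = np.array([(-2.5, -1.0), (-2.0, 0.5), (-1.0, 0.0), (2.5, -1.0), (2.0, 1.0)])

# Initial estimates from the problem statement.
pi = np.array([0.5, 0.5])
mu = np.array([[-0.1, 0.0], [0.1, 0.0]])
var = np.array([[1.0, 1.0], [1.0, 1.0]])   # diagonal entries of Sigma_1, Sigma_2

def em_step(D, pi, mu, var):
    # E-step: responsibilities (posteriors) P_{Z|X}(c|x_j) for each point and component.
    lik = np.stack([pi[c] * multivariate_normal.pdf(D, mean=mu[c], cov=np.diag(var[c]))
                    for c in range(2)], axis=1)                    # shape (n, 2)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M-step: update weights, means and diagonal variances.
    nc = resp.sum(axis=0)
    pi_new = nc / len(D)
    mu_new = (resp.T @ D) / nc[:, None]
    var_new = np.stack([(resp[:, c, None] * (D - mu_new[c])**2).sum(axis=0) / nc[c]
                        for c in range(2)], axis=0)
    return pi_new, mu_new, var_new, resp

for k in range(3):
    pi, mu, var, resp = em_step(D, pi, mu, var)
    print("iteration", k + 1, "means:\n", mu)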
5. EM and MAP estimates In this problem we use EM for the maximization of the posterior
probability
\Psi^* = \arg\max_\Psi P_{\Psi|X}(\Psi|x).
Consider the multinomial distribution of problem 3 and a Beta prior
P_\Psi(\Psi) = \frac{\Gamma(\nu_1 + \nu_2)}{\Gamma(\nu_1)\Gamma(\nu_2)} \Psi^{\nu_1 - 1}(1 - \Psi)^{\nu_2 - 1}.
Derive the equations of the EM algorithm for MAP estimation of the parameter Ψ.
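A standard property of EM may serve as a starting point here: for MAP estimation the E-step is unchanged, and only the M-step objective acquires the log-prior term,

\Psi^{(k+1)} = \arg\max_\Psi \left\{ Q(\Psi; \Psi^{(k)}) + \log P_\Psi(\Psi) \right\},

where Q(\Psi; \Psi^{(k)}) = E_{Z|X;\Psi^{(k)}}[\log P_{XZ}(x, Z; \Psi)] is the usual expected complete-data log-likelihood and Z denotes the missing data.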
6. (Computer) This week we use the cheetah image to evaluate the performance of a classifier based on
mixture models estimated with EM. Once again we use the decomposition into 8×8 image blocks, compute the DCT of each block, and zig-zag scan it. For this (using the data in TrainingSamplesDCT_new_8.mat)
we fit a mixture of Gaussians with diagonal covariances to each class, i.e.
P_{X|Y}(x|i) = \sum_{c=1}^{C} \pi_c \, G(x, \mu_c, \Sigma_c)
where all Σc are diagonal matrices. We then apply the BDR based on these density estimates to the
cheetah image and measure the probability of error as a function of the number of dimensions of the
space (as before, use {1, 2, 4, 8, 16, 24, 32,…, 64} dimensions).
a) For each class, learn 5 mixtures of C = 8 components, using a random initialization (recall that the
mixture weights must add up to one). Plot the probability of error vs. dimension for each of the 25
classifiers obtained with all possible mixture pairs. Comment on the dependence of the probability of error
on the initialization.
b) For each class, learn mixtures with C ∈ {1, 2, 4, 8, 16, 32}. Plot the probability of error vs. dimension
for each number of mixture components. What is the effect of the number of mixture components on
the probability of error?
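A minimal sketch of the EM fit used by both a) and b), assuming the DCT feature vectors of one class are already stored row-wise in a numpy array (loading TrainingSamplesDCT_new_8.mat, e.g. with scipy.io.loadmat, and the BDR evaluation on the cheetah image are left out); the function and variable names are my own:

import numpy as np

def fit_diag_gmm(X, C=8, n_iter=200, seed=0, eps=1e-6):
    """EM for a C-component Gaussian mixture with diagonal covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Random initialization: weights summing to one, means at random samples, unit variances.
    pi = rng.random(C); pi /= pi.sum()
    mu = X[rng.choice(n, size=C, replace=False)]
    var = np.ones((C, d))
    for _ in range(n_iter):
        # E-step: log of pi_c * G(x, mu_c, Sigma_c) for every sample and component.
        log_lik = (np.log(pi)[None, :]
                   - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)[None, :]
                   - 0.5 * np.sum((X[:, None, :] - mu[None, :, :])**2 / var[None, :, :], axis=2))
        log_lik -= log_lik.max(axis=1, keepdims=True)          # for numerical stability
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: reweight, recompute means and diagonal variances.
        nc = resp.sum(axis=0)
        pi = nc / n
        mu = (resp.T @ X) / nc[:, None]
        var = (resp.T @ X**2) / nc[:, None] - mu**2 + eps
    return pi, mu, var

# Usage sketch with random placeholder data standing in for one class's 64-dimensional DCT features.
X_fake = np.random.default_rng(1).normal(size=(500, 64))
pi, mu, var = fit_diag_gmm(X_fake, C=8)
print(pi.sum(), mu.shape, var.shape)

As in the earlier assignments, the lower-dimensional classifiers can presumably be obtained by keeping only the first few feature columns (zig-zag order) before fitting and applying the BDR.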