Description
1 Maximum Likelihood Estimation
Suppose we have N i.i.d (independent and identically distributed) data samples, i.e. X = {x
n}
N
n=1 from the
following probability distributions. This problem asks you to build a log-likelihood function for each
distribution (i.e. log P(X)), and find the maximum likelihood estimator of the parameter(s).
(a) Poisson distribution
The Poisson distribution is defined as
P(x = k) = λ
k
e
−λ
k!
(k = 0, 1, 2, …).
What is the maximum likelihood estimator of λ?
(b) Exponential distribution
The probability density function of Exponential distribution is given by
f(x) =
λe−λx x ≥ 0
0 x < 0
What is the maximum likelihood estimator of λ?
1
(c) Gaussian normal distribution
A univariate Gaussian normal distribution, N (µ, σ2
), is given by
N (x; µ, σ2
) = 1
σ
√
2π
exp
−
(x − µ)
2
2σ
2
.
What is the maximum likelihood estimator of µ and σ
2
?
2 Chance of Karl Die
… No, this isn’t a sinister Yoda-ish statement regarding my state of health. It refers to a type of dice
famously (in my mind) named after me (by me). The dice is such that the value of the current roll depends
on the value of the previous roll. More on that in a second – let’s set up first.
Let x
(n) be the n
th sequence of M dice rolls. We represent x
(n) as a vector of individual rolls, i.e.
x
(n) = (x
(n)
1
, x
(n)
2
, . . . , x
(n)
M ). Each Karl dice roll can take up to 12 values (that’s my Messiah complex kicking
in again), i.e. x
(n)
i ∈ {1, . . . , 12}. Every sequence of rolls is independent of the other, but the rolls within
a sequence are dependent on each other in the following way: first roll does not depend on any other roll,
second roll depends on the first, third depends on the second, etc….
We also have a Regular 12-sided dice in which individual rolls are completely independent from each
other. Every sequence of rolls uses a single dice (i.e. rolls within a sequence must be performed by the same
dice). For the n
th sequence, we have a label y
(n)
such that
y
(n) =
1 if Karl dice is used
0 if Regular dice is used
We collect all the data in X and y variables which collectively constitute our dataset D, i.e.
D = (X, Y ) = {(x
(n)
, y(n)
)}
N
n=1.
Our game is simple: given a sequence of rolls, how can we identify which type of dice was used to generate
the roll? We will solve this problem by estimating the parameters that maximize the likelihood of our data.
(a) Show that the likelihood function can be written as
p(D|h) = Y
N
n=1
h
π1K(x
(n)
)
iy
(n) h
π0R(x
(n)
)
i1−y
(n)
where
πi = p(y = i)
K(x
(n)
) = p(x
(n)
|y
(n) = 1)
R(x
(n)
) = p(x
(n)
|y
(n) = 0)
2
Note: if, during your derivation, you feel a step needs explanation, please feel free to describe
it in English (as opposed to?) as well.
(b) Show that the log-likelihood can be written as
log p(D|h) =X
N
n=1
y
(n)
”
π1 + log K(x
(n)
1
) +X
M
i=2
log K(x
(n)
i
|x
(n)
i−1
)
#
+
X
N
n=1
(1 − y
(n)
)
”
log π0 +
X
M
i=1
log R(x
(n)
i
)
#
(1)
where
K(x
(n)
1
) = p(x
(n)
1
|y
(n) = 1)
K(x
(n)
i
|x
(n)
i−1
) = p(x
(n)
i
|x
(n)
i−1
, y(n) = 1)
R(x
(n)
i
) = p(x
(n)
i
|y
(n) = 0)
(c) Now that we have that done, we can see that we have 4 sets of parameters. Specifically,
πi for i = 1, 2
K(v) for v = 1, . . . , 12
K(v|v
0
) for v = 1, . . . , 12, v0 = 1, . . . , 12
R(v) for v = 1, . . . , 12
Write the optimization constraints for each of the above sets of parameters.
(d) Write the Lagrangian form of the maximum (log) likelihood estimation problem.
(e) Show that the log-likelihood can be further expanded to the form
log p(D|h) =X
N
n=1
y
(n)
”
π1 +
X
v
I(x
(n)
1 = v) log K(x
(n)
1 = v)
+
X
M
i=2
X
v
X
v
0
I(x
(n)
i = v, x
(n)
i−1 = v
0
) log K(x
(n)
i = v|x
(n)
i−1 = v
0
)
#
+
X
N
n=1
(1 − y
(n)
)
”
log π0 +
X
M
i=1
X
v
I(x
(n)
i = v) log R(x
(n)
i = v)
#
where
I(a = b) =
1 if a = b
0 otherwise
I(a = b, c = d) =
1 if a = b AND c = d
0 otherwise
and P
v
is simply P
v∈{1,…,12}
(similarly for v
0
).
(f) Show that the MLE for π1 has the form
π1 =
1
N
X
N
n=1
y
(n)
3
where N is the number of data points.
(g) Show that the MLE estimate for K(v) has the form
K(v) = 1
N1
count1(v, y = 1)
where N1 is the number of data points where y = 1, and count1(v, y = 1) is the number of data
points with label y = 1 where feature 1 has the value v.
(h) Derive the MLE estimate for K(v|v
0
). You can’t expect me to give you all the answers…
but I’ll be nice – you may (will) find the following measure helpful
counti(v, v0
, y(n) = 1)
which is simply the number of data points with label y = 1, where the i
th feature has value v,
and the i − 1
th feature has value v
0
.
(i) Derive the MLE estimate for R(v).
4