## Description

Problem 1 (15 points). In class we said that for a generative model (e.g., Naive Bayes), the

optimal estimator will be the maximum aposteriori probability (MAP) estimator that when given

a feature X, outputs the label that satisfies:

arg max

y∈Y

p(y|X).

The maximum likelihood (ML) estimator outputs the label that maximizes the likelihood (probability) of the observation (which is the feature X):

arg max

y∈Y

p(X|y).

In this problem we will see that this is the predictor with the least error probability if the

underlying data is generated from the model.

We will simplify the setting by considering a binary classification task, where the labels have

two possible values, say Y = {−1, +1}. Suppose the model that generates the data is p(X, y),

which is known.

1. What is the distribution of y when we observe a feature X?

2. Suppose we predict the label −1 upon seeing X. Show that the probability of error is p(y =

+1|X).

3. Use this to argue that for any X the prediction to minimize the error probability is

max

y∈{−1,+1}

p(y|X).

This shows that the MAP estimator is the optimal estimator for the binary task. This also

extends to larger Y.

1

4. Show that if the distribution over the labels is uniform, namely p(y = −1) = p(y = +1) = 0.5,

then the MAP estimator and ML estimator are the same.

5. Construct any generative model where the MAP and ML estimator are not the same.

Problem 2. (10 points). ML vs MAP and add constant smoothing. Suppose you generate

n independent coin tosses using a coin with bias µ (i.e. the probability of getting a head for each

toss). Let A(nH, nT ) denote the event of getting nH heads and nT = n − nH tails. Show the

following:

1. According to maximum likelihood principle, show that your estimate for µ should be:

µˆ =

nH

nH + nT

.

Hint: Given that the bias is µ, show that:

p(A(nH, nT )|µ) =

nH + nT

nT

µ

nH (1 − µ)

nT .

2. Consider the following generative process: First generate µ, the bias of the coin, according to

the prior distribution p(µ). Then generate n independent coin tosses with bias µ. Show that

arg max

µ

p(µ|A(nH, nT )) = arg max

µ

p(µ)p(A(nH, nT )|µ).

3. Let the prior of the bias be a Beta distribution, which is a distribution over [0, 1]

p(µ) = µ

α(1 − µ)

β

R 1

0

x

α(1 − x)

βdµ

,

show that:

arg max

µ

p(µ|nH, nT ) = nH + α

nH + α + nT + β

.

Remark: This shows that add constant smoothing is equivalent to inducing a Beta distribution prior on the parameter of the generating model.

Problem 3. (10 points). Consider the Tennis data set. In this problem, you don’t need the

smoothing constant when estimating the prior probabilities of the labels.

1. For β = 1 (smoothing constant), write down the probabilities of all the features conditioned

on the labels. The total number of probabilities you need to compute should not be more

than twenty.

2. What are the prior probabilities of the labels?

3. For a new label (Overcast, Hot, High, Strong), what does the Naive Bayes classifier predict

for β = 0, β = 1, and for β → ∞?

2

Problem 4. (15 points). Recall the Gaussian distribution with mean µ, and variance σ

2

. The

density is given by:

pµ,σ2 (x) = 1

√

2πσ2

exp

−

(x − µ)

2

2σ

2

.

Given n independent samples X1, . . . , Xn from the same Gaussian distribution with unknown mean,

and variance, let µML, and σ

2

ML denote the maximum likelihood estimates of mean and variance.

Hint: (1) What is the joint density p(x1, . . . , xn) for i.i.d Gaussian random variables X1, . . . , Xn?

(2) If f(x) > 0, maximizing f(x) is equivalent as maximizing log[f(x)].

1. Show that

µML =

Pn

i=1 Xi

n

.

2. Show that

σ

2

ML =

1

n

Xn

i=1

(Xi − µML)

2

!

.

3. What is the expectation and variance of µML?

Problem 5. (20 points). See attached Jupyter Notebook for reference.

3