# HOMEWORK 2 – V1 Sta414/Sta2104 Na¨ıve Bayes Model solution

\$30.00

Original Work
Category:

## Description

In this assignment, we’ll fit both generative and discriminative models to the MNIST dataset
of handwritten numbers. Each datapoint in the MNIST [http://yann.lecun.com/exdb/mnist/]
dataset is a 28×28 black-and-white image of a number in {0 . . . 9}, and a label indicating which
number.
MNIST is the ’fruit fly’ of machine learning – a simple standard problem useful for comparing
the properties of different algorithms. Python and R codes for loading and plotting MNIST is
attached.
You can use whichever programming language you like, and libraries for loading and plotting
data. You’ll need to write your own initialization, fitting, and prediction code. You can use automatic differentiation in your code, but must still answer the gradient questions.
For this assignment, we’ll binarize the dataset, converting the grey pixel values to either black
or white (0 or 1) with > 0.5 being the cutoff. When comparing models, we’ll need a training and
test set. Use the first 10000 samples for training, and another 10000 for testing. This is all done
for you in the starter code. Hint: Also build a dataset of only 100 training samples to use when
1. Fitting a Na¨ıve Bayes Model, 25 pts. In this question, we’ll fit a na¨ıve Bayes model
to the MNIST digits using maximum likelihood. Na¨ıve Bayes defines the joint probability of the
each datapoint x and its class label c as follows:
p(x, c|θ, π) = p(c|π)p(x|c, θc) = p(c|π)
Y
784
d=1
p(xd|c, θcd)
Here, θ is a matrix of probabilities for each pixel and each class, so its total size is 784 × 10.
For binary data, we can use the Bernoulli likelihood:
p(xd|c, θcd) = Ber(xd|θcd) = θ
xd
cd (1 − θcd)
(1−xd)
Which is just a way of expressing that p(xd = 1|c, θcd) = θcd.
For p(c|π), we can just use a categorical distribution:
p(c|π) = Cat(c|π) = πc
Note that we need P9
i=0 πi = 1.
1
(a) Derive the maximum likelihood estimate (MLE) for the class-conditional pixel means θ. Hint:
We saw in lecture that MLE can be thought of as ‘counts’ for the data, so what should ˆθcd
be counting?
(b) Derive the maximum a posteriori (MAP) estimate for the class-conditional pixel means θ,
using a Beta(2, 2) prior on each θ. Hint: it has a simple final form, and you can ignore the
Beta normalizing constant.
(c) Fit θ to the training set using the MAP estimator. Plot θ as 10 separate greyscale images,
one for each class.
(d) Derive the log-likelihood log p(c|x, θ, π) for a single training image.
(e) Given parameters fit to the training set, and πc =
1
10 , report both the average log-likelihood
per datapoint, 1
N Σ
N
i=1 log p(ci
|xi
, θ, π) and the accuracy on both the training and test set. The
accuracy is defined as the fraction of examples where the true class t = argmaxc p(c|x, θ, π).
2. Generating from a Na¨ıve Bayes Model, 25 pts. Defining a joint probabilistic model
over the data lets us generate new data, and also lets us answer all sorts of queries about the
data.
(a) True or false: Given this model’s assumptions, any two pixels xi and xj where i 6= j are
independent given c.
(b) True or false: Given this model’s assumptions, any two pixels xi and xj where i 6= j are
independent when marginalizing over c.
(c) Using the parameters fit in question 1, produce random image samples from the model. That
is, randomly sample and plot 10 binary images from the marginal distribution p(x|θ,π).
Hint: first sample c.
(d) One of the advantages of generative models is that they can handle missing data, or be
used to answer different sorts of questions about the model. Derive p(xi∈bottom|xtop, θ,π),
the marginal distribution of a single pixel in the bottom half of an image given the top half,
conditioned on your fit parameters. Hint: you’ll have to marginalize over c.
(e) For 20 images from the training set, plot the top half the image concatenated with the
marginal distribution over each pixel in the bottom half. i.e. the bottom half of the image
should use grayscale to represent the marginal probability of each pixel being on.
3. Logistic Regression, 25 pts. Now, we’ll fit a simple predictive model using gradient
descent. Our model will be multiclass logistic regression:
p(c|x, w) = exp(wT
c x)
P9
c
0=0 exp(wT
c
0x)
Omit bias parameters for this question.
(a) How many parameters does this model have?
(b) Derive the gradient of the predictive log-likelihood w.r.t. w: ∇w log p(c|x, w)
and use it to optimize w to maximize the log-likelihood of the training set, and plot the
resulting parameters using one image per class. Since this objective is concave, you can initialize at all zeros. You can use the gradient ascent (since it is a maximization) updates as
covered in class or use automatic differentiation (autograd). However, you are not permitted to use optimizers which come built in to packages! Hint: Use scipy.logsumexp or its
2
equivalent to make your code more numerically stable. Also, try to avoid nested for loops,
(d) Given parameters fit to the training set, report both the average log-likelihood per datapoint,
and the accuracy on both the training and test set. How does it compare to Na¨ıve Bayes?
4. Unsupervised Learning, 25 pts. In this problem, we will experiment with two clustering algorithms: (i) k-means, and (ii) EM algorithm for Gaussian mixtures. In what follows, N
denotes the number of data points, D denotes the dimension of each data point, xi ∈ R
D denotes
a single data point, K denotes the number of clusters, and xi ∼ p(x|Ck) for k = 1, …, K denotes
the data generating distribution.
(a) First, we will generate some data for this problem. Set the number of points N = 400,
their dimension D = 2, and the number of clusters K = 2, and generate data from the
distribution p(x|Ck) = N (µk, Σk). Sample 200 data points for C1 and 200 for C2, with
µ1 =

0.1
0.1

, µ2 =

6.0
0.1

and Σ1 = Σ2 =

10 7
7 10
Here, N = 400. Since you generated the data, you already know which sample comes from
which class. Make a scatter plot of the data points showing the true cluster assignment of
each point either using different color codes or shape or both.
(b) Now, we assume that the true class labels are not known. Implement the k-means algorithm
for this problem. Write three functions: cost, km_e_step, and km_m_step as given in the
lecture (Here, km_ means k-means). Identify the correct arguments, and the order to run
them. Initialize the algorithm with
µˆ1 =

0.0
0.0

, ˆµ2 =

1.0
1.0

(4.1)
and run it until convergence. Show the resulting cluster assignments on a scatter plot either
using different color codes or shape or both. Also plot the cost vs. the number of iterations.
(c) Next, implement the EM algorithm for Gaussian mixtures. Write four functions: normal_density,
log-likelihood, em_e_step, and em_m_step as given in the lecture. Identify the correct
arguments, and the order to run them. Initialize the algorithm with means as in (4.1), covariances with Σˆ
1 = Σˆ
2 = I, and ˆπ1 = ˆπ2. Run the algorithm until convergence and show the
resulting cluster assignments on a scatter plot either using different color codes or shape or
both. Also plot the log-likelihood vs. the number of iterations. Report your misclassification
error.
(d) Comment on your findings, i.e., resulting cluster assignments, number of iterations until
convergence, etc. Experiment with different data realizations (generate new data), run your
algorithms, and summarize your findings. Does the algorithm performance depend on different realizations of data? Hint: In most cases, k-means should fail to identify the correct
cluster labels due to the covariance structure. There may be realizations that EM also fails
to find the clusters correctly but in general it should work better than k-means.
3