## Description

In this assignment, we'll fit both generative and discriminative models to the MNIST dataset [http://yann.lecun.com/exdb/mnist/] of handwritten numbers. Each datapoint in the MNIST dataset is a 28×28 black-and-white image of a number in {0, …, 9}, and a label indicating which number.

MNIST is the 'fruit fly' of machine learning: a simple, standard problem useful for comparing the properties of different algorithms. Python and R code for loading and plotting MNIST is attached.

You can use whichever programming language you like, along with libraries for loading and plotting data. You'll need to write your own initialization, fitting, and prediction code. You can use automatic differentiation in your code, but you must still answer the gradient questions.

For this assignment, we'll binarize the dataset, converting the grey pixel values to either black or white (0 or 1), with > 0.5 as the cutoff. When comparing models, we'll need a training and a test set. Use the first 10000 samples for training, and another 10000 for testing. This is all done for you in the starter code. Hint: also build a dataset of only 100 training samples to use when debugging, to make loading and training faster.
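As a concrete illustration of this preprocessing, here is a minimal numpy sketch. The `fake_mnist` array below is a stand-in for the real pixel data, since the actual loading is handled by the starter code.

```python
import numpy as np

def binarize(images, cutoff=0.5):
    """Convert grey pixel values in [0, 1] to binary {0, 1}, with > cutoff as the threshold."""
    return (images > cutoff).astype(np.float64)

# Stand-in for the real MNIST arrays loaded by the starter code.
rng = np.random.default_rng(0)
fake_mnist = rng.random((20000, 784))

x = binarize(fake_mnist)
x_train, x_test = x[:10000], x[10000:20000]   # first 10000 train, next 10000 test
x_debug = x_train[:100]                       # tiny subset for fast debugging
```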

**1. Fitting a Naïve Bayes Model, 25 pts.** In this question, we'll fit a naïve Bayes model to the MNIST digits using maximum likelihood. Naïve Bayes defines the joint probability of each datapoint x and its class label c as follows:

p(x, c | θ, π) = p(c | π) p(x | c, θ_c) = p(c | π) ∏_{d=1}^{784} p(x_d | c, θ_cd)

Here, θ is a matrix of probabilities for each pixel and each class, so its total size is 784 × 10.

For binary data, we can use the Bernoulli likelihood:

p(x_d | c, θ_cd) = Ber(x_d | θ_cd) = θ_cd^{x_d} (1 − θ_cd)^{(1 − x_d)}

which is just a way of expressing that p(x_d = 1 | c, θ_cd) = θ_cd.

For p(c | π), we can just use a categorical distribution:

p(c | π) = Cat(c | π) = π_c

Note that we need ∑_{i=0}^{9} π_i = 1.


(a) Derive the maximum likelihood estimate (MLE) for the class-conditional pixel means θ. Hint: we saw in lecture that the MLE can be thought of as 'counts' of the data, so what should θ̂_cd be counting?

(b) Derive the maximum a posteriori (MAP) estimate for the class-conditional pixel means θ, using a Beta(2, 2) prior on each θ_cd. Hint: it has a simple final form, and you can ignore the Beta normalizing constant.

(c) Fit θ to the training set using the MAP estimator. Plot θ as 10 separate greyscale images, one for each class.
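If your derivation in part (b) yields the usual pseudo-count form (one extra 'on' and one extra 'off' count per pixel from the Beta(2, 2) prior), the fit can be sketched as follows. The function name and the toy data are our own, not part of the starter code; check the formula against your own derivation.

```python
import numpy as np

def fit_theta_map(x, c, num_classes=10):
    """MAP estimate of the class-conditional pixel means under a Beta(2, 2) prior.

    Assumes the estimate works out to (on-pixel count + 1) / (class count + 2);
    verify this against your derivation in part (b).
    """
    theta = np.zeros((num_classes, x.shape[1]))
    for k in range(num_classes):
        xk = x[c == k]
        theta[k] = (xk.sum(axis=0) + 1.0) / (len(xk) + 2.0)
    return theta

# Toy check on random binary data; with real MNIST, each row theta[k] can be
# plotted with plt.imshow(theta[k].reshape(28, 28), cmap="gray").
rng = np.random.default_rng(1)
x = (rng.random((500, 784)) > 0.5).astype(float)
c = rng.integers(0, 10, size=500)
theta = fit_theta_map(x, c)
```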

(d) Derive the log-likelihood log p(c|x, θ, π) for a single training image.

(e) Given parameters fit to the training set, and π_c = 1/10, report both the average log-likelihood per datapoint, (1/N) ∑_{i=1}^{N} log p(c_i | x_i, θ, π), and the accuracy on both the training and test set. The accuracy is defined as the fraction of examples where the true class t = argmax_c p(c | x, θ, π).
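One hedged sketch of computing both metrics, using the log-sum-exp trick for the normalizer; the function names are our own, and the random θ below is a stand-in for your MAP fit from part (c).

```python
import numpy as np
from scipy.special import logsumexp

def class_log_posterior(x, theta, log_pi):
    """log p(c | x, θ, π) for binary rows x: maps (N, 784) -> (N, 10)."""
    # log p(x|c) = Σ_d [ x_d log θ_cd + (1 − x_d) log(1 − θ_cd) ]
    log_joint = x @ np.log(theta).T + (1 - x) @ np.log(1 - theta).T + log_pi
    return log_joint - logsumexp(log_joint, axis=1, keepdims=True)

def avg_ll_and_accuracy(x, c, theta, log_pi):
    log_post = class_log_posterior(x, theta, log_pi)
    avg_ll = log_post[np.arange(len(c)), c].mean()        # (1/N) Σ_i log p(c_i | x_i)
    acc = (log_post.argmax(axis=1) == c).mean()           # fraction where argmax matches
    return avg_ll, acc

# Toy check with random parameters and random binary data
rng = np.random.default_rng(0)
theta = rng.uniform(0.1, 0.9, size=(10, 784))   # stand-in for the MAP fit
log_pi = np.full(10, -np.log(10.0))             # π_c = 1/10
x = (rng.random((50, 784)) > 0.5).astype(float)
c = rng.integers(0, 10, size=50)
avg_ll, acc = avg_ll_and_accuracy(x, c, theta, log_pi)
```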

**2. Generating from a Naïve Bayes Model, 25 pts.** Defining a joint probabilistic model over the data lets us generate new data, and also lets us answer all sorts of queries about the data.

(a) True or false: given this model's assumptions, any two pixels x_i and x_j where i ≠ j are independent given c.

(b) True or false: given this model's assumptions, any two pixels x_i and x_j where i ≠ j are independent when marginalizing over c.

(c) Using the parameters fit in question 1, produce random image samples from the model. That is, randomly sample and plot 10 binary images from the marginal distribution p(x | θ, π). Hint: first sample c.
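The hinted ancestral-sampling procedure can be sketched as below; the function name and the stand-in parameters are our own, and in your solution θ and π come from question 1.

```python
import numpy as np

def sample_images(theta, pi, n, rng):
    """Ancestral sampling from p(x | θ, π): first draw c ~ Cat(π), then each x_d ~ Ber(θ_cd)."""
    cs = rng.choice(len(pi), size=n, p=pi)
    return (rng.random((n, theta.shape[1])) < theta[cs]).astype(float), cs

rng = np.random.default_rng(0)
theta = rng.uniform(0.1, 0.9, size=(10, 784))   # stand-in for the fitted MAP parameters
pi = np.full(10, 0.1)
samples, cs = sample_images(theta, pi, 10, rng)
# Each samples[i].reshape(28, 28) can be shown with plt.imshow(..., cmap="gray")
```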

(d) One of the advantages of generative models is that they can handle missing data, or be used to answer different sorts of questions about the model. Derive p(x_{i∈bottom} | x_top, θ, π), the marginal distribution of a single pixel in the bottom half of an image given the top half, conditioned on your fit parameters. Hint: you'll have to marginalize over c.

(e) For 20 images from the training set, plot the top half of the image concatenated with the marginal distribution over each pixel in the bottom half. That is, the bottom half of the image should use greyscale to represent the marginal probability of each pixel being on.
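If your answer to (d) takes the form p(x_d = 1 | x_top, θ, π) = Σ_c p(c | x_top, θ, π) θ_cd, the computation can be sketched as follows. The function name is our own, and we assume row-major pixel order, so the top half is the first 392 entries; the random θ is a stand-in for your fit.

```python
import numpy as np
from scipy.special import logsumexp

def bottom_half_marginals(x_top, theta, log_pi):
    """p(x_d = 1 | x_top, θ, π) for every bottom-half pixel d, marginalizing over c.

    Assumes row-major pixel order: the top half of the image is theta[:, :392].
    """
    th_top, th_bot = theta[:, :392], theta[:, 392:]
    log_joint = x_top @ np.log(th_top).T + (1 - x_top) @ np.log(1 - th_top).T + log_pi
    log_post = log_joint - logsumexp(log_joint, axis=1, keepdims=True)  # log p(c | x_top)
    return np.exp(log_post) @ th_bot                                    # Σ_c p(c | x_top) θ_cd

rng = np.random.default_rng(0)
theta = rng.uniform(0.1, 0.9, size=(10, 784))   # stand-in for the fitted parameters
log_pi = np.full(10, -np.log(10.0))
x_top = (rng.random((5, 392)) > 0.5).astype(float)
m = bottom_half_marginals(x_top, theta, log_pi)
```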

**3. Logistic Regression, 25 pts.** Now, we'll fit a simple predictive model using gradient descent. Our model will be multiclass logistic regression:

p(c | x, w) = exp(w_c^T x) / ∑_{c′=0}^{9} exp(w_{c′}^T x)

Omit bias parameters for this question.

(a) How many parameters does this model have?

(b) Derive the gradient of the predictive log-likelihood w.r.t. w: ∇w log p(c|x, w)

(c) Code up a gradient-based optimizer of your choosing (it can be just vanilla gradient descent), and use it to optimize w to maximize the log-likelihood of the training set; plot the resulting parameters using one image per class. Since this objective is concave, you can initialize at all zeros. You can use the gradient ascent updates (since it is a maximization) as covered in class, or use automatic differentiation (autograd). However, you are not permitted to use optimizers that come built in to packages! Hint: use scipy.special.logsumexp or its equivalent to make your code more numerically stable. Also, try to avoid nested for loops; use matrix operations instead to keep your code fast.
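If your gradient from (b) has the standard softmax form ∇_{w_k} log p(c | x, w) = (1[c = k] − p(k | x, w)) x, vanilla gradient ascent can be sketched like this. The function names, learning rate, step count, and the toy two-class data are all our own choices, not prescribed by the assignment.

```python
import numpy as np
from scipy.special import logsumexp

def log_softmax(logits):
    return logits - logsumexp(logits, axis=1, keepdims=True)

def grad_avg_ll(w, x, c_onehot):
    """Average gradient of log p(c | x, w): (1/N) xᵀ (one-hot labels − softmax probs)."""
    p = np.exp(log_softmax(x @ w))
    return x.T @ (c_onehot - p) / len(x)

def fit_logreg(x, c, num_classes=10, lr=0.5, steps=200):
    w = np.zeros((x.shape[1], num_classes))    # concave objective, so zeros are a fine start
    c_onehot = np.eye(num_classes)[c]
    for _ in range(steps):
        w += lr * grad_avg_ll(w, x, c_onehot)  # gradient ascent (maximization)
    return w

# Toy check: two classes separable through the origin (there is no bias term)
rng = np.random.default_rng(0)
x = np.vstack([rng.normal([2.0, 0.0], 0.5, size=(100, 2)),
               rng.normal([-2.0, 0.0], 0.5, size=(100, 2))])
c = np.repeat([0, 1], 100)
w = fit_logreg(x, c, num_classes=2)
```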

(d) Given parameters fit to the training set, report both the average log-likelihood per datapoint and the accuracy on both the training and test set. How does it compare to naïve Bayes?

**4. Unsupervised Learning, 25 pts.** In this problem, we will experiment with two clustering algorithms: (i) k-means, and (ii) the EM algorithm for Gaussian mixtures. In what follows, N denotes the number of data points, D denotes the dimension of each data point, x_i ∈ ℝ^D denotes a single data point, K denotes the number of clusters, and x_i ∼ p(x | C_k) for k = 1, …, K denotes the data-generating distribution.

(a) First, we will generate some data for this problem. Set the number of points N = 400, their dimension D = 2, and the number of clusters K = 2, and generate data from the distribution p(x | C_k) = N(µ_k, Σ_k). Sample 200 data points for C_1 and 200 for C_2, with

µ_1 = (0.1, 0.1)^T, µ_2 = (6.0, 0.1)^T, and Σ_1 = Σ_2 = [[10, 7], [7, 10]]

Here, N = 400. Since you generated the data, you already know which sample comes from which class. Make a scatter plot of the data points showing the true cluster assignment of each point, using different color codes or shapes or both.
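The sampling step above can be sketched with numpy's multivariate normal sampler; the function name is our own.

```python
import numpy as np

def generate_data(rng, n_per_class=200):
    """Sample n_per_class points from each of the two specified Gaussians."""
    cov = np.array([[10.0, 7.0], [7.0, 10.0]])
    x1 = rng.multivariate_normal(np.array([0.1, 0.1]), cov, size=n_per_class)
    x2 = rng.multivariate_normal(np.array([6.0, 0.1]), cov, size=n_per_class)
    x = np.vstack([x1, x2])
    labels = np.repeat([0, 1], n_per_class)   # true cluster assignments
    return x, labels

x, labels = generate_data(np.random.default_rng(0))
# plt.scatter(x[:, 0], x[:, 1], c=labels) shows the true assignments
```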

(b) Now, we assume that the true class labels are not known. Implement the k-means algorithm for this problem. Write three functions: cost, km_e_step, and km_m_step, as given in the lecture (here, km_ means k-means). Identify the correct arguments, and the order in which to run them. Initialize the algorithm with

µ̂_1 = (0.0, 0.0)^T, µ̂_2 = (1.0, 1.0)^T    (4.1)

and run it until convergence. Show the resulting cluster assignments on a scatter plot, using different color codes or shapes or both. Also plot the cost vs. the number of iterations. Report your misclassification error.
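One possible shape for the three required functions, sketched under the assumption that `cost` is the total squared distance, the E-step assigns points to their nearest centre, and the M-step recomputes centre means (match the argument conventions from your lecture notes, which may differ):

```python
import numpy as np

def cost(x, mu, r):
    """k-means cost: total squared distance from each point to its assigned centre."""
    return ((x - mu[r]) ** 2).sum()

def km_e_step(x, mu):
    """Assignment step: each point goes to its nearest centre."""
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def km_m_step(x, r, k):
    """Update step: move each centre to the mean of its assigned points."""
    return np.vstack([x[r == j].mean(axis=0) for j in range(k)])

def kmeans(x, mu0, max_iters=100):
    mu, costs = mu0.copy(), []
    for _ in range(max_iters):
        r = km_e_step(x, mu)
        mu = km_m_step(x, r, len(mu))
        costs.append(cost(x, mu, r))
        if len(costs) > 1 and np.isclose(costs[-1], costs[-2]):
            break
    return mu, r, costs

# Toy check on two well-separated blobs (not the assignment's overlapping data)
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)), rng.normal(5.0, 0.1, size=(50, 2))])
mu0 = np.array([[0.0, 0.0], [1.0, 1.0]])
mu, r, costs = kmeans(x, mu0)
```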

(c) Next, implement the EM algorithm for Gaussian mixtures. Write four functions: normal_density, log_likelihood, em_e_step, and em_m_step, as given in the lecture. Identify the correct arguments, and the order in which to run them. Initialize the algorithm with means as in (4.1), covariances Σ̂_1 = Σ̂_2 = I, and π̂_1 = π̂_2 = 0.5. Run the algorithm until convergence and show the resulting cluster assignments on a scatter plot, using different color codes or shapes or both. Also plot the log-likelihood vs. the number of iterations. Report your misclassification error.
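A hedged sketch of the four required functions, assuming the standard EM updates for a Gaussian mixture (responsibilities in the E-step; weighted means, covariances, and mixing proportions in the M-step); again, match the argument conventions from your lecture notes.

```python
import numpy as np

def normal_density(x, mu, sigma):
    """N(x | μ, Σ) evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mu
    expo = -0.5 * np.einsum('ni,ij,nj->n', diff, np.linalg.inv(sigma), diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))

def log_likelihood(x, pi, mus, sigmas):
    dens = np.column_stack([p * normal_density(x, m, s) for p, m, s in zip(pi, mus, sigmas)])
    return np.log(dens.sum(axis=1)).sum()

def em_e_step(x, pi, mus, sigmas):
    """Responsibilities γ_nk ∝ π_k N(x_n | μ_k, Σ_k), normalized over k."""
    dens = np.column_stack([p * normal_density(x, m, s) for p, m, s in zip(pi, mus, sigmas)])
    return dens / dens.sum(axis=1, keepdims=True)

def em_m_step(x, gamma):
    """Weighted-mean updates of mixing proportions, means, and covariances."""
    nk = gamma.sum(axis=0)
    pi = nk / len(x)
    mus = (gamma.T @ x) / nk[:, None]
    sigmas = [(gamma[:, k, None] * (x - mus[k])).T @ (x - mus[k]) / nk[k]
              for k in range(gamma.shape[1])]
    return pi, mus, sigmas

# A few iterations on toy blobs; the log-likelihood should never decrease
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), rng.normal(5.0, 1.0, size=(100, 2))])
pi, mus, sigmas = np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]), [np.eye(2)] * 2
lls = []
for _ in range(15):
    gamma = em_e_step(x, pi, mus, sigmas)
    pi, mus, sigmas = em_m_step(x, gamma)
    lls.append(log_likelihood(x, pi, mus, sigmas))
```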

(d) Comment on your findings, i.e., the resulting cluster assignments, the number of iterations until convergence, etc. Experiment with different data realizations (generate new data), run your algorithms, and summarize your findings. Does the algorithms' performance depend on the realization of the data? Hint: in most cases, k-means should fail to identify the correct cluster labels due to the covariance structure. There may be realizations for which EM also fails to find the clusters correctly, but in general it should work better than k-means.
