Description
1. Basic optimization. (50 points.)
The background of logistic regression will be discussed in the next lecture. Here, we focus on the properties of the optimization problem that arises when training a logistic regression model.
Consider a simplified logistic regression problem. We are given $n$ training samples $(x_i, y_i)$, $i = 1, \ldots, n$, with data $x_i \in \mathbb{R}$ and labels $y_i \in \{0, 1\}$. To train/fit a logistic regression model for classification, we solve the following optimization problem, where $\theta \in \mathbb{R}$ is the parameter we aim to find:
\[
\max_{\theta} \; \ell(\theta), \qquad (1)
\]
where the log-likelihood function is
\[
\ell(\theta) = \sum_{i=1}^{n} \left\{ -\log\left(1 + \exp\{-\theta x_i\}\right) + (y_i - 1)\,\theta x_i \right\}.
\]
(a) (15 points) Derive the gradient of the cost function $\ell(\theta)$ in (1) and write pseudo-code for performing gradient descent to find the optimizer $\theta^\ast$. This is essentially what the training procedure does. Pseudo-code means you write down the steps of the algorithm, not necessarily in any specific programming language.
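For illustration only, here is a minimal NumPy sketch of what such an iterative scheme might look like in Python. The gradient inside grad uses the standard logistic-regression form $\sum_i x_i (y_i - \sigma(\theta x_i))$; check it against your own derivation in (a). The step size eta, iteration budget, and stopping tolerance are arbitrary placeholder choices, not part of the problem statement.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(theta, x, y):
    # Gradient of the log-likelihood l(theta); verify against your derivation in (a).
    return np.sum(x * (y - sigmoid(theta * x)))

def gradient_ascent(x, y, eta=0.01, n_iters=1000, tol=1e-8):
    theta = 0.0
    for _ in range(n_iters):
        g = grad(theta, x, y)
        theta_new = theta + eta * g        # ascent step: we are maximizing l(theta)
        if abs(theta_new - theta) < tol:   # stop once the update becomes negligible
            break
        theta = theta_new
    return theta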
(b) (15 points) Present a stochastic gradient descent algorithm for solving the logistic regression training problem (1).
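As a rough sketch of the stochastic variant (again illustrative only, with arbitrary learning rate and epoch count): each update uses the gradient of a single randomly chosen term of $\ell(\theta)$ instead of the full sum.

import numpy as np

def sgd(x, y, eta=0.01, n_epochs=50, seed=0):
    # Stochastic gradient ascent on l(theta): one randomly ordered sample per update.
    rng = np.random.default_rng(seed)
    theta = 0.0
    n = len(x)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            g_i = x[i] * (y[i] - 1.0 / (1.0 + np.exp(-theta * x[i])))
            theta += eta * g_i
    return theta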
(c) (20 points) We will show that the training problem for basic logistic regression is concave. Derive the Hessian of $\ell(\theta)$ and, based on this, show that the training problem (1) is concave. Explain why the problem can be solved efficiently and why gradient descent will achieve a unique global optimizer, as we discussed in class.
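As a reminder of the general criterion (this is not itself the solution), for a twice-differentiable function of a scalar parameter,
\[
\ell \text{ is concave on } \mathbb{R} \iff \ell''(\theta) \le 0 \ \text{for all } \theta,
\]
and if $\ell$ is concave, any stationary point $\theta^\ast$ with $\ell'(\theta^\ast) = 0$ is a global maximizer; if moreover $\ell''(\theta) < 0$ for all $\theta$, that maximizer is unique.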
2. Comparing Bayes, logistic, and KNN classifiers. (50 points)
In lectures we have learned three different classifiers. This question asks you to implement and compare them. We suggest using Scikit-learn, a commonly used and powerful Python library with a variety of machine learning tools, but you may also use a similar library in another language of your choice to perform the tasks.
Part One (Divorce classification/prediction). (30 points)
This dataset is about participants who completed a personal information form and a divorce predictors scale. The data is a modified version of the dataset publicly available at https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set (noise has been injected so you will not replicate the results on the UCI website). There are 170 participants and 54 attributes (or predictor variables), all real-valued. The dataset is provided in q3.csv. The last column of the CSV file is the label y (1 means “divorce”, 0 means “no divorce”). Each column is one feature (predictor variable), and each row is one sample (participant). A detailed explanation of each feature (predictor variable) can be found at the website linked above. Our goal is to build a classifier using training data, such that given a test sample, we can classify (or essentially predict) whether its label is 0 (“no divorce”) or 1 (“divorce”).
Build three classifiers (Naive Bayes, Logistic Regression, KNN). Use the first 80% of the data for training and the remaining 20% for testing. If you use scikit-learn, you can use train_test_split to split the dataset.
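If you go the scikit-learn route, a minimal sketch of the workflow might look like the following. The file name q3.csv is the one given above; shuffle=False is one way to keep the first 80% of rows for training, and the KNN neighbor count is an arbitrary placeholder, not a required setting.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("q3.csv")                 # add header=None if the file has no header row
X, y = data.iloc[:, :-1].values, data.iloc[:, -1].values   # last column is the label y

# shuffle=False keeps the first 80% of rows for training, as the problem asks.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),   # n_neighbors is a placeholder choice
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))        # testing accuracy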
(a) Report the testing accuracy for each of the three classifiers. Comment on their performance: which performs the best, and make a guess as to why it performs the best in this setting.
(b) Use only the first two features to train three new classifiers. Plot the data points and the decision boundary of each classifier. Comment on the differences between the decision boundaries of the three classifiers. Please clearly represent the data points with different labels using different colors.
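One common way to visualize a decision boundary for a two-feature classifier is to evaluate the trained model on a dense grid and color the resulting regions. A sketch of that idea is below; plot_boundary is a hypothetical helper name, X2 is assumed to hold only the first two feature columns, and clf is assumed to be a classifier already fitted on those two features.

import numpy as np
import matplotlib.pyplot as plt

def plot_boundary(clf, X2, y, title):
    # Evaluate the fitted classifier on a grid spanning the first two features.
    x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
    y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                         np.linspace(y_min, y_max, 300))
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, zz, alpha=0.3)
    # Different colors for the two labels, as the problem requires.
    plt.scatter(X2[y == 0, 0], X2[y == 0, 1], c="blue", label="no divorce (0)")
    plt.scatter(X2[y == 1, 0], X2[y == 1, 1], c="red", label="divorce (1)")
    plt.title(title)
    plt.legend()
    plt.show()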
Part Two (Handwritten digits classification). (20 points) Repeat the above using the MNIST data from our Homework 3. Here, give “digit” 6 the label y = 1, and give “digit” 2 the label y = 0. All the pixels in each image are the features (predictor variables) for that sample (i.e., image). Our goal is to build classifiers such that, given a new test sample, we can tell whether it is a 2 or a 6. Use the first 80% of the samples for training and the remaining 20% for testing. Report the classification accuracy on the testing data for each of the three classifiers. Comment on their performance: which performs the best, and make a guess as to why it performs the best in this setting.
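The loading step depends on how the Homework 3 data is stored, so the sketch below only covers the relabeling and split; it assumes you already have an array images (one flattened image per row) and an array labels of digit values, and the helper name is hypothetical.

import numpy as np

def make_two_vs_six(images, labels):
    """Relabel digits 2/6 as 0/1 and split the first 80% for training.

    `images` is assumed to be an (n_samples, n_pixels) array and `labels`
    the corresponding digit values, loaded from your Homework 3 data.
    """
    mask = (labels == 2) | (labels == 6)
    X = images[mask]
    y = (labels[mask] == 6).astype(int)   # digit 6 -> y = 1, digit 2 -> y = 0
    split = int(0.8 * len(X))             # first 80% train, remaining 20% test
    return X[:split], X[split:], y[:split], y[split:]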
3. Deriving M-step in EM for GMM. (Bonus question: 10 points)
Consider the Q function in EM for GMM:
\[
Q(\theta \mid \theta^{k}) = \sum_{i=1}^{n} \sum_{c=1}^{C} p_{i,c} \log \pi_c + \sum_{i=1}^{n} \sum_{c=1}^{C} p_{i,c} \log \phi(x_i \mid \mu_c, \Sigma_c),
\]
where $\theta = \{\pi_c, \mu_c, \Sigma_c\}_{c=1,\ldots,C}$, and $p_{i,c} \propto \pi_c^{(k)} \phi(x_i \mid \mu_c^{(k)}, \Sigma_c^{(k)})$, $i = 1, \ldots, n$, $c = 1, \ldots, C$, depend on the parameters from the previous iteration. Here $\phi(x \mid \mu, \Sigma)$ denotes the pdf of a multivariate Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$.
Solve for the $\pi_c$, $\mu_c$, and $\Sigma_c$ that maximize the $Q(\theta \mid \theta^{k})$ function. In other words, we want to find
\[
\theta^{k+1} := \arg\max_{\theta} \; Q(\theta \mid \theta^{k})
\qquad \text{subject to} \qquad \sum_{c=1}^{C} \pi_c = 1.
\]
Show that the solution $\theta^{k+1}$ is given by the following expressions:
\[
\pi_c^{(k+1)} = \frac{\sum_{i=1}^{n} p_{i,c}}{n}, \qquad
\mu_c^{(k+1)} = \frac{\sum_{i=1}^{n} p_{i,c}\, x_i}{\sum_{i=1}^{n} p_{i,c}}, \qquad
\Sigma_c^{(k+1)} = \frac{\sum_{i=1}^{n} p_{i,c}\,\bigl(x_i - \mu_c^{(k)}\bigr)\bigl(x_i - \mu_c^{(k)}\bigr)^{T}}{\sum_{i=1}^{n} p_{i,c}}.
\]
(Hint: Consider the Lagrangian function for this constrained optimization problem. You only need to introduce one Lagrange multiplier because there is only one constraint. Then solve it from there.)
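For concreteness, and only as a starting point rather than the full solution, the part of the Lagrangian involving the mixing weights can be written with a single multiplier $\lambda$ for the single constraint:
\[
L(\pi, \lambda) = \sum_{i=1}^{n} \sum_{c=1}^{C} p_{i,c} \log \pi_c + \lambda \Big( \sum_{c=1}^{C} \pi_c - 1 \Big).
\]
Setting $\partial L / \partial \pi_c = 0$ for each $c$ and then enforcing the constraint gives the update for $\pi_c^{(k+1)}$; the updates for $\mu_c$ and $\Sigma_c$ follow from unconstrained maximization of the second sum in $Q$.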
4. Naive Bayes for spam filtering. (Bonus question: 20 points)
In this problem we will use the Naive Bayes algorithm to fit a spam filter by hand. This will enhance your understanding of the Bayes classifier and build intuition.
Spam filters are used in all email services to classify received emails as “Spam” or “Not Spam”. A simple approach involves maintaining a vocabulary of words that commonly occur in “Spam” emails and classifying an email as “Spam” if the number of words from the dictionary that are present in the email exceeds a certain threshold. We are given a vocabulary consisting of 15 words
V = {secret, offer, low, price, valued, customer, today, dollar, million, sports, is, for, play, healthy, pizza}.
We will use $V_i$ to represent the $i$-th word in $V$. As our training dataset, we are also given 3 example spam messages,
• million dollar offer
• secret offer today
• secret is secret
and 4 example non-spam messages
• low price for valued customer
• play secret sports today
• sports is healthy
• low price pizza
Recall that the Naive Bayes classifier assumes that the probability of an input $x = [x_1, x_2, \ldots, x_n]^T$ depends on its class $y$. In our case, the input vector $x$ corresponding to each message has length $n = 15$, equal to the number of words in the vocabulary $V$, where each entry $x_i$ is equal to the number of times word $V_i$ occurs in $x$.
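As an illustration of how a message maps to its count vector $x$: the vocabulary order below follows $V$ as listed above, and the helper name to_count_vector is just an example of mine, not required by the problem.

VOCAB = ["secret", "offer", "low", "price", "valued", "customer", "today",
         "dollar", "million", "sports", "is", "for", "play", "healthy", "pizza"]

def to_count_vector(message):
    # x_i = number of times word V_i occurs in the message.
    words = message.split()
    return [words.count(v) for v in VOCAB]

# Example: the third spam message from the training set.
print(to_count_vector("secret is secret"))   # "secret" appears twice, "is" once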
(a) Calculate P(y = 0) and P(y = 1) from the training data, where y = 0 corresponds to spam
messages, and y = 1 corresponds to non-spam messages.
(b) List the feature vector x for each spam and non-spam message.
(c) In the Naive Bayes model, the likelihood of a sentence with feature vector $x$ given a class $c$ is
\[
P(x \mid y = c) = \prod_{k=1}^{n} \theta_{c,k}^{x_k},
\]
where $\theta_{c,k} \in (0, 1)$ is the weight of word $k$ in class $c$, which satisfies $\sum_{k=1}^{n} \theta_{c,k} = 1$ for all $c$. Calculate the maximum likelihood estimates of $\theta_{0,1}$, $\theta_{0,7}$, $\theta_{1,1}$, and $\theta_{1,15}$ by maximizing $P(x \mid y = c)$ with respect to $\theta_{c,k}$ given the data. (Hint: Consider the Lagrangian function for this constrained optimization problem. You only need to introduce one Lagrange multiplier because you only have one constraint. Consider the log-likelihood (i.e., take the log of the cost function). Then solve it from there.)
(d) Given a new message “today is secret”, decide whether it is spam or not spam, based on the Naive Bayes classifier learned from the above data.

