Naive Bayes and Logistic Regression for Text Classification
In this homework you will implement and evaluate Naive Bayes and Logistic Regression for text
classification. Note from the TA: You must use Python 3 (at least 3.9) to implement
0 Points Download the spam/ham (ham is not spam) datasets (links are posted on Microsoft TEAMs;
see under homeworks). The datasets were used in the Metsis et al. paper . There are three
datasets. You have to perform the experiments described below on all the three datasets.
Each data set is divided into two (sub)sets: training set and test set. Each of them has two
directories: spam and ham. All files in the spam and ham folders are spam and ham messages
20 points Convert the text data into a matrix of features × examples (namely our canonical data
representation), using the following approaches.
– Bag of Words model: Recall that we use a vocabulary having w words—the set of
all unique words in the training set—and represent each email using a vector of word
frequencies (the number of times each word in the vocabulary appears in the email).
– Bernoulli model: As before we use a vocabulary having w words and represent each
email (training example) using a 0/1 vector of length w where 0 indicates that the word
does not appear in the email and 1 indicates that the word appears in the email.
Thus you will convert each of the three “text” datasets into two datasets, one using the Bag
of words model and the other using the Bernoulli model.
20 points Implement the multinomial Naive Bayes algorithm for text classification described here:
http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf (see Figure 13.2 in the document).
Note that the algorithm uses add-one laplace smoothing. Make sure that you do all the calculations in log-scale to avoid underflow. Use your algorithm to learn from the training set
and report accuracy on the test set. Important: Use the datasets generated using the Bag
of words model and not the Bernoulli model for this part.
20 points Implement the discrete Naive Bayes algorithm we discussed in class. To prevent zeros, use
add-one laplace smoothing. Make sure that you do all the calculations in log-scale to avoid
underflow. Use your algorithm to learn from the training set and report accuracy on the test
set. Important: Use the datasets generated using the Bernoulli model and not the Bag of
words model for this part.
30 points Implement the MCAP Logistic Regression algorithm with L2 regularization that we discussed
in class (see Mitchell’s new book chapter). Try different values of λ. Divide the given training
set into two sets using a 70/30 split (namely the first split has 70% of the examples and the
second split has the remaining 30%). Learn parameters using the 70% split, treat the 30%
data as validation data and use it to select a value for λ.
Then, use the chosen value of
λ to learn the parameters using the full training set and report accuracy on the test set.
Use gradient ascent for learning the weights (you have to set the learning rate appropriately.
Otherwise, your algorithm may diverge or take a long time to converge). Do not run gradient
ascent until convergence; you should put a suitable hard limit on the number of iterations.
Important: Use the datasets generated using both the Bernoulli model and the Bag of words
model for this part.
10 points Run the SGDClassifier from scikit-learn on the datasets. Tune the parameters (e.g., loss
function, penalty, etc.) of the SGDClassifier using GridSearchCV in scikit-learn. Compare
the results you obtain for SGDClassifier with your implementation of Logistic Regression.
Important: Use the datasets generated using both the Bernoulli model and the Bag of
words model for this part.
Extra Credit Use the automatic parameter tying method described here (http://www.hlt.utdallas.edu/
~vgogate/papers/aaai18.pdf) on the three algorithms. Again, treat k (the number of
unique parameters) as a hyper-parameter and select it using the 70-30 split. Does parameter
tying improve accuracy of Naive Bayes, Logistic Regression or SGDClassifier?
What to Turn in
• Your code and a Readme file for compiling and executing your code.
• A detailed write up that reports the accuracy, precision, recall and F1 score obtained on the
three test sets for the following algorithms and feature types:
– Multinomial Naive Bayes on the Bag of words model
– Discrete Naive Bayes on the Bernoulli model
– Logistic Regression on both Bag of words and Bernoulli models
– SGDClassifier on both Bag of words and Bernoulli models
Your report should also describe how you tuned the hyper-parameters for logisitic regression
and SGDClassifier for each dataset (specifically values of λ used, hard limit on the number of
iterations, etc.). The write up should be self contained in that we should be able to replicate
• Answer the following questions:
1. Which data representation and algorithm combination yields the best performance (measured in terms of the accuracy, precision, recall and F1 score) and why?
2. Does Multinomial Naive Bayes perform better (again performance is measured in terms
of the accuracy, precision, recall and F1 score) than LR and SGDClassifier on the Bag
of words representation? Explain your yes/no answer.
3. Does Discrete Naive Bayes perform better (again performance is measured in terms of
the accuracy, precision, recall and F1 score) than LR and SGDClassifier on the Bernoulli
representation? Explain your yes/no answer.
4. Does your LR implementation outperform the SGDClassifier (again performance is measured in terms of the accuracy, precision, recall and F1 score) or is the difference in
performance minor? Explain your yes/no answer.
 V. Metsis, I. Androutsopoulos and G. Paliouras, “Spam Filtering with Naive Bayes – Which Naive
Bayes?”. Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS 2006), Mountain
View, CA, USA, 2006.