Description
Q1 (5 points): Run the Mallet NB learner (i.e., the trainer’s name is NaiveBayes) with train.vectors.txt
as the training data and test.vectors.txt as the test data. In your note file, write down the training
accuracy and the test accuracy.
Q2 (35 points): Write a script, build_NB1.sh, that implements the multi-variate Bernoulli NB
model. It builds an NB model from the training data, classifies the training and test data, and calculates
the accuracy.
• The learner should treat all features as binary; that is, the feature is considered present iff its
value is nonzero.
• The format is: build_NB1.sh training_data test_data class_prior_delta cond_prob_delta model_file
sys_output > acc_file
• training_data and test_data are the vector files in the text format (cf. train.vectors.txt).
• class_prior_delta is the δ used in add-δ smoothing when calculating the class prior P(c); cond_prob_delta
is the δ used in add-δ smoothing when calculating the conditional probability P(f | c).
• model_file stores the values of P(c) and P(f | c) (cf. model1).
Comment lines start with “%”. The line for P(c) has the format “classname P(c) logprob”,
where logprob is the base-10 logarithm of P(c).
The line for P(f | c) has the format “featname classname P(f|c) logprob”, where logprob is
the base-10 logarithm of P(f | c).
• sys_output is the classification result on the training and test data (cf. sys1). Each line has the
following format:
instanceName true_class_label c1 p1 c2 p2 …, where p_i = P(c_i | x) = P(c_i, x) / P(x).
The (c_i, p_i) pairs should be sorted by the value of p_i in descending order.
• acc_file shows the confusion matrix and the accuracy for the training and the test data (cf.
acc1).
• As always, model1, sys1, and acc1 are NOT gold standard. These files were created with a
much smaller training dataset.
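The Bernoulli model described above can be sketched as follows. This is a minimal illustration, not a reference solution. The function names are hypothetical. The input layout "instanceName label feat:value ..." is an assumption about the text vector format. The smoothing formulas P(c) = (δ + count(c)) / (δ·|C| + N) and P(f | c) = (δ + count(f, c)) / (2δ + count(c)) are one common reading of add-δ smoothing. Features never seen in training are ignored at classification time, and cond_prob_delta is assumed to be greater than 0.

```python
import math
from collections import defaultdict


def parse_line(line):
    """Parse 'name label feat:value ...' (assumed layout) into (name, label, present features)."""
    parts = line.split()
    name, label = parts[0], parts[1]
    present = set()
    for token in parts[2:]:
        feat, value = token.rsplit(":", 1)
        if float(value) != 0:  # binary view: present iff the value is nonzero
            present.add(feat)
    return name, label, present


def train_bernoulli(instances, class_prior_delta, cond_prob_delta):
    """instances: list of (name, label, set_of_present_features). Returns (prior, cond)."""
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for _, label, feats in instances:
        class_count[label] += 1
        vocab |= feats
        for f in feats:
            feat_count[label][f] += 1
    n, k = len(instances), len(class_count)
    prior = {c: (class_count[c] + class_prior_delta) / (n + class_prior_delta * k)
             for c in class_count}
    # Each feature is a two-valued variable, hence 2 * delta in the denominator.
    cond = {c: {f: (feat_count[c][f] + cond_prob_delta)
                   / (class_count[c] + 2 * cond_prob_delta)
                for f in vocab}
            for c in class_count}
    return prior, cond


def classify_bernoulli(feats, prior, cond):
    """Return [(class, P(class | x)), ...] sorted by probability, highest first."""
    log_joint = {}
    for c in prior:
        lp = math.log10(prior[c])
        for f, p in cond[c].items():
            # Absent features contribute too: that is what makes the model Bernoulli.
            lp += math.log10(p) if f in feats else math.log10(1 - p)
        log_joint[c] = lp
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_joint.values())
    unnorm = {c: 10 ** (lp - m) for c, lp in log_joint.items()}
    z = sum(unnorm.values())
    return sorted(((c, u / z) for c, u in unnorm.items()),
                  key=lambda cp: cp[1], reverse=True)
```

Working in log space and normalizing at the end yields P(c | x) without ever computing P(x) directly, which would underflow for long documents.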
Run build_NB1.sh with train.vectors.txt as the training data, test.vectors.txt as the test data,
and class_prior_delta set to 0:
• Fill out Table 1 with different values of cond_prob_delta.
• Store the model_file, sys_output, and acc_file for the second row (when cond_prob_delta is 0.5)
under q2/.
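The line formats described for the model file and the system output could be produced with helpers like the ones below. This is only a sketch: the helper names are hypothetical, and the numeric formatting (Python's default float rendering) is an assumption, since the handout does not fix a precision.

```python
import math


def prior_line(classname, prob):
    """'classname P(c) logprob', with logprob the base-10 log of P(c)."""
    return f"{classname} {prob} {math.log10(prob)}"


def cond_line(featname, classname, prob):
    """'featname classname P(f|c) logprob', with logprob the base-10 log of P(f|c)."""
    return f"{featname} {classname} {prob} {math.log10(prob)}"


def sys_output_line(name, true_label, ranked):
    """'instanceName true_label c1 p1 c2 p2 ...'; ranked must already be sorted by p, descending."""
    pairs = " ".join(f"{c} {p}" for c, p in ranked)
    return f"{name} {true_label} {pairs}"
```

Comparing the generated lines against model1 and sys1 is a quick way to check the field order, keeping in mind that those files are not gold standard.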
Table 1: Results of your Bernoulli NB model
cond_prob_delta    Training accuracy    Test accuracy
0.1
0.5
1.0
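The accuracy values for the table, and the confusion matrix that goes with them, can be computed from (true label, predicted label) pairs. The sketch below uses a hypothetical function name and leaves the printed layout of the matrix open, since the exact layout of acc1 is not reproduced here.

```python
from collections import Counter


def confusion_and_accuracy(pairs):
    """pairs: iterable of (true_label, predicted_label).

    Returns (matrix, accuracy), where matrix[(true, predicted)] is a count.
    """
    matrix = Counter(pairs)
    total = sum(matrix.values())
    correct = sum(n for (truth, pred), n in matrix.items() if truth == pred)
    accuracy = correct / total if total else 0.0
    return matrix, accuracy
```

The predicted label for an instance is simply the first class in its sorted (class, probability) list.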
Q3 (35 points): Write a script, build_NB2.sh, that implements the multinomial NB model. Other
than the modeling (e.g., the features in the multinomial NB model are real-valued), everything else
(e.g., the input/output files) is the same as in Q2.
• Fill out Table 2.
• Store the model_file, sys_output, and acc_file for the second row (when cond_prob_delta is 0.5)
under q3/.
Table 2: Results of your multinomial NB model
cond_prob_delta    Training accuracy    Test accuracy
0.1
0.5
1.0
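For comparison with the Bernoulli model, the multinomial variant can be sketched as below. The function names are hypothetical, and the smoothing formula P(f | c) = (δ + count(f, c)) / (δ·|V| + Σ_f' count(f', c)) is an assumed, commonly used form of add-δ smoothing. Feature values are used as counts, absent features contribute nothing, features unseen in training are skipped, and cond_prob_delta is assumed to be greater than 0.

```python
import math
from collections import defaultdict


def train_multinomial(instances, class_prior_delta, cond_prob_delta):
    """instances: list of (label, {feature: value}). Returns (prior, cond)."""
    class_count = defaultdict(int)
    feat_total = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for label, counts in instances:
        class_count[label] += 1
        for f, v in counts.items():
            feat_total[label][f] += v
            vocab.add(f)
    n, k = len(instances), len(class_count)
    prior = {c: (class_count[c] + class_prior_delta) / (n + class_prior_delta * k)
             for c in class_count}
    cond = {}
    for c in class_count:
        # Total feature mass in class c, plus delta for every vocabulary entry.
        denom = sum(feat_total[c].values()) + cond_prob_delta * len(vocab)
        cond[c] = {f: (feat_total[c][f] + cond_prob_delta) / denom for f in vocab}
    return prior, cond


def classify_multinomial(counts, prior, cond):
    """Return [(class, P(class | x)), ...] sorted by probability, highest first."""
    log_joint = {}
    for c in prior:
        lp = math.log10(prior[c])
        for f, v in counts.items():
            if f in cond[c]:  # skip features never seen in training
                lp += v * math.log10(cond[c][f])
        log_joint[c] = lp
    m = max(log_joint.values())  # shift by the maximum to avoid underflow
    unnorm = {c: 10 ** (lp - m) for c, lp in log_joint.items()}
    z = sum(unnorm.values())
    return sorted(((c, u / z) for c, u in unnorm.items()),
                  key=lambda cp: cp[1], reverse=True)
```

The key difference from the Bernoulli sketch is that only the features present in the instance enter the sum, each weighted by its value.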
Submission: Submit the following to Canvas:
• Your note file readme.(txt | pdf) that includes Tables 1 and 2, and any notes that you want the
TA to read.
• hw.tar.gz that includes all the files specified in dropbox/18-19/572/hw3/submit-file-list, plus any
source code (and binary code) used by the shell scripts.
• Make sure that you run check_hw3.sh before submitting your hw.tar.gz.