Description
PART I: PROBLEM SET
Your solutions to the problems will be submitted as a single PDF document. Be certain that your problems
are well-numbered and that it is clear what your answers are. Additionally, you will be required to duplicate
your answers to particular problems in the README file that you will submit.
1 Probability decision boundary (10pts)
Consider a case where we have learned a conditional probability distribution P(y | x). Suppose there are
only two classes, and let p0 = P(y = 0 | x) and let p1 = P(y = 1 | x). A loss matrix gives the cost that
is incurred for each element of the confusion matrix. (E.g., true positives might cost nothing, but a false
positive might cost us $10.) Consider the following loss matrix:
                     y = 0 (true)    y = 1 (true)
ŷ = 0 (predicted)         0              10
ŷ = 1 (predicted)         5               0
(a) Show that the decision ŷ that minimizes the expected loss is equivalent to setting a probability threshold
θ and predicting ŷ = 0 if p1 < θ and ŷ = 1 if p1 ≥ θ.
(b) What is the threshold for this loss matrix?
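A quick way to sanity-check your answer to part (b) is to compute the expected loss of each possible decision directly. The sketch below uses the loss matrix above (indexed as loss[predicted][true]) and returns whichever prediction minimizes expected loss for a given p1:

```python
# Loss matrix from the problem, indexed as loss[predicted][true].
loss = {0: {0: 0, 1: 10},   # predict 0: a false negative costs 10
        1: {0: 5, 1: 0}}    # predict 1: a false positive costs 5

def bayes_decision(p1):
    """Return the prediction (0 or 1) with the smaller expected loss."""
    p = {0: 1 - p1, 1: p1}
    expected = {yhat: sum(loss[yhat][y] * p[y] for y in (0, 1))
                for yhat in (0, 1)}
    return min(expected, key=expected.get)
```

For instance, bayes_decision(0.9) returns 1 and bayes_decision(0.1) returns 0; sweeping p1 between these values reveals the threshold θ.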
2 Double counting the evidence (15pts)
Consider a problem in which the binary class label Y ∈ {T, F} and each training example x has 2 binary
attributes X1, X2 ∈ {T, F}.
Let the class prior be p(Y = T) = 0.5 and p(X1 = T | Y = T) = 0.8 and p(X2 = T | Y = T) = 0.5. Likewise,
p(X1 = F | Y = F) = 0.7 and p(X2 = F | Y = F) = 0.9. Attribute X1 provides slightly stronger evidence
about the class label than X2.
Assume X1 and X2 are truly independent given Y. Write down the naive Bayes decision rule.
(a) What is the expected error rate of naive Bayes if it uses only attribute X1? What if it uses only X2?
The expected error rate is the probability that the true class generates an observation on which the decision
rule is incorrect. If Y is the true class label, let Ŷ(X1, X2) be the predicted class label. Then the
expected error rate is p(Y ≠ Ŷ(X1, X2)), i.e., the total probability mass p(X1, X2, Y) summed over all
configurations where Y ≠ Ŷ(X1, X2).
(b) Show that if naive Bayes uses both attributes, X1 and X2, the error rate is 0.235, which is better than
if using only a single attribute (X1 or X2).
(c) Now suppose that we create a new attribute X3 that is an exact copy of X2. So for every training example,
attributes X2 and X3 have the same value. What is the expected error of naive Bayes now?
(d) Briefly explain what is happening with naive Bayes (2 sentences max).
(e) Does logistic regression suffer from the same problem? Briefly explain why (2 sentences max).
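Though these parts are meant to be worked analytically, a short enumeration makes a handy sanity check. The sketch below (using the conditional probabilities given above) computes the expected error rate of naive Bayes with both attributes by summing the joint probability of every (x1, x2, y) configuration the decision rule misclassifies; it should reproduce the 0.235 from part (b):

```python
# Class prior and class-conditional probabilities from the problem statement.
p_y = {True: 0.5, False: 0.5}
p_x1_true = {True: 0.8, False: 0.3}   # p(X1 = T | Y = y)
p_x2_true = {True: 0.5, False: 0.1}   # p(X2 = T | Y = y)

def joint(x1, x2, y):
    """p(X1 = x1, X2 = x2, Y = y) under conditional independence."""
    px1 = p_x1_true[y] if x1 else 1 - p_x1_true[y]
    px2 = p_x2_true[y] if x2 else 1 - p_x2_true[y]
    return p_y[y] * px1 * px2

def nb_predict(x1, x2):
    """Naive Bayes decision rule using both attributes."""
    return joint(x1, x2, True) >= joint(x1, x2, False)

# Expected error: total probability mass of misclassified configurations.
error = sum(joint(x1, x2, y)
            for x1 in (True, False)
            for x2 in (True, False)
            for y in (True, False)
            if nb_predict(x1, x2) != y)
print(round(error, 3))  # 0.235
```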
PART II: PROGRAMMING EXERCISES
1 Challenge: Generalizing to Unseen Data (50 pts)
One of the most difficult aspects of machine learning is that your classifier must generalize well to unseen
data. In this exercise, we are supplying you with labeled training data and unlabeled test data. Specifically,
you will not have access to the labels for the test data, which we will use to grade your assignment. You
will fit the best model that you can to the given data and then use that model to predict labels for the test
data. It is these predicted labels that you will submit, and we will grade your submission based on your test
accuracy (relative to the best performance you should be able to obtain). Each instance belongs to one of
nine classes, named '1' through '9'. We will not provide any further information on the data set.
You will submit two sets of predictions: one based on a boosted decision tree classifier (which you will
write), and another based on whatever machine learning method you like; you are free to choose any
classification method. We will compute your test accuracy based on your predicted labels for the
test data and the true test labels. Note also that we will not be providing any feedback on your predictions
or your test accuracy when you submit your assignment, so you must do your best without feedback on your
test performance.
Relevant Files in the Homework Skeleton¹
• boostedDT.py
• test_boostedDT.py
• data/challengeTrainLabeled.dat: labeled training data for the challenge
• data/challengeTestUnlabeled.dat: unlabeled testing data for the challenge
1.1 The Boosted Decision Tree Classifier
In class, we mentioned that boosted decision trees have been shown to be one of the best “out-of-the-box”
classifiers. (That is, if you know nothing about the data set and can’t do parameter tuning, they will likely
work quite well.) Boosting allows the decision trees to represent a much more complex decision surface than
a single decision tree.
Write a class that implements a boosted decision tree classifier. Your implementation may rely on the
decision tree classifier already provided in scikit-learn (sklearn.tree.DecisionTreeClassifier), but you
must implement the boosting process yourself. (scikit-learn actually provides boosting as a
meta-classifier, but you may not use it in your implementation.) Each decision tree in the ensemble should
be limited to a maximum depth as specified in the BoostedDT constructor. You can configure the maximum
depth of the tree via the max_depth argument to the DecisionTreeClassifier constructor.
Your class must implement the following API:
• __init__(numBoostingIters=100, maxTreeDepth=3): the constructor, which takes in the number of
boosting iterations (default value: 100) and the maximum depth of the member decision trees (default: 3)
• fit(X,y): train the classifier from labeled data (X, y)
• predict(X): return an array of n predictions for each of n rows of X
¹Bold text indicates files that you will need to complete; you should not need to modify any of the other files.
Note that these methods have already been defined correctly for you in boostedDT.py; be very careful not
to change the API. You should configure your boosted decision tree classifier to be the best “out-of-the-box”
classifier you can; you may not modify the constructor to take in additional parameters (e.g., to configure
the individual decision trees).
There is one additional change you need to make to AdaBoost beyond the algorithm described in class.
AdaBoost by default only works with binary classes, but in this case, we have a multi-class classification
problem. One variant of AdaBoost, called AdaBoost-SAMME, easily adapts AdaBoost to multiple classes.
Instead of using the AdaBoost equation

    β_t = (1/2) ln((1 − ε_t) / ε_t),

where ε_t is the weighted training error at iteration t, you should use the AdaBoost-SAMME equation

    β_t = (1/2) (ln((1 − ε_t) / ε_t) + ln(K − 1)),
where K is the total number of classes. This will force β_t ≥ 0 as long as the classifier is better than random
guessing (with K classes, random guessing attains accuracy 1/K, so the weighted error rate must be less than
1 − 1/K). Note that when K = 2, AdaBoost-SAMME reduces to AdaBoost. For further information on
SAMME, see https://web.stanford.edu/~hastie/Papers/samme.pdf.
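As an illustrative sketch only (not the required BoostedDT class, whose API is fixed above), the SAMME training loop might look like the following; the ln(K − 1) term in beta is the only change from binary AdaBoost. The synthetic-data demo at the end is purely hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def samme_fit(X, y, num_iters=100, max_depth=3):
    """Sketch of SAMME boosting: returns the trees and their beta weights."""
    n = len(y)
    K = len(np.unique(y))
    w = np.full(n, 1.0 / n)                 # uniform instance weights
    trees, betas = [], []
    for _ in range(num_iters):
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(X, y, sample_weight=w)
        miss = tree.predict(X) != y
        eps = float(np.dot(w, miss))        # weighted training error
        if eps <= 0:                        # perfect fit: keep tree, stop
            trees.append(tree)
            betas.append(1.0)
            break
        beta = 0.5 * (np.log((1 - eps) / eps) + np.log(K - 1))
        w = w * np.exp(beta * miss)         # upweight misclassified points
        w = w / w.sum()                     # renormalize to a distribution
        trees.append(tree)
        betas.append(beta)
    return trees, betas

def samme_predict(trees, betas, X, classes):
    """Weighted vote over the ensemble."""
    votes = np.zeros((len(X), len(classes)))
    for tree, beta in zip(trees, betas):
        pred = tree.predict(X)
        for k, c in enumerate(classes):
            votes[:, k] += beta * (pred == c)
    return classes[np.argmax(votes, axis=1)]

# Hypothetical demo on synthetic, well-separated 3-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0, 3, 6)])
y = np.repeat([0, 1, 2], 30)
trees, betas = samme_fit(X, y, num_iters=10, max_depth=2)
acc = (samme_predict(trees, betas, X, np.unique(y)) == y).mean()
```

Note the design choice of renormalizing the weights to a distribution after every round, which keeps the weighted error ε_t in [0, 1].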
Test your implementation by running test_boostedDT.py, which compares your BoostedDT model to a
regular decision tree on the iris data with a 50:50 training/testing split. You should see that your BoostedDT
model is able to obtain ∼97.3% accuracy vs. the 96% accuracy of a regular decision tree. Make certain that
your implementation works correctly before moving on to the next part.
Once your boosted decision tree is working, train your BoostedDT on the labeled data available in the file
data/challengeTrainLabeled.dat. The class labels are specified in the last column of data. You may
tune the number of boosting iterations and maximum tree depth however you like. Then, use the trained
BoostedDT classifier to predict a label y ∈ {1, . . . , 9} for each unlabeled instance in data/challengeTestUnlabeled.dat.
Your implementation should output a comma-separated list of predicted labels, such as
1, 2, 1, 9, 4, 1, 3, 1, 5, 3, 4, 2, 8, 3, 1, 6, 3, ...
Be very careful not to shuffle the instances in data/challengeTestUnlabeled.dat; the first predicted label
should correspond to the first unlabeled instance in the testing data. The number of predictions should
match the number of unlabeled test instances.
Copy and paste this comma-separated list into the README file to submit your predictions for grading.
Also, record the expected accuracy of your model in the README file. Finally, save the comma-separated
list into a text file named predictions-BoostedDT.dat; this file should have exactly one line of
text that contains the list of predictions.
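For example, writing such a file might look like the following sketch (the labels shown are hypothetical placeholders, not real predictions):

```python
# Hypothetical predicted labels; in practice these come from your
# trained BoostedDT's predict(...) on the unlabeled test data.
preds = [1, 2, 1, 9, 4]

# One line of comma-separated labels, in the same order as the test instances.
with open("predictions-BoostedDT.dat", "w") as f:
    f.write(", ".join(str(p) for p in preds))
```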
1.2 Training the Best Classifier
Now, train the very best classifier for the challenge data, and use that classifier to output a second vector
of predictions for the test instances. You may use any machine learning algorithm you like, and may tune
it any way you wish. You may use the methods and helper functions built into scikit-learn; you do not need
to implement the method yourself, but may if you wish. If you don't want to use scikit-learn, you may
use any other machine learning software you wish. If you can think of a way that the unlabeled data in
data/challengeTestUnlabeled.dat would be useful during the training process, you are welcome to let
your classifier have access to it during training.
Note that you will not be submitting an implementation of your optimal model, just its predictions.
Once again, use your trained model to output a comma-separated list of predicted labels for the unlabeled
instances in data/challengeTestUnlabeled.dat. Again, be careful not to shuffle the test instances; the
order of the predictions must match the order of the test instances.
Copy and paste this comma-separated list into the README file to submit your predictions for grading.
Also, record the expected accuracy of your model in the README file. Finally, save the comma-separated
list into a text file named predictions-BestClassifier.dat; this file should have exactly one
line of text that contains the list of predictions.