CS 4641: Machine Learning Assignment 3 solution

PART I: PROBLEM SET
Your solutions to the problems will be submitted as a single PDF document. Be certain that your problems
are well-numbered and that it is clear what your answers are. Additionally, you will be required to duplicate
your answers to particular problems in the README file that you will submit.
1 Probability decision boundary (10pts)
Consider a case where we have learned a conditional probability distribution P(y | x). Suppose there are
only two classes, and let p0 = P(y = 0 | x) and let p1 = P(y = 1 | x). A loss matrix gives the cost that
is incurred for each element of the confusion matrix. (E.g., true positives might cost nothing, but a false
positive might cost us $10.) Consider the following loss matrix:
                     y = 0 (true)    y = 1 (true)
ŷ = 0 (predicted)         0              10
ŷ = 1 (predicted)         5               0
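The expected-loss comparison behind this loss matrix can be prototyped numerically. The sketch below is illustrative only (the helper name is made up, not part of the assignment skeleton): for a given p1 = P(y = 1 | x), it picks the prediction with the smaller expected loss.

```python
def min_expected_loss_decision(p1, loss):
    """Return the prediction (0 or 1) that minimizes expected loss.

    loss[yhat][y] is the cost of predicting yhat when the true class is y;
    p1 = P(y = 1 | x), so P(y = 0 | x) = 1 - p1.
    """
    expected = {
        yhat: loss[yhat][0] * (1.0 - p1) + loss[yhat][1] * p1
        for yhat in (0, 1)
    }
    return min(expected, key=expected.get)

# The loss matrix from the problem: a false negative costs 10, a false positive costs 5.
loss = {0: {0: 0.0, 1: 10.0}, 1: {0: 5.0, 1: 0.0}}

# Sweeping p1 from 0 to 1 shows a single point at which the decision flips,
# which is exactly the threshold θ that part (a) asks you to derive analytically.
for p1 in (0.1, 0.3, 0.5, 0.9):
    print(p1, min_expected_loss_decision(p1, loss))
```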
(a) Show that the decision ŷ that minimizes the expected loss is equivalent to setting a probability threshold θ and predicting ŷ = 0 if p1 < θ and ŷ = 1 if p1 ≥ θ.

(b) What is the threshold for this loss matrix?

2 Double counting the evidence (15pts)
Consider a problem in which the binary class label Y ∈ {T, F} and each training example x has 2 binary attributes X1, X2 ∈ {T, F}. Let the class prior be p(Y = T) = 0.5, and let p(X1 = T | Y = T) = 0.8 and p(X2 = T | Y = T) = 0.5. Likewise, p(X1 = F | Y = F) = 0.7 and p(X2 = F | Y = F) = 0.9. Attribute X1 provides slightly stronger evidence about the class label than X2. Assume that X1 and X2 are truly independent given Y. Write down the naive Bayes decision rule.

(a) What is the expected error rate of naive Bayes if it uses only attribute X1? What if it uses only X2? The expected error rate is the probability that each class generates an observation where the decision rule is incorrect. If Y is the true class label, let Ŷ(X1, X2) be the predicted class label. Then the expected error rate is p(Ŷ(X1, X2) ≠ Y), i.e., the sum of p(X1, X2, Y) over all configurations for which Ŷ(X1, X2) ≠ Y.

(b) Show that if naive Bayes uses both attributes, X1 and X2, the error rate is 0.235, which is better than using only a single attribute (X1 or X2).

(c) Now suppose that we create a new attribute X3 that is an exact copy of X2. So for every training example, attributes X2 and X3 have the same value. What is the expected error of naive Bayes now?

(d) Briefly explain what is happening with naive Bayes (2 sentences max).

(e) Does logistic regression suffer from the same problem? Briefly explain why (2 sentences max).

PART II: PROGRAMMING EXERCISES
1 Challenge: Generalizing to Unseen Data (50 pts)
One of the most difficult aspects of machine learning is that your classifier must generalize well to unseen data. In this exercise, we are supplying you with labeled training data and unlabeled test data. Specifically, you will not have access to the labels for the test data, which we will use to grade your assignment.
You will fit the best model that you can to the given data and then use that model to predict labels for the test data. It is these predicted labels that you will submit, and we will grade your submission based on your test accuracy (relative to the best performance you should be able to obtain). Each instance belongs to one of nine classes, named '1' ... '9'. We will not provide any further information on the data set.

You will submit two sets of predictions: one based on a boosted decision tree classifier (which you will write), and another based on whatever machine learning method you like; you are free to choose any classification method. We will compute your test accuracy based on your predicted labels for the test data and the true test labels. Note also that we will not be providing any feedback on your predictions or your test accuracy when you submit your assignment, so you must do your best without feedback on your test performance.

Relevant Files in the Homework Skeleton¹
• boostedDT.py
• test_boostedDT.py
• data/challengeTrainLabeled.dat: labeled training data for the challenge
• data/challengeTestUnlabeled.dat: unlabeled testing data for the challenge

1.1 The Boosted Decision Tree Classifier
In class, we mentioned that boosted decision trees have been shown to be one of the best "out-of-the-box" classifiers. (That is, if you know nothing about the data set and can't do parameter tuning, they will likely work quite well.) Boosting allows the decision trees to represent a much more complex decision surface than a single decision tree.

Write a class that implements a boosted decision tree classifier. Your implementation may rely on the decision tree classifier already provided in scikit-learn (sklearn.tree.DecisionTreeClassifier), but you must implement the boosting process yourself. (scikit-learn actually provides boosting as a meta-classifier, but you may not use it in your implementation.)
Each decision tree in the ensemble should be limited to a maximum depth, as specified in the BoostedDT constructor. You can configure the maximum depth of the tree via the max_depth argument to the DecisionTreeClassifier constructor. Your class must implement the following API:
• __init__(numBoostingIters = 100, maxTreeDepth = 3): the constructor, which takes in the number of boosting iterations (default value: 100) and the maximum depth of the member decision trees (default: 3)
• fit(X, y): train the classifier from the labeled data (X, y)
• predict(X): return an array of n predictions, one for each of the n rows of X

¹Bold text indicates files that you will need to complete; you should not need to modify any of the other files.

Note that these methods have already been defined correctly for you in boostedDT.py; be very careful not to change the API. You should configure your boosted decision tree classifier to be the best "out-of-the-box" classifier you can; you may not modify the constructor to take in additional parameters (e.g., to configure the individual decision trees).

There is one additional change you need to make to AdaBoost beyond the algorithm described in class. AdaBoost by default only works with binary classes, but in this case we have a multi-class classification problem. One variant of AdaBoost, called AdaBoost-SAMME, easily adapts AdaBoost to multiple classes. Instead of using the AdaBoost equation

    β_t = (1/2) ln((1 − ε_t) / ε_t) ,

you should use the AdaBoost-SAMME equation

    β_t = (1/2) [ln((1 − ε_t) / ε_t) + ln(K − 1)] ,

where K is the total number of classes and ε_t is the weighted training error at iteration t. This will force β_t ≥ 0 as long as the classifier is better than random guessing (random guessing achieves accuracy 1/K, so the error rate would need to be at most 1 − 1/K). Note that when K = 2, AdaBoost-SAMME reduces to AdaBoost. For further information on SAMME, see http://web.stanford.edu/~hastie/Papers/samme.pdf.
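The SAMME weight formula can be illustrated in a few lines. This is a sketch with made-up helper names (it is not part of boostedDT.py), and the reweighting step shows one common formulation of the AdaBoost update, in which misclassified examples are up-weighted and the weights renormalized:

```python
import math

def beta_adaboost(err):
    """Standard binary AdaBoost weight: (1/2) ln((1 - err) / err)."""
    return 0.5 * math.log((1.0 - err) / err)

def beta_samme(err, K):
    """AdaBoost-SAMME weight: (1/2) [ln((1 - err) / err) + ln(K - 1)]."""
    return 0.5 * (math.log((1.0 - err) / err) + math.log(K - 1))

# With K = 2, ln(K - 1) = 0, so SAMME reduces to standard AdaBoost.
assert math.isclose(beta_samme(0.3, 2), beta_adaboost(0.3))

# beta_t stays nonnegative as long as the weak learner beats random guessing;
# at exactly err = 1 - 1/K, beta_t = 0.
K = 9
assert math.isclose(beta_samme(1.0 - 1.0 / K, K), 0.0, abs_tol=1e-12)

def reweight(weights, correct, beta):
    """One round's reweighting: multiply each misclassified example's weight
    by e^beta, then renormalize so the weights sum to 1."""
    w = [wi * (1.0 if c else math.exp(beta)) for wi, c in zip(weights, correct)]
    z = sum(w)
    return [wi / z for wi in w]
```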
Test your implementation by running test_boostedDT.py, which compares your BoostedDT model to a regular decision tree on the iris data with a 50:50 training/testing split. You should see that your BoostedDT model is able to obtain ∼97.3% accuracy, versus the 96% accuracy of a regular decision tree. Make certain that your implementation works correctly before moving on to the next part.

Once your boosted decision tree is working, train your BoostedDT on the labeled data available in the file data/challengeTrainLabeled.dat. The class labels are specified in the last column of the data. You may tune the number of boosting iterations and maximum tree depth however you like. Then, use the trained BoostedDT classifier to predict a label y ∈ {1, . . . , 9} for each unlabeled instance in data/challengeTestUnlabeled.dat. Your implementation should output a comma-separated list of predicted labels, such as

    1, 2, 1, 9, 4, 1, 3, 1, 5, 3, 4, 2, 8, 3, 1, 6, 3, ...

Be very careful not to shuffle the instances in data/challengeTestUnlabeled.dat; the first predicted label should correspond to the first unlabeled instance in the testing data, and the number of predictions should match the number of unlabeled test instances.

Copy and paste this comma-separated list into the README file to submit your predictions for grading, and also record the expected accuracy of your model in the README file. Finally, save the comma-separated list into a text file named predictions-BoostedDT.dat; this file should have exactly one line of text that contains the list of predictions.

1.2 Training the Best Classifier
Now, train the very best classifier that you can for the challenge data, and use that classifier to output a second vector of predictions for the test instances. You may use any machine learning algorithm you like, and may tune it any way you wish. You may use the methods and helper functions built into scikit-learn; you do not need to implement the method yourself, but may if you wish.
If you don't want to use scikit-learn, you may use any other machine learning software you wish. If you can think of a way that the unlabeled data in data/challengeTestUnlabeled.dat would be useful during the training process, you are welcome to let your classifier have access to it during training. Note that you will not be submitting an implementation of your optimal model, just its predictions.

Once again, use your trained model to output a comma-separated list of predicted labels for the unlabeled instances in data/challengeTestUnlabeled.dat. Again, be careful not to shuffle the test instances; the order of the predictions must match the order of the test instances.

Copy and paste this comma-separated list into the README file to submit your predictions for grading, and also record the expected accuracy of your model in the README file. Finally, save the comma-separated list into a text file named predictions-BestClassifier.dat; this file should have exactly one line of text that contains the list of predictions.
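The single-line output format described above can be produced with a short sketch like the following. The prediction values shown are placeholders (in practice they would come from your trained model, in the same order as the test instances), and the ", " delimiter follows the example given in the handout:

```python
def format_predictions(preds):
    """Join predicted labels into the required one-line comma-separated list."""
    return ", ".join(str(p) for p in preds)

# Placeholder predictions, NOT real model output.
preds = [1, 2, 1, 9, 4]

# Write exactly one line of text containing the full list of predictions.
with open("predictions-BestClassifier.dat", "w") as f:
    f.write(format_predictions(preds) + "\n")
```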