Description
PROBLEM 1 [50 points]
Using each dataset, build a decision tree (entropy-based criterion for Spambase) or a regression tree (variance/MSE criterion for the Housing prices data) from the training set. Since the features are numeric
values, you will need a threshold mechanism for splits (a sketch of one threshold-search approach follows the list below). Report (txt or pdf file), for each dataset, the training and testing error for each of your trials:
a simple decision tree using Information Gain or another entropy-like measure of randomness
a regression tree
try to limit the size of the tree so that training and testing errors are comparable (avoid the overfitting typical of deep trees)
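One possible way to implement the threshold mechanism for a single numeric feature is sketched below (the function names entropy and best_threshold_split are illustrative, not required): candidate thresholds are midpoints between consecutive sorted feature values, and the split with the largest information gain is kept. For the regression tree, replace entropy with variance (MSE) reduction.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label vector y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold_split(x, y):
    """Return (threshold, information_gain) for one numeric feature column x."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    base = entropy(y_sorted)
    best_gain, best_t = 0.0, None
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue                      # no threshold fits between equal values
        t = (x_sorted[i] + x_sorted[i - 1]) / 2.0
        left, right = y_sorted[:i], y_sorted[i:]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y_sorted)
        gain = base - cond                # information gain of splitting at threshold t
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```

The tree builder would call this per feature, pick the best (feature, threshold) pair, and recurse on the two resulting subsets, stopping at a depth or node-size limit to control overfitting.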
PROBLEM 2 [25 points, GR ONLY]
DHS, Chapter 8.
Consider a binary decision tree using entropy splits (splits have 2 branches, labels have K classes).
A) Prove that the decrease in entropy from a split on a binary yes/no feature can never be greater than 1 bit. HINT: use the mutual information formula I(X;Y) = H(Y) − H(Y|X) and its
symmetry property (formalized in a short sketch after part B).
B) [Optional, no credit] Generalize this result to the case of arbitrary branching B>1.
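One way to formalize the hint for part A, with X the binary split feature and Y the K-class label (an outline only; this notation is an assumption, not taken from the text):

```latex
\Delta H \;=\; H(Y) - H(Y \mid X) \;=\; I(X;Y)
       \;=\; H(X) - H(X \mid Y) \;\le\; H(X) \;\le\; \log_2 2 \;=\; 1 \text{ bit}.
```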
PROBLEM 3 [50p] Gradient Boosted Trees for Regression
Run gradient boosting with regression trees on the housing dataset. Essentially, repeat the following procedure for rounds i = 1:10 on the training set. Start in round i = 1 with the labels Y_x equal to the original
labels.
Train a regression tree T_i of depth 2 on data X, labels Y_x (use your HW1 regression tree code as a procedure, or use a library).
Update the label for each datapoint x: Y_x = Y_x − T_i(x). There is no need to update a distribution over datapoints as in AdaBoost.
Repeat
The overall prediction function is the sum of the trees. Report training and testing error as a function of T = number of trees in the ensemble.
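A minimal sketch of this loop, assuming scikit-learn's DecisionTreeRegressor stands in for the HW1 regression tree (the variable names X_train, y_train, X_test, y_test are placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_regression(X_train, y_train, X_test, y_test, n_rounds=10):
    residual = y_train.astype(float)           # round i=1 uses the original labels
    train_pred = np.zeros(len(y_train))
    test_pred = np.zeros(len(y_test))
    for i in range(n_rounds):
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X_train, residual)             # fit T_i to the current residuals
        residual -= tree.predict(X_train)       # Y_x = Y_x - T_i(x)
        train_pred += tree.predict(X_train)     # overall prediction = sum of trees
        test_pred += tree.predict(X_test)
        train_mse = np.mean((train_pred - y_train) ** 2)
        test_mse = np.mean((test_pred - y_test) ** 2)
        print(f"T={i+1:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```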
PROBLEM 4 [50p] Gradient Boosted Trees for classification
Run gradient boosting with regression stumps/trees on the 20Newsgroup dataset. Alternatively, you can use the 20Newsgroup_8classes dataset with extracted features (the zip file is called
8newsgroup.zip because the 20 labels have been grouped into 8 classes to make the problem easier), or you can download 20NewsGroup from scikit-learn.
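If you take the scikit-learn route, a hedged sketch might look like the following; the TF-IDF setup and the parameter choices (max_features=2000, n_estimators=100) are illustrative only, not prescribed by the assignment.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

vec = TfidfVectorizer(max_features=2000, stop_words="english")
X_train = vec.fit_transform(train.data).toarray()   # fit the vocabulary on train only
X_test = vec.transform(test.data).toarray()

clf = GradientBoostingClassifier(max_depth=1, n_estimators=100)  # stumps as base learners
clf.fit(X_train, train.target)
print("train accuracy:", clf.score(X_train, train.target))
print("test accuracy:", clf.score(X_test, test.target))
```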
PROBLEM 5 PCA features [50 points]
Spambase polluted dataset. A) Train and test Naive Bayes. Why the dramatic decrease in performance? Expected accuracy with Gaussian fits: 62%.
B) Run PCA first on the dataset in order to reduce dimensionality to about 100 features. You can use a PCA package or library of your choice.
Then train/test Naive Bayes on the PCA features. Explain the performance improvement. (To make it work, you have to apply PCA consistently to training and testing points: either fit PCA on the
training set and store the transformation to apply it later to each test point, or apply PCA once to the entire dataset.)
Expected Accuracy on Naive Bayes with Gaussian Fits, running on PCA features: 73%.
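For part B, a minimal sketch of the "apply PCA consistently" point, assuming scikit-learn's PCA and GaussianNB (the argument names are placeholders for the polluted Spambase split):

```python
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

def pca_then_naive_bayes(X_train, y_train, X_test, y_test, k=100):
    pca = PCA(n_components=k)
    Z_train = pca.fit_transform(X_train)   # learn the PCA transformation on training data
    Z_test = pca.transform(X_test)         # reuse the same transformation on test data
    nb = GaussianNB().fit(Z_train, y_train)
    return nb.score(Z_train, y_train), nb.score(Z_test, y_test)
```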
C) Implement your own PCA, and rerun Naive Bayes on obtained features.
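For part C, a hedged sketch of a hand-rolled PCA via eigendecomposition of the covariance matrix (the names pca_fit and pca_transform are illustrative, not required):

```python
import numpy as np

def pca_fit(X, k):
    """Return (mean, projection matrix) for the top-k principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: the covariance matrix is symmetric
    top = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    return mu, eigvecs[:, top]

def pca_transform(X, mu, W):
    return (X - mu) @ W                      # project onto the principal components

# usage sketch: mu, W = pca_fit(X_train, 100); Z_train = pca_transform(X_train, mu, W)
```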
D) [Optional, no credit] Run LDA instead of PCA before you train Naive Bayes. You can use an LDA package or library of your choice.
PROBLEM 6 [30 points]
Given a ranking of binary items by prediction score, the ROC is the curve of the true-positive rate vs. the false-positive rate over all possible thresholds. The AUC is the area under the ROC curve.
Prove that the AUC is also the fraction of positive/negative item pairs (i, j) ranked in the correct order.
As an illustrative example, the classifier output might be:
-------------------------------
object    score    true label
-------------------------------
A         100      1
B          99      1
C          96      0
D          95      1
E          90      1
F          85      0
G          82      1
H          60      0
K          40      0
I          38      0
-------------------------------
The ROC curve is obtained by truncating the list above at every rank and, for each such threshold, computing the false-positive rate and true-positive rate (and plotting them).
The problem asks you to show that the area under the ROC curve is approximated by the percentage of pairs in correct order. In this example the item pairs in incorrect order are (C,D), (C,E),
(C,G), (F,G); out of the 5 × 5 = 25 positive/negative pairs, 21 are in correct order, so the pairwise fraction is 21/25 = 0.84.
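A small check of the claim on this example; scikit-learn's roc_auc_score is used only as a reference value, and the hand-counted pair fraction matches it:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

scores = np.array([100, 99, 96, 95, 90, 85, 82, 60, 40, 38])  # objects A..K, I
labels = np.array([1,   1,  0,  1,  1,  0,  1,  0,  0,  0])

pos = scores[labels == 1]
neg = scores[labels == 0]
correct_pairs = sum(p > n for p in pos for n in neg)           # pairs in correct order
pair_fraction = correct_pairs / (len(pos) * len(neg))

print("fraction of pairs in correct order:", pair_fraction)      # 21/25 = 0.84
print("AUC from the ROC curve:", roc_auc_score(labels, scores))  # also 0.84
```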



