Description
1. (15 points) Entropy for Decision Tree Learning
When deciding which split (a feature together with a category or threshold) to use for a node while learning a decision tree, we aim to maximize label purity in the child nodes. One way to achieve this goal is to maximize the divergence of the child label distributions from the uniform label distribution.
(a) Show that maximizing the divergence of the label distribution in the child nodes from the uniform distribution is equivalent to minimizing entropy. HINT: use the KL divergence (Kullback-Leibler divergence) to quantify the distance between two discrete probability distributions; the relevant definitions are restated after part (c).
(b) In the lecture we maximized information gain to decide on the splits. Show that maximizing information gain is equivalent to minimizing entropy. HINT: you can assume
binary splits.
(c) What is the time complexity to find the best split using entropy assuming binary features? What is the time complexity assuming continuous features?
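For reference, the standard definitions used in this problem (written in our own notation, which may differ slightly from the lecture's) are, for a discrete label distribution $q = (q_1, \ldots, q_k)$ over $k$ classes and the uniform distribution $u$ with $u_c = \frac{1}{k}$:
\[
H(q) = -\sum_{c=1}^{k} q_c \log q_c,
\qquad
\mathrm{KL}(q \,\|\, u) = \sum_{c=1}^{k} q_c \log \frac{q_c}{u_c},
\qquad
\mathrm{IG} = H(\text{parent}) - \sum_{j \in \text{children}} \frac{n_j}{n}\, H(\text{child } j),
\]
where $n_j$ is the number of training examples that reach child $j$ and $n$ is the number that reach the parent.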
2. (35 points) Efficient Decision Tree Learning
You are building a regression tree (CART), and your recursive function is called with data $D = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)\}$, where the attributes are continuous, $\vec{x}_i \in \mathbb{R}^d$, and the labels are continuous, $y_i \in \mathbb{R}$. For a leaf, let the prediction be the average predictor $p \leftarrow \bar{y} = \frac{1}{|S|} \sum_{(\vec{x}_i, y_i) \in S} y_i$, where $S$ is the set of all data points in that leaf. We use the label variance as the impurity measure.
(a) Show that $p$ minimizes the average squared loss $L(D) = \frac{1}{|D|} \sum_{(\vec{x}_j, y_j) \in D} (y_j - p)^2$. Explain why this must be the case.
(b) If the termination conditions are not met (i.e., you do not create a leaf), you need to split. For this, you need to find the best split for any feature $f$. Assume we have sorted our data set according to some feature $f$, such that $f_1 \le f_2 \le \cdots \le f_n$, where $f_i$ is the value of feature $f$ in example $\vec{x}_i$. Let all relevant split points be $c_1 \le c_2 \le \cdots \le c_{n-1}$, for example $c_i = \frac{f_i + f_{i+1}}{2}$.
i. Define the predictors $\bar{y}^{\,i}_L$ and $\bar{y}^{\,i}_R$ of the left and right sub-trees after cutting at $c_i$ (left: $f \le c_i$) in terms of $y_1, \ldots, y_n$.
ii. Write down the expressions for the losses $L^i_L$ and $L^i_R$ on the left and right sub-trees after cutting at $c_i$. What is the time complexity of computing both terms for any cut point $c_i$?
iii. Express $\bar{y}^{\,i+1}_L$ and $\bar{y}^{\,i+1}_R$ in terms of $\bar{y}^{\,i}_L$, $\bar{y}^{\,i}_R$, and $y_{i+1}$.
iv. Let $s^i = \sum_{j=1}^{i} y_j^2$ and $r^i = \sum_{j=i+1}^{n} y_j^2$. Express $L^{i+1}_L$ and $L^{i+1}_R$ in terms of $y_{i+1}$, $\bar{y}^{\,i}_L$, $\bar{y}^{\,i}_R$, $s^i$, and $r^i$.
v. Write a full update rule for iteration $i+1$, assuming you have $\bar{y}^{\,i}_L$, $\bar{y}^{\,i}_R$, $s^i$, $r^i$, $L^i_L$, and $L^i_R$. (A MATLAB sketch of this incremental scan is given after this problem.)
vi. What is the time complexity to find the best split using this method, assuming your data is already sorted? What is the time complexity of sorting $D$ along all features? Compare the time complexity of building an entire tree with the approach developed in this problem to that of building an entire tree using entropy as in problem 1(c).
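A minimal MATLAB sketch of the incremental scan from parts i.-v., assuming the labels have already been sorted by the feature values so that f(1) <= ... <= f(n); the function and variable names (bestsplit_sketch, sumL, sqL, ...) are our own and not part of the provided stubs:

% Incremental search for the best cut point along one (already sorted) feature.
% y(1..n): labels sorted by feature value f(1) <= ... <= f(n).
% Running sums turn every cut point into an O(1) update, so the scan is O(n).
function [bestCut, bestLoss] = bestsplit_sketch(f, y)
    n = numel(y);
    sumL = 0;          sumR = sum(y);       % running sums of labels
    sqL  = 0;          sqR  = sum(y.^2);    % running sums of squared labels (s^i, r^i)
    bestLoss = inf;    bestCut = NaN;
    for i = 1:n-1
        % move example i from the right child into the left child
        sumL = sumL + y(i);     sumR = sumR - y(i);
        sqL  = sqL  + y(i)^2;   sqR  = sqR  - y(i)^2;
        nL = i;   nR = n - i;
        % sum-of-squared-errors impurity of each child: sum(y.^2) - (sum(y))^2 / size
        lossL = sqL - sumL^2 / nL;
        lossR = sqR - sumR^2 / nR;
        if f(i) < f(i+1) && lossL + lossR < bestLoss    % skip ties in the feature value
            bestLoss = lossL + lossR;
            bestCut  = (f(i) + f(i+1)) / 2;             % c_i = (f_i + f_{i+1}) / 2
        end
    end
end

Note that lossL and lossR here are un-normalized (size-weighted) versions of the average losses $L^i_L$ and $L^i_R$; minimizing their sum is equivalent to minimizing the size-weighted average of the two child losses, so adapt the criterion to whatever you derive in parts iv. and v.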
3. (50 points) Implementation: Bagged and Boosted Decision Trees
In this assignment you will implement a decision tree algorithm and then use it for bagging
and boosting. Update your SVN repository. All of the following stub files are in the hw6
folder:
Stubs:
id3tree.m Returns a decision tree using the minimum entropy splitting rule
(implemented for you in entropysplit.m)
evaltree.m Evaluates a decision tree on test data.
prunetree.m Prunes a tree using a validation set.
forest.m Builds a forest of id3trees.
evalforest.m Learns and evaluates a forest.
boosttree.m Applies adaboost to your id3tree.
evalboost.m Learns and evaluates a boosted classifier.
Already implemented:
entropysplit.m This function implements minimum entropy splitting
using information gain/entropy.
cart_tictoc.m This function computes train and test accuracy and runtime for
all your algorithms (uses analyze.m).
cart_visual.m This function visualizes trees (uses vistree.m and scat.m).
Datasets: (download from Piazza resources)
iris.mat Subset of the Iris flower dataset
(https://en.wikipedia.org/wiki/Iris_flower_data_set)
heart.mat Another dataset for binary classification.
As MATLAB does not support pointers and is really only good with matrices, we represent a tree by a matrix T. A tree T with q nodes must have dimensions 6 × q, where each column represents one node of the tree. The very first column is the root node, and each column stores the following information (a small example encoding is shown right after this list):
(1) prediction at this node
(2) index of feature to cut
(3) cutoff value c (<= c : left, and > c : right)
(4) index of left subtree (0 = leaf)
(5) index of right subtree (0 = leaf)
(6) parent (0 = root)
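To make the encoding concrete, here is one way a depth-one tree (a single split on feature 2 at cutoff 0.5 with two leaves) could be stored; the particular prediction values (0.7, 0.2, 0.9) are made up for this illustration:

% Column 1 is the root: it predicts 0.7, cuts feature 2 at 0.5, and its children
% are columns 2 (left) and 3 (right). Columns 2 and 3 are leaves: their left/right
% indices are 0 and their parent is column 1.
T = [ 0.7   0.2   0.9 ;    % (1) prediction at this node
      2     0     0   ;    % (2) index of feature to cut
      0.5   0     0   ;    % (3) cutoff value c
      2     0     0   ;    % (4) index of left subtree (0 = leaf)
      3     0     0   ;    % (5) index of right subtree (0 = leaf)
      0     1     1   ];   % (6) parent (0 = root)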
entropysplit.m searches through all features and returns the best split (feature and cut-threshold combination) based on information gain (i.e., entropy). You do not need to implement this; it is already done for you in entropysplit.m. Please take a minute to familiarize yourself with the implementation; it will make the subsequent tasks easier.
(a) Implement the function id3tree.m, which returns a decision tree based on the minimum entropy splitting rule. The function takes training data, test data, a maximum depth, and the weight of each training example (used in entropy splitting). You can visualize your tree with cart_visual.m. Maximum depth and weight are optional arguments; if they are not provided, you should set the maximum depth to infinity and weight each example equally. (Hint: to speed up your code, make sure you initialize the matrix T with the command T = zeros(6, q) with some estimate of q.) You should use the function entropysplit.m.
(b) Implement the function evaltree.m, which evaluates a decision tree on a given test data set. You can test the accuracy of your implementation with cart_tictoc.m.
(c) Decision trees (without a depth limit) are high-variance classifiers. Consequently, overfitting is a serious problem. One remedy is to prune your tree to stop making "silly" splits that overfit the training set. You prune a tree bottom-up, by removing leaves (i.e., collapsing splits) that do not reduce the validation error (reduced-error pruning). Implement the function prunetree.m, which returns a decision tree pruned using a validation data set. The accuracy on the validation set should be the same or higher after pruning.
(d) A more effective way to prevent overfitting is to use bagging. Implement the function forest.m, which builds a forest (not a random forest) of ID3 decision trees. Each tree should be built on training data drawn by randomly sampling n examples from the training set with replacement; you can use MATLAB's randsample function to do this (a sampling sketch is given after part (g)). Do not randomly sample features.
(e) Implement the function evalforest.m, which evaluates a forest on a test data set. Remember that bagged forests make predictions by majority vote.
(f) Another option to improve your decision trees is to build trees of small depth (e.g., only depth 3 or depth 4). These do not have high variance, but instead suffer from high bias. You can reduce the bias of a classifier with boosting. Implement the function boosttree.m, which applies AdaBoost to your id3tree function.
(g) Implement the function evalboost.m, which evaluates an ensemble of boosted decision trees.
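For part (d), a minimal sketch of drawing one bootstrap sample with randsample; the variable names (xTr, yTr, xBoot, yBoot) and the assumption that examples are stored as columns are ours, so adjust the indexing to match the stubs:

% Draw n training examples uniformly at random *with replacement*
% (one bootstrap sample). Assumes xTr is d x n (one example per column)
% and yTr is 1 x n; randsample is part of the Statistics Toolbox.
n     = size(xTr, 2);
idx   = randsample(n, n, true);   % true => sample with replacement
xBoot = xTr(:, idx);
yBoot = yTr(idx);
% Train one id3tree on (xBoot, yBoot); repeat for every tree in the forest.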
Once you have implemented all functions, you can run cart_tictoc.m, which computes training and test error as well as runtime for all four algorithms. Try different datasets (you may also use the ones from the previous homeworks^1) and play around with the input parameters of your forest and AdaBoost methods.
Submit your implementation by committing all your functions and the partners.txt file
to the HW6 folder in your SVN repository.
^1 If you want to use digits or faces, you will need to adapt your boosting algorithm for multi-class classification. (Not required for the homework.)