Description
1 Preparations
In this lab you will use a set of predefined Python functions to build and
manipulate decision trees. In order to run this, you need to have Python
installed. We also use the Qt graphics library, PyQt, for plotting. PyQt
comes in different versions 4 or 5. For Mac users version 5 is recommended.
For different Windows versions consult the Internet. For Mac use Homebrew to install Python and PyQt5. For Ubuntu Python and PyQt are available in the debian package repository. Python and PyQt are also installed
on the Unix computers in the computer halls.
Note: It is possible to do the lab without using the plotting functions
found in PyQt, but then you will not be able to see the generated decision
trees.
2 MONK datasets
This lab uses the artificial MONK dataset from the UC Irvine repository.
The MONK’s problems are a collection of three binary classification problems MONK-1, MONK-2 and MONK-3 over a six-attribute discrete domain.
The attributes a1, a2, a3, a4, a5, a6 may take the following values:
a1 ∈ {1, 2, 3} a2 ∈ {1, 2, 3} a3 ∈ {1, 2}
a4 ∈ {1, 2, 3} a5 ∈ {1, 2, 3, 4} a6 ∈ {1, 2}
Consequently, there are 432 possible combinations of attribute values. The
true concepts underlying each MONK’s problem are given by table 1.
Assignment 0: Each one of the datasets has properties which makes
them hard to learn. Motivate which of the three problems is most
difficult for a decision tree algorithm to learn.
The data consists of three separate datasets MONK-1, MONK-2 and
MONK-3. Each dataset is further divided into a training and test set,
where the first one is used for learning the decision tree, and the second
one to evaluate its classification accuracy (see table 2). The datasets are
available in the file monkdata.py. In particular, six variables are defined
which contain the datasets: monk1, monk1test, monk2, monk2test, monk3
Table 1: True concepts behind the MONK datasets
MONK-1 (a1 = a2) ∨ (a5 = 1)
MONK-2 ai = 1 for exacly two i ∈ {1, 2, . . . , 6}
MONK-3 (a5 = 1 ∧ a4 = 1) ∨ (a5 6= 4 ∧ a2 6= 3)
MONK-3 has 5% additional noise (misclassification) in the training set.
Table 2: Characteristics of the three MONK datasets
Name # train # test # attributes # classes
MONK-1 124 432 6 2
MONK-2 169 432 6 2
MONK-3 122 432 6 2
and monk3test. Each dataset is a sequence (more precisely, a tuple) of
instances of the class Sample, defined in the same file.
You can access the data in your own Python scripts by importing the
monkdata.py file as a module like this:
import monkdata as m
This makes the variable m a shorthand for the module so that you can
access the datasets by writing m.monk1, etc.
3 Entropy
In order to decide on which attribute to split, decision tree learning algorithms such as ID3 and C4.5 use a statistical property called information
gain. It measures how well a particular attribute distinguishes among different target classifications. Information gain is measured in terms of the
expected reduction in the entropy or impurity of the data. The entropy of
an arbitrary collection of examples is measured by
Entropy(S) = −
X
i
pi
log2 pi (1)
in which pi denotes the proportion of examples of class i in S. The monk
dataset is a binary classification problem (class 0 or 1) and therefore equation
2
(1) simplifies to
Entropy(S) = −p0 log2 p0 − p1 log2 p1 (2)
where p0 and p1 = 1 − p0 are the proportions of examples belonging to class
0 and 1.
Assignment 1: The file dtree.py defines a function entropy which
calculates the entropy of a dataset. Import this file along with the
monks datasets and use it to calculate the entropy of the training
datasets.
Dataset Entropy
MONK-1
MONK-2
MONK-3
Assignment 2: Explain entropy for a uniform distribution and a
non-uniform distribution, present some example distributions with
high and low entropy.
4 Information Gain
The information gain measures the expected reduction in impurity caused by
partitioning the examples according to an attribute. It thereby indicates the
effectiveness of an attribute in classifying the training data. The information
gain of an attribute A, relative to a collection of examples S is defined as
Gain(S, A) = Entropy(S) −
X
k∈values(A)
|Sk|
|S|
Entropy(Sk) (3)
where Sk is the subset of examples in S for the attribute A has the value k.
Assignment 3: Use the function averageGain (defined in dtree.py)
to calculate the expected information gain corresponding to each of
the six attributes. Note that the attributes are represented as instances of the class Attribute (defined in monkdata.py) which you
can access via m.attributes[0], …, m.attributes[5]. Based on
the results, which attribute should be used for splitting the examples
at the root node?
Information Gain
Dataset a1 a2 a3 a4 a5 a6
MONK-1
MONK-2
MONK-3
Assignment 4: For splitting we choose the attribute that maximizes
the information gain, Eq.3. Looking at Eq.3 how does the entropy of
the subsets, Sk, look like when the information gain is maximized?
How can we motivate using the information gain as a heuristic for
picking an attribute for splitting? Think about reduction in entropy
after the split and what the entropy implies.
5 Building Decision Trees
Split the monk1 data into subsets according to the selected attribute using
the function select (again, defined in dtree.py) and compute the information gains for the nodes on the next level of the tree. Which attributes
should be tested for these nodes?
For the monk1 data draw the decision tree up to the first two levels and
assign the majority class of the subsets that resulted from the two splits
to the leaf nodes. You can use the predefined function mostCommon (in
dtree.py) to obtain the majority class for a dataset.
Now compare your results with that of a predefined routine for ID3. Use
the function buildTree(data, m.attributes) to build the decision tree.
If you pass a third, optional, parameter to buildTree, you can limit the
depth of the generated tree.
You can use print to print the resulting tree in text form, or use the
function drawTree from the file drawtree_qt4.py or drawtree_qt5.py,
depending on your PyQt version, to draw a graphical representation.
Assignment 5: Build the full decision trees for all three Monk
datasets using buildTree. Then, use the function check to measure the performance of the decision tree on both the training and
test datasets.
For example to built a tree for monk1 and compute the performance
on the test data you could use
import monkdata as m
import dtree as d
t=d.buildTree(m.monk1, m.attributes);
print(d.check(t, m.monk1test))
Compute the train and test set errors for the three Monk datasets
for the full trees. Were your assumptions about the datasets correct?
Explain the results you get for the training and test datasets.
Etrain Etest
MONK-1
MONK-2
MONK-3
6 Pruning
The idea of reduced error pruning is to consider each node in the tree as a
candidate for removal. A node is removed if the resulting pruned tree performs at least as well as the original tree over a separate validation dataset,
i.e. a dataset not used during training. When a node is removed, the subtree rooted at that node is replaced by a leaf node, to which the majority
classification of examples in that node is assigned.
For the purpose of pruning, we have to split our original training data
into one training set for building the tree and one validation set for pruning.
Notice, that using the test set for validation would be cheating because we
would then no longer be able to use the test set for independently estimating the true error of our pruned decision tree. Instead, we will randomly
partition the original training set into training and validation set. This can
be done by defining a function which randomly reorders the data samples
and returns the first and second parts separately:
import random
def partition(data, fraction):
ldata = list(data)
random.shuffle(ldata)
breakPoint = int(len(ldata) * fraction)
return ldata[:breakPoint], ldata[breakPoint:]
monk1train, monk1val = partition(m.monk1, 0.6)
In the file dtree.py there is a utility function allPruned which returns
a sequence of all possible ways a given tree can be pruned.
Write code which performs the complete pruning by repeatedly calling
allPruned and picking the tree which gives the best classification performance on the validation dataset. You should stop pruning when all the
pruned trees perform worse than the current candidate.
Assignment 6: Explain pruning from a bias variance trade-off perspective.
Assignment 7: Evaluate the effect pruning has on the test error for
the monk1 and monk3 datasets, in particular determine the optimal
partition into training and pruning by optimizing the parameter
fraction. Plot the classification error on the test sets as a function
of the parameter fraction ∈ {0.3, 0.4, 0.5, 0.6, 0.7, 0.8}.
Note that the split of the data is random. We therefore need to
compute the statistics over several runs of the split to be able to draw
any conclusions. Reasonable statistics includes mean and a measure
of the spread. Do remember to print axes labels, legends and data
points as you will not pass without them.