Description
This homework will use the data sets from the file spam.mat, which can be found on Canvas or the
course webpage. Loading this file into MATLAB will generate two matrices: train_spam, the training
set, with 3601 labeled examples, and test_spam, the test set, with 1000 labeled examples.
Each example represents an email with 57 features and a binary label indicating whether the email is spam
(label = 1) or not (label = 0). A description of the data can be found at
https://archive.ics.uci.edu/ml/datasets/Spambase with detailed descriptions of the features at
https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names. The rows of the
data set found there have been randomly permuted to generate the train and test sets.
Write a MATLAB function
function tree = hw2_train_DT(train_data, depth)
which trains a complete binary decision tree on the training set train_data. The parameter depth
is a positive integer indicating the depth of the tree: for example, depth = 1 is a tree with a root
node and 2 child nodes that are leaves, and depth = 2 is a tree with a root node, 2 child nodes of
the root node, and 2 leaf child nodes of each of those, for a total of 4 leaves. The output tree
is any convenient MATLAB data structure representing the decision tree.
I found MATLAB cells particularly convenient to use in my implementation.
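For example, one possible (purely hypothetical) cell convention stores an internal node as a cell {feature, threshold, left subtree, right subtree} and a leaf as a scalar predicted label:

    % Hypothetical convention: internal node = cell, leaf = scalar label.
    leaf0 = 0;                          % leaf predicting label 0
    leaf1 = 1;                          % leaf predicting label 1
    node  = {52, 0.5, leaf0, leaf1};    % split on (made-up) feature 52 at t = 0.5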
Construct this tree using the recursive algorithm described in class, maximizing the uncertainty
reduction at each step. Specifically, at each non-leaf node of the tree, find the feature and threshold
that maximize the reduction in uncertainty obtained by splitting the training set corresponding to that
node with that feature and threshold. Structure your MATLAB code in a generic manner so that any of the
three uncertainty notions discussed in class can be plugged in as necessary.
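As a sketch, assuming the three notions covered in class are entropy, the Gini index, and misclassification error, the uncertainty measures can be written as interchangeable function handles of p, the fraction of 1-labeled examples at a node:

    % Sketch: uncertainty measures as function handles (assumes the three
    % notions from class are entropy, Gini index, and misclassification error).
    entropy_u  = @(p) -p .* log2(max(p, eps)) - (1-p) .* log2(max(1-p, eps));
    gini_u     = @(p) 2 .* p .* (1 - p);
    misclass_u = @(p) min(p, 1 - p);
    % Reduction for splitting n examples into nL left / nR right children:
    %   u(p) - (nL/n) * u(pL) - (nR/n) * u(pR)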
Please use the following conventions to construct the tree:
• For any feature, only choose thresholds among the values of the feature in the training set.
• For any feature xi and threshold t, let the left child correspond to the condition xi ≤ t, and
the right child correspond to the condition xi > t.
• In case there are two features xi and xj that both achieve the maximum reduction in uncertainty, break the tie by choosing the lower indexed one (i.e. i if i < j) as the feature to split
on.
• If at any leaf the examples with 0 and 1 labels are equal in number, break the tie in favor of
the 0 label.
A few tips for writing your code: since the construction of the tree is recursive, it's convenient
to have hw2_train_DT call itself recursively with decreasing depth values. Also, at any node, for
any given feature, a quick way of finding the best threshold is to sort the training data in ascending
order of the feature values and compute the reduction in uncertainty for thresholds equal to the feature
values in the sorted list.
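For instance, a minimal sketch of this threshold search for one feature (with a hypothetical helper name, assuming the 0/1 label is the last column of data and u is an uncertainty handle as above) might look like:

    % Sketch of the per-feature threshold search (hypothetical helper;
    % assumes the last column of data holds the 0/1 label).
    function [best_t, best_red] = best_threshold(data, i, u)
        x = data(:, i);  y = data(:, end);  n = numel(y);
        base = u(mean(y));                       % uncertainty before the split
        [xs, order] = sort(x);  ys = y(order);
        best_t = xs(1);  best_red = -inf;
        for k = 1:n-1
            if xs(k) == xs(k+1), continue; end   % equal values: not a valid cut
            % threshold t = xs(k): examples 1..k go left (x <= t), rest go right
            red = base - (k/n) * u(mean(ys(1:k))) ...
                       - ((n-k)/n) * u(mean(ys(k+1:n)));
            if red > best_red
                best_red = red;  best_t = xs(k);
            end
        end
    end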
Write another function which computes predictions for test data given the tree computed by
the hw2_train_DT function, and thereby computes the error rate made on the test data, with the following signature:
function loss = hw2_test_DT(tree, test_data)
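Under the hypothetical cell convention sketched earlier, a minimal version of this function might be:

    % Sketch of hw2_test_DT under the hypothetical cell convention above:
    % internal node = {feature, threshold, left, right}, leaf = scalar label.
    function loss = hw2_test_DT(tree, test_data)
        n = size(test_data, 1);
        preds = zeros(n, 1);
        for j = 1:n
            node = tree;
            while iscell(node)                    % descend until a leaf
                if test_data(j, node{1}) <= node{2}
                    node = node{3};               % xi <= t: go left
                else
                    node = node{4};               % xi > t: go right
                end
            end
            preds(j) = node;
        end
        loss = mean(preds ~= test_data(:, end));  % error rate
    end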
Part 1 of assignment: Train a decision tree using the function hw2_train_DT with train_data
= train_spam, for each value of depth ∈ {1, 3, 6} and each of the three notions of uncertainty
covered in class. For every value of depth and each uncertainty notion, report the training error
by using hw2_test_DT on test_data = train_spam, as well as the test error by using hw2_test_DT
on test_data = test_spam.
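A driver for this part might look like the following sketch (it assumes hw2_train_DT accepts the uncertainty handle as a third argument, an extension the handout leaves up to you):

    % Sketch of a Part 1 driver (the third argument to hw2_train_DT is an
    % assumed extension for plugging in the uncertainty notion).
    load spam.mat                               % provides train_spam, test_spam
    notions = {entropy_u, gini_u, misclass_u};  % handles from the earlier sketch
    for depth = [1 3 6]
        for c = 1:numel(notions)
            tree = hw2_train_DT(train_spam, depth, notions{c});
            fprintf('depth %d, notion %d: train %.4f, test %.4f\n', depth, c, ...
                hw2_test_DT(tree, train_spam), hw2_test_DT(tree, test_spam));
        end
    end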
Next, write a function which computes a linear classifier using the online perceptron algorithm
with the following signature:
function weights = hw2_train_perceptron(train_data, passes)
The parameter passes is a positive integer specifying the number of passes to make over the
training data. In your implementation, use the trick of adding a constant extra feature with
value 1, so that the online perceptron algorithm, which trains a homogeneous linear classifier, can
output a linear classifier with a non-zero threshold value. Write another function which computes
predictions for test data given the weights computed by the hw2_train_perceptron function, and
thereby computes the error rate made on the test data, with the following signature:
function loss = hw2_test_perceptron(weights, test_data)
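A minimal sketch of both functions, assuming the 0/1 label is the last column of the data and is mapped to ±1 internally, might be:

    % Sketch of the online perceptron with the constant-feature trick
    % (assumes the 0/1 label is the last column and is mapped to ±1).
    function weights = hw2_train_perceptron(train_data, passes)
        X = [train_data(:, 1:end-1), ones(size(train_data, 1), 1)];  % append 1
        y = 2 * train_data(:, end) - 1;          % map {0,1} -> {-1,+1}
        weights = zeros(size(X, 2), 1);
        for p = 1:passes
            for j = 1:size(X, 1)                 % one online pass, in order
                if y(j) * (X(j, :) * weights) <= 0
                    weights = weights + y(j) * X(j, :)';   % mistake: update
                end
            end
        end
    end

    function loss = hw2_test_perceptron(weights, test_data)
        X = [test_data(:, 1:end-1), ones(size(test_data, 1), 1)];
        preds = double(X * weights > 0);         % predict 1 iff w'x > 0
        loss = mean(preds ~= test_data(:, end)); % error rate
    end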
Part 2 of assignment: Train a linear classifier using the function hw2_train_perceptron with
train_data = train_spam, for each value of passes ∈ {1, 10, 50, 100}. For every value of passes,
report the training error by using hw2_test_perceptron on test_data = train_spam, as well as
the test error by using hw2_test_perceptron on test_data = test_spam.
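A corresponding driver sketch for this part:

    % Sketch of a Part 2 driver.
    load spam.mat
    for passes = [1 10 50 100]
        w = hw2_train_perceptron(train_spam, passes);
        fprintf('passes %d: train %.4f, test %.4f\n', passes, ...
            hw2_test_perceptron(w, train_spam), hw2_test_perceptron(w, test_spam));
    end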