Description
Machine learning offers a number of methods for classifying data into discrete categories, such as k-means clustering. Decision trees provide a structure for such categorization, based on a series of decisions that led to separate distinct outcomes. In this assignment, you will work with decision trees to perform binary classification according to some decision boundary. Your challenge is to build and to train decision trees capable of solving useful classification problems. You will learn first how to build decision trees, then how to effectively train them and finally how to test their performance.
This assignment is due on T-Square as decision_trees.py on March 19th midnight AOE (anywhere on Earth).
Abstract
You will build, train and test decision tree models to perform basic classification tasks.
Motivation
Classification is used widely in machine learning to figure out how to sort new data that comes through.
Objectives
Students should understand how decision trees and random forests work. Students should develop and intuition for how and why accuracy differs for training and testing data based on different parameters.
Evaluation
Evaluation is using the last submission on Bonnie.
Datasets
We have provided you with two datasets:
part23_data.csv: 4 features, 1372 data points, binary classification (last column)
challenge_train.csv: 30 features, 6636 datapoints, binary classification (first column)
challenge_test.csv: not provided, similar to challenge_train but with 40% of the dataset
Imports
We are only allowing three imports: numpy, collections.Counter and time. We will be checking to see if any other libraries are used. You are not allowed to use any outside libraries especially for part 4 (challenge).
Introduction
For this assignment we’re going to need an explicit way to make structured decisions. The following is DecisionNode
, representing a decision node as some atomic choice in a binary decision graph. It can represent a class label (i.e. a final decision) or a binary decision to guide the us through a flow-chart to arrive at a decision. Note that in this representation ‘True’ values for a decision take us to the left. This is arbitrary but matters for what comes next.
import numpy as np
from collections import Counter
import time
class DecisionNode():
"""Class to represent a single node in
a decision tree."""
def __init__(self, left, right, decision_function,class_label=None):
"""Create a node with a left child, right child,
decision function and optional class label
for leaf nodes."""
self.left = left
self.right = right
self.decision_function = decision_function
self.class_label = class_label
def decide(self, feature):
"""Return on a label if node is leaf,
or pass the decision down to the node's
left/right child (depending on decision
function)."""
if self.class_label is not None:
return self.class_label
elif self.decision_function(feature):
return self.left.decide(feature)
else:
return self.right.decide(feature)
Part 1a: Building a Binary Tree by Hand
5 pts.
In build_decision_tree()
, construct a tree of decision nodes by hand in order to classify the data below, i.e. map each datum xx to a label yy. Select tests to be as small as possible (in terms of attributes), breaking ties among tests with the same number of attributes by selecting the one that classifies the greatest number of examples correctly. If multiple tests have the same number of attributes and classify the same number of examples, then break the tie using attributes with lower index numbers (e.g. select A1A1 over A2A2)
Datum | A1A1 | A2A2 | A3A3 | A4A4 | y |
---|---|---|---|---|---|
x1x1 | 1 | 0 | 0 | 0 | 1 |
x2x2 | 1 | 0 | 1 | 1 | 1 |
x3x3 | 0 | 1 | 0 | 0 | 1 |
x4x4 | 0 | 1 | 1 | 0 | 0 |
x5x5 | 1 | 1 | 0 | 1 | 1 |
x6x6 | 0 | 1 | 0 | 1 | 0 |
x7x7 | 0 | 0 | 1 | 1 | 1 |
x8x8 | 0 | 0 | 1 | 0 | 0 |
Hints: To get started, it might help to draw out the tree by hand with each attribute representing a node.
To create the decision function to pass to your DecisionNode
, you can create a lambda expression as follows:
func = lambda feature : feature[2] == 0
in which we would choose the left node if the third attribute is 0.
For example, if your tree looked like this: if A1=0 then class = 1; else class = 0
A1
/ \
1 0
You would write your code like this:
decision_tree_root= DecisionNode(None, None, lambda a1: a1 == 0)
decision_tree_root.left = DecisionNode(None, None, None, 1)
decision_tree_root.right = DecisionNode(None, None, None, 0)
return decision_tree_root
Requirements: The tree nodes should be less than 10 nodes including the leaf (not the number of instances, but the actual nodes in the tree).
examples = [[1,0,0,0],
[1,0,1,1],
[0,1,0,0],
[0,1,1,0],
[1,1,0,1],
[0,1,0,1],
[0,0,1,1],
[0,0,1,0]]
classes = [1,1,1,0,1,0,1,0]
def build_decision_tree():
"""Create decision tree
capable of handling the provided
data."""
# TODO: build full tree from root
decision_tree_root = None
return decision_tree_root
decision_tree_root = build_decision_tree()
Part 1b: Precision, Recall, Accuracy and Confusion Matrix
12 pts.
Now that we have a decision tree, we’re going to need some way to evaluate its performance. In most cases we’d reserve a portion of the training data for evaluation, or use cross-validation. For now let’s just see how your tree does on the provided examples. In the stubbed out code below, fill out the methods to compute the confusion matrix, accuracy, precision and recall for your classifier output. classifier_output
is just the list of labels that your classifier outputs, corresponding to the same examples as true_labels
. You can refer to Wikipedia for calculating the true/false positive/negative.
You should get 1.0 (float) for precision, recall and accuracy.
You can create a simple example for the confusion matrix for testing purposes and count by hand.
def confusion_matrix(classifier_output, true_labels):
#TODO output should be [[true_positive, false_negative], [false_positive, true_negative]]
#TODO output is a list
raise NotImplemented()
def precision(classifier_output, true_labels):
#TODO precision is measured as: true_positive/ (true_positive + false_positive)
raise NotImplemented()
def recall(classifier_output, true_labels):
#TODO: recall is measured as: true_positive/ (true_positive + false_negative)
raise NotImplemented()
def accuracy(classifier_output, true_labels):
#TODO accuracy is measured as: correct_classifications / total_number_examples
raise NotImplemented()
classifier_output = [decision_tree_root.decide(example) for example in examples]
p1_confusion_matrix = confusion_matrix(classifier_output, classes)
p1_accuracy = accuracy( classifier_output, classes )
p1_precision = precision(classifier_output, classes)
p1_recall = recall(classifier_output, classes)
print p1_confusion_matrix, p1_accuracy, p1_precision, p1_recall
def entropy(class_vector):
"""Compute the entropy for a list
of classes (given as either 0 or 1)."""
# TODO: finish this
raise NotImplemented()
def information_gain(previous_classes, current_classes ):
"""Compute the information gain between the
previous and current classes (a list of
lists where each list has 0 and 1 values)."""
# TODO: finish this
raise NotImplemented()
def test_information_gain():
""" Assumes information_gain() accepts (classes, [list of subclasses])
Feel free to edit / enhance this note with more tests """
restaurants = [0]*6 + [1]*6
split_patrons = [[0,0], [1,1,1,1], [1,1,0,0,0,0]]
split_food_type = [[0,1],[0,1],[0,0,1,1],[0,0,1,1]]
# If you're using numpy indexing add the following before calling information_gain()
# split_patrons = [np.array(i) for i in split_patrons] #convert to np array
# split_food_type = [np.array(i) for i in split_food_type]
gain_patrons = information_gain(restaurants, split_patrons)
gain_type = information_gain(restaurants, split_food_type)
assert round(gain_patrons,3) == 0.541, "Information Gain on patrons should be 0.541"
assert gain_type == 0.0, "Information gain on type should be 0.0"
print "Information Gain calculations correct..."
assert (information_gain([1,1,1,0,0,0],[[1,1,1],[0,0,0]])==1),"TEST FAILED"
assert (round(information_gain([1,1,1,0,0,0],[[1,1,0],[1,0,0]]),2)==0.08),"TEST FAILED"
def test_entropy():
assert (entropy([1,1,1,0,0,0])==1),"TEST FAILED"
assert (entropy([1,1,1,1,1,1])==0),"TEST FAILED"
assert (int(entropy([1,1,0,0,0,0])*100)==91),"TEST FAILED"
test_information_gain()
test_entropy()
Part 2b: Decision Tree Learning
20 pts.
- File to use: part23_data.csv
- Grading: average test accuracy over 10 rounds should be >= 70%
As the size of our training set grows, it rapidly becomes impractical to build these trees by hand. We need a procedure to automagically construct these trees.
For starters, let’s consider the following algorithm (a variation of C4.5) for the construction of a decision tree from a given set of examples:
1) Check for base cases:
a) If all elements of a list are of the same class, return a leaf node with the appropriate class label.
b) If a specified depth limit is reached, return a leaf labeled with the most frequent class.
2) For each attribute alpha: evaluate the normalized information gain gained by splitting on attribute $\alpha$
3) Let alpha_best be the attribute with the highest normalized information gain
4) Create a decision node that splits on alpha_best
5) Recur on the sublists obtained by splitting on alpha_best, and add those nodes as children of node
First, in the DecisionTree.__build_tree__()
method implement the above algorithm.
Next, in DecisionTree.classify()
below, write a function to produce classifications for a list of features once your decision tree has been built.
Some other helpful notes:
1) Your features and classify should be in numpy arrays where if the dataset was (m x n) then the features is (m x n-1) and classify is (m x 1).
2) These features are continuous features and you will need to split based on a threshold.
How grading works:
1) We load part23_data.csv and create our cross-validation training and test set with a k=10 folds.
2) We classify the training data onto the three then fit the testing data onto the tree.
3) We check the accuracy of your results versus the true results and we return the average of this over 10 iterations.
class DecisionTree():
"""Class for automatic tree-building
and classification."""
def __init__(self, depth_limit=float('inf')):
"""Create a decision tree with an empty root
and the specified depth limit."""
self.root = None
self.depth_limit = depth_limit
def fit(self, features, classes):
"""Build the tree from root using __build_tree__()."""
self.root = self.__build_tree__(features, classes)
def __build_tree__(self, features, classes, depth=0):
"""Implement the above algorithm to build
the decision tree using the given features and
classes to build the decision functions."""
#TODO: finish this
raise NotImplemented()
def classify(self, features):
"""Use the fitted tree to
classify a list of examples.
Return a list of class labels."""
class_labels = []
#TODO: finish this
raise NotImplemented()
return class_labels
Part 2c: Validation
10 pts.
- File to use: part23_data.csv
- Grading: average test accuracy over 10 rounds should be >= 70%
In general, reserving part of your data as a test set can lead to unpredictable performance- a serendipitous choice of your train or test split could give you a very inaccurate idea of how your classifier performs. That’s where k-fold cross validation comes in.
In generate_k_folds()
, we’ll split the dataset at random into k equal subsections. Then iterating on each of our k samples, we’ll reserve that sample for testing and use the other k-1 for training. Averaging the results of each fold should give us a more consistent idea of how the classifier is doing across the data as a whole.
How grading works: The same as 2b however, we use your generate_k_folds instead of ours.
def load_csv(data_file_path, class_index):
handle = open(data_file_path, 'r')
contents = handle.read()
handle.close()
rows = contents.split('\n')
out = np.array([[float(i) for i in r.split(',')] for r in rows if r])
if(class_index == -1):
# this is used for part23_data
classes= map(int, out[:,class_index])
features = out[:,:class_index]
return features, classes
elif(class_index == 0):
# this is used for challenge_train
classes= map(int, out[:, class_index])
features = out[:, 1:]
return features, classes
else:
# this is used for vectorize
return out
def generate_k_folds(dataset, k):
#TODO this method should return a list of folds,
# where each fold is a tuple like (training_set, test_set)
# where each set is a tuple like (examples, classes)
raise NotImplemented()
dataset = load_csv('part23_data.csv', -1)
ten_folds = generate_k_folds(dataset, 10)
accuracies = []
precisions = []
recalls = []
confusion = []
for fold in ten_folds:
train, test = fold
train_features, train_classes = train
test_features, test_classes = test
tree = DecisionTree( )
tree.fit( train_features, train_classes)
output = tree.classify(test_features)
accuracies.append( accuracy(output, test_classes))
precisions.append( precision(output, test_classes))
recalls.append( recall(output, test_classes))
confusion.append( confusion_matrix(output, test_classes))
Part 3: Random Forests
30 pts.
- File to use: part23_data.csv
- Grading: average test accuracy over 10 rounds should be >= 75%
The decision boundaries drawn by decision trees are very sharp, and fitting a decision tree of unbounded depth to a list of training examples almost inevitably leads to overfitting. In an attempt to decrease the variance of our classifier we’re going to use a technique called ‘Bootstrap Aggregating’ (often abbreviated ‘bagging’).
A Random Forest is a collection of decision trees, build as follows:
1) For every tree we're going to build:
a) Subsample the examples provided us (with replacement) in accordance with a provided example subsampling rate.
b) From the sample in a), choose attributes at random to learn on (in accordance with a provided attribute subsampling rate)
c) Fit a decision tree to the subsample of data we've chosen (to a certain depth)
Classification for a random forest is then done by taking a majority vote of the classifications yielded by each tree in the forest after it classifies an example.
Fill in RandomForest.fit()
to fit the decision tree as we describe above, and fill in RandomForest.classify()
to classify a given list of examples.
Your features and classify should be in numpy arrays where if the dataset was (m x n) then the features is (m x n-1) and classify is (n x 1).
To test, we will be using a forest with 5 trees, with a depth limit of 5, example subsample rate of 0.5 and attribute subsample rate of 0.5
How grading works: similar to 2b but with the call to Random Forest.
class RandomForest():
"""Class for random forest
classification."""
def __init__(self, num_trees, depth_limit, example_subsample_rate, attr_subsample_rate):
"""Create a random forest with a fixed
number of trees, depth limit, example
sub-sample rate and attribute sub-sample
rate."""
self.trees = []
self.num_trees = num_trees
self.depth_limit = depth_limit
self.example_subsample_rate = example_subsample_rate
self.attr_subsample_rate = attr_subsample_rate
def fit(self, features, classes):
"""Build a random forest of
decision trees."""
# TODO implement the above algorithm
raise NotImplemented()
def classify(self, features):
"""Classify a list of features based
on the trained random forest."""
# TODO implement classification for a random forest.
raise NotImplemented()
Part 4: Challenge Classifier
10 points.
- File to use: challenge_train.csv
- Grading: average training accuracy over 10 runs should be >= 80% and average ru accuracy over 10 runs should be >= 70%
You should be implementing some sort of a tree structure, students in the past have been able to call their RandomForest with different parameters. We also encourage things like boosting.
You’ve been provided with a sample of data from a research dataset in ‘challenge_train.csv’ while we have reserved a part of the dataset for testing called challenge_test.csv (which you do not have access to).
To get full points for this part of the assignment, you’ll need to get at least an average accuracy of 80% on the training data you have (challenge_train.csv), and at least an average accuracy of 70% on the holdout/test set (challenge_test.csv). We do provide how long it takes for your training and testing to run.
class ChallengeClassifier():
def __init__(self):
# initialize whatever parameters you may need here-
# this method will be called without parameters
# so if you add any to make parameter sweeps easier, provide defaults
raise NotImplemented()
def fit(self, features, classes):
# fit your model to the provided features
raise NotImplemented()
def classify(self, features):
# classify each feature in features as either 0 or 1.
raise NotImplemented()
Part 5: Vectorization!
7 points.
File to use: vectorize.csv
Last semester, students struggled a lot with assignment 5 not because of the assignment but of the vectorization requirement so that it can run under the time limit we have. As a result, we are adding a small section to this assignment that will hopefully introduce you to vectorization and some of the cool tricks you can use in python. We encourage you to use any numpy function out there (on good faith) to do the following functions.
For the three functions that we have, we are testing your code based on how fast it runs. It will need to beat the non-vectorized code to get full points.
As a reminder, please don’t ask the TA’s for help regarding this section, we will not be able to assist you in any way. This section was created to help get you ready for assignment_5; feel free to ask other students on Piazza or use the Internet.
How grading works: we run the non-vectorized code and your vectorized code 500 times, as long as the average time of your vectorized code is less than the average time of the non-vectorized code, you will get the points (given that your answer is correct).
import time
import resource
import numpy as np
class Vectorization():
def load_csv(self,data_file_path, class_index):
handle = open(data_file_path, 'r')
contents = handle.read()
handle.close()
rows = contents.split('\n')
out = np.array([[float(i) for i in r.split(',')] for r in rows if r])
if(class_index == -1):
classes= map(int, out[:,class_index])
features = out[:,:class_index]
return features, classes
elif(class_index == 0):
classes= map(int, out[:, class_index])
features = out[:, 1:]
return features, classes
else:
return out
# Vectorization #1: Loops!
# This function takes one matrix, multiplies by itself and then adds to itself.
# Output: return a numpy array
# 1 point
def non_vectorized_loops(self, data):
non_vectorized = np.zeros(data.shape)
for row in range(data.shape[0]):
for col in range(data.shape[1]):
non_vectorized[row][col] = data[row][col] * data[row][col] + data[row][col]
return non_vectorized
def vectorized_loops(self, data):
# TODO vectorize the code from above
# Bonnie time to beat: 0.09 seconds
raise NotImplemented()
def vectorize_1(self):
data = self.load_csv('vectorize.csv', 1)
start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
real_answer = self.non_vectorized_loops(data)
end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
print 'Non-vectorized code took %s seconds' % str(end_time-start_time)
start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
my_answer = self.vectorized_loops(data)
end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
print 'Vectorized code took %s seconds' % str(end_time-start_time)
assert np.array_equal(real_answer, my_answer), "TEST FAILED"
# Vectorization #2: Slicing and summation
# This function searches through the first 100 rows, looking for the row with the max sum
# (ie, add all the values in that row together)
# Output: return the max sum as well as the row number for the max sum
# 3 points
def non_vectorized_slice(self, data):
max_sum = 0
max_sum_index = 0
for row in range(100):
temp_sum = 0
for col in range(data.shape[1]):
temp_sum += data[row][col]
if (temp_sum > max_sum):
max_sum = temp_sum
max_sum_index = row
return max_sum, max_sum_index
def vectorized_slice(self, data):
# TODO vectorize the code from above
# Bonnie time to beat: 0.07 seconds
raise NotImplemented()
def vectorize_2(self):
data = self.load_csv('vectorize.csv', 1)
start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
real_sum, real_sum_index = self.non_vectorized_slice(data)
end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
print 'Non-vectorized code took %s seconds' % str(end_time-start_time)
start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
my_sum, my_sum_index = self.vectorized_slice(data)
end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
print 'Vectorized code took %s seconds' % str(end_time-start_time)
assert (real_sum==my_sum),"TEST FAILED"
assert (real_sum_index==my_sum_index), "TEST FAILED"
# Vectorization #3: Flattening and dictionaries
# This function flattens down data into a 1d array, creates a dictionary of how often a
# positive number appears in the data and displays that value
# Output: list of tuples [(1203,3)] = 1203 appeared 3 times in data
# 3 points
def non_vectorized_flatten(self, data):
unique_dict = {}
flattened = np.hstack(data)
for item in range(len(flattened)):
if flattened[item] > 0:
if flattened[item] in unique_dict:
unique_dict[flattened[item]] += 1
else:
unique_dict[flattened[item]] = 1
return unique_dict.items()
def vectorized_flatten(self, data):
# TODO vectorize the code from above
# Bonnie time to beat: 15 seconds
raise NotImplemented()
def vectorize_3(self):
data = self.load_csv('vectorize.csv', 1)
start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
answer_unique = self.non_vectorized_flatten(data)
end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
print 'Non-vectorized code took %s seconds'% str(end_time-start_time)
start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
my_unique = self.vectorized_flatten(data)
end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
print 'Vectorized code took %s seconds'% str(end_time-start_time)
assert np.array_equal(answer_unique, my_unique), "TEST FAILED"
vectorize = Vectorization()
vectorize.vectorize_1()
vectorize.vectorize_2()
vectorize.vectorize_3()
Bonus
Note: this part will be changing. Official annoucements for this bonus will be made through Piazza
We will be having a competition using your challenge classifier and a dataset of our choice. We will provide you with a portion of the dataset as well as the testing data (but without the labels) and you will upload your solution as a csv to Kaggle. Kaggle will evaluate your scores and the classifier with the highest accuracy will win the competiton. Any ties will be broken by the submission time.
We are still figuring out all the details for this bonus so hopefully it will be out by the time the midterm period is over. We will keep the competition available for at least a few weeks.
- First place: 3 bonus points on your final grade
- Second place: 2 bonus points on your final grade
- Third place: 1 bonus point on your final grade