Description

5/5 - (1 vote)

Machine learning offers a number of methods for classifying data into discrete categories, such as k-means clustering. Decision trees provide a structure for such categorization, based on a series of decisions that led to separate distinct outcomes. In this assignment, you will work with decision trees to perform binary classification according to some decision boundary. Your challenge is to build and to train decision trees capable of solving useful classification problems. You will learn first how to build decision trees, then how to effectively train them and finally how to test their performance.

This assignment is due on T-Square as decision_trees.py on March 19th midnight AOE (anywhere on Earth).

Abstract

You will build, train and test decision tree models to perform basic classification tasks.

Motivation

Classification is used widely in machine learning to figure out how to sort new data that comes through.

Objectives

Students should understand how decision trees and random forests work. Students should develop and intuition for how and why accuracy differs for training and testing data based on different parameters.

Evaluation

Evaluation is using the last submission on Bonnie.

Datasets

We have provided you with two datasets:

part23_data.csv: 4 features, 1372 data points, binary classification (last column) 

challenge_train.csv: 30 features, 6636 datapoints, binary classification (first column)

challenge_test.csv: not provided, similar to challenge_train but with 40% of the dataset

Imports

We are only allowing three imports: numpy, collections.Counter and time. We will be checking to see if any other libraries are used. You are not allowed to use any outside libraries especially for part 4 (challenge).

Introduction

For this assignment we’re going to need an explicit way to make structured decisions. The following is DecisionNode, representing a decision node as some atomic choice in a binary decision graph. It can represent a class label (i.e. a final decision) or a binary decision to guide the us through a flow-chart to arrive at a decision. Note that in this representation ‘True’ values for a decision take us to the left. This is arbitrary but matters for what comes next.

In [ ]:

import numpy as np
from collections import Counter
import time

class DecisionNode():
    """Class to represent a single node in
    a decision tree."""

    def __init__(self, left, right, decision_function,class_label=None):
        """Create a node with a left child, right child,
        decision function and optional class label
        for leaf nodes."""
        self.left = left
        self.right = right
        self.decision_function = decision_function
        self.class_label = class_label

    def decide(self, feature):
        """Return on a label if node is leaf,
        or pass the decision down to the node's
        left/right child (depending on decision
        function)."""
        if self.class_label is not None:
            return self.class_label
        elif self.decision_function(feature):
            return self.left.decide(feature)
        else:
            return self.right.decide(feature)

Part 1a: Building a Binary Tree by Hand

5 pts.

In build_decision_tree(), construct a tree of decision nodes by hand in order to classify the data below, i.e. map each datum $x$ to a label $y$ . Select tests to be as small as possible (in terms of attributes), breaking ties among tests with the same number of attributes by selecting the one that classifies the greatest number of examples correctly. If multiple tests have the same number of attributes and classify the same number of examples, then break the tie using attributes with lower index numbers (e.g. select $A_{1}$ over $A_{2}$ )

Datum	$A_{1}$	$A_{2}$	$A_{3}$	$A_{4}$	y
$x_{1}$	1	0	0	0	1
$x_{2}$	1	0	1	1	1
$x_{3}$	0	1	0	0	1
$x_{4}$	0	1	1	0	0
$x_{5}$	1	1	0	1	1
$x_{6}$	0	1	0	1	0
$x_{7}$	0	0	1	1	1
$x_{8}$	0	0	1	0	0

Hints: To get started, it might help to draw out the tree by hand with each attribute representing a node.

To create the decision function to pass to your DecisionNode, you can create a lambda expression as follows:

func = lambda feature : feature[2] == 0

in which we would choose the left node if the third attribute is 0.

For example, if your tree looked like this: if A1=0 then class = 1; else class = 0

   A1
 /   \
1      0

You would write your code like this:

decision_tree_root= DecisionNode(None, None, lambda a1: a1 == 0)
decision_tree_root.left = DecisionNode(None, None, None, 1)
decision_tree_root.right = DecisionNode(None, None, None, 0)
return decision_tree_root

Requirements: The tree nodes should be less than 10 nodes including the leaf (not the number of instances, but the actual nodes in the tree).

In [ ]:

examples = [[1,0,0,0],
            [1,0,1,1],
            [0,1,0,0],
            [0,1,1,0],
            [1,1,0,1],
            [0,1,0,1],
            [0,0,1,1],
            [0,0,1,0]]

classes = [1,1,1,0,1,0,1,0]

In [ ]:

def build_decision_tree():
    """Create decision tree
    capable of handling the provided 
    data."""
    # TODO: build full tree from root
    decision_tree_root = None

    return decision_tree_root

In [ ]:

decision_tree_root = build_decision_tree()

Part 1b: Precision, Recall, Accuracy and Confusion Matrix

12 pts.

Now that we have a decision tree, we’re going to need some way to evaluate its performance. In most cases we’d reserve a portion of the training data for evaluation, or use cross-validation. For now let’s just see how your tree does on the provided examples. In the stubbed out code below, fill out the methods to compute the confusion matrix, accuracy, precision and recall for your classifier output. classifier_output is just the list of labels that your classifier outputs, corresponding to the same examples as true_labels. You can refer to Wikipedia for calculating the true/false positive/negative.

You should get 1.0 (float) for precision, recall and accuracy.

You can create a simple example for the confusion matrix for testing purposes and count by hand.

In [ ]:

def confusion_matrix(classifier_output, true_labels):
    #TODO output should be [[true_positive, false_negative], [false_positive, true_negative]]
    #TODO output is a list
    raise NotImplemented()

def precision(classifier_output, true_labels):
    #TODO precision is measured as: true_positive/ (true_positive + false_positive)
    raise NotImplemented()
    
def recall(classifier_output, true_labels):
    #TODO: recall is measured as: true_positive/ (true_positive + false_negative)
    raise NotImplemented()
    
def accuracy(classifier_output, true_labels):
    #TODO accuracy is measured as:  correct_classifications / total_number_examples
    raise NotImplemented()
    
classifier_output = [decision_tree_root.decide(example) for example in examples]
  
p1_confusion_matrix = confusion_matrix(classifier_output, classes)
p1_accuracy = accuracy( classifier_output, classes )
p1_precision = precision(classifier_output, classes)
p1_recall = recall(classifier_output, classes)

print p1_confusion_matrix, p1_accuracy, p1_precision, p1_recall

Part 2a: Decision Tree Learning

6 pts.

You will need to implement entropy() and information_gain() in order to do so (hints here) and here). Test cases have been provided.

In [ ]:

def entropy(class_vector):
    """Compute the entropy for a list
    of classes (given as either 0 or 1)."""
    # TODO: finish this
    raise NotImplemented()
    
def information_gain(previous_classes, current_classes ):
    """Compute the information gain between the
    previous and current classes (a list of 
    lists where each list has 0 and 1 values)."""
    # TODO: finish this
    raise NotImplemented()


def test_information_gain():
   """ Assumes information_gain() accepts (classes, [list of subclasses])
       Feel free to edit / enhance this note with more tests """
   restaurants = [0]*6 + [1]*6
   split_patrons =   [[0,0], [1,1,1,1], [1,1,0,0,0,0]]
   split_food_type = [[0,1],[0,1],[0,0,1,1],[0,0,1,1]]
    
   # If you're using numpy indexing add the following before calling information_gain()
   # split_patrons =   [np.array(i) for i in split_patrons]   #convert to np array 
   # split_food_type = [np.array(i) for i in split_food_type]

   gain_patrons = information_gain(restaurants, split_patrons)
   gain_type = information_gain(restaurants, split_food_type)
   assert round(gain_patrons,3) == 0.541, "Information Gain on patrons should be 0.541"
   assert gain_type == 0.0, "Information gain on type should be 0.0"
   print "Information Gain calculations correct..."
    
   assert (information_gain([1,1,1,0,0,0],[[1,1,1],[0,0,0]])==1),"TEST FAILED"
   assert (round(information_gain([1,1,1,0,0,0],[[1,1,0],[1,0,0]]),2)==0.08),"TEST FAILED"

def test_entropy(): 
    assert (entropy([1,1,1,0,0,0])==1),"TEST FAILED"
    assert (entropy([1,1,1,1,1,1])==0),"TEST FAILED"
    assert (int(entropy([1,1,0,0,0,0])*100)==91),"TEST FAILED"
    
test_information_gain()
test_entropy()

Part 2b: Decision Tree Learning

20 pts.

File to use: part23_data.csv
Grading: average test accuracy over 10 rounds should be >= 70%

As the size of our training set grows, it rapidly becomes impractical to build these trees by hand. We need a procedure to automagically construct these trees.

For starters, let’s consider the following algorithm (a variation of C4.5) for the construction of a decision tree from a given set of examples:

1) Check for base cases: 
     a) If all elements of a list are of the same class, return a leaf node with the appropriate class label.
     b) If a specified depth limit is reached, return a leaf labeled with the most frequent class.

2) For each attribute alpha: evaluate the normalized information gain gained by splitting on attribute $\alpha$

3) Let alpha_best be the attribute with the highest normalized information gain

4) Create a decision node that splits on alpha_best

5) Recur on the sublists obtained by splitting on alpha_best, and add those nodes as children of node

First, in the DecisionTree.__build_tree__() method implement the above algorithm.

Next, in DecisionTree.classify() below, write a function to produce classifications for a list of features once your decision tree has been built.

Some other helpful notes:

1) Your features and classify should be in numpy arrays where if the dataset was (m x n) then the features is (m x n-1) and classify is (m x 1).

2) These features are continuous features and you will need to split based on a threshold.

How grading works:

1) We load part23_data.csv and create our cross-validation training and test set with a k=10 folds. 

2) We classify the training data onto the three then fit the testing data onto the tree. 

3) We check the accuracy of your results versus the true results and we return the average of this over 10 iterations.

In [ ]:

class DecisionTree():
    """Class for automatic tree-building
    and classification."""

    def __init__(self, depth_limit=float('inf')):
        """Create a decision tree with an empty root
        and the specified depth limit."""
        self.root = None
        self.depth_limit = depth_limit

    def fit(self, features, classes):
        """Build the tree from root using __build_tree__()."""
        self.root = self.__build_tree__(features, classes)

    def __build_tree__(self, features, classes, depth=0):  
        """Implement the above algorithm to build
        the decision tree using the given features and
        classes to build the decision functions."""
        #TODO: finish this
        raise NotImplemented()
        
    def classify(self, features):
        """Use the fitted tree to 
        classify a list of examples. 
        Return a list of class labels."""
        class_labels = []
        #TODO: finish this
        raise NotImplemented()
        return class_labels

Part 2c: Validation

10 pts.

File to use: part23_data.csv
Grading: average test accuracy over 10 rounds should be >= 70%

In general, reserving part of your data as a test set can lead to unpredictable performance- a serendipitous choice of your train or test split could give you a very inaccurate idea of how your classifier performs. That’s where k-fold cross validation comes in.

In generate_k_folds(), we’ll split the dataset at random into k equal subsections. Then iterating on each of our k samples, we’ll reserve that sample for testing and use the other k-1 for training. Averaging the results of each fold should give us a more consistent idea of how the classifier is doing across the data as a whole.

How grading works: The same as 2b however, we use your generate_k_folds instead of ours.

In [ ]:

def load_csv(data_file_path, class_index):
    handle = open(data_file_path, 'r')
    contents = handle.read()
    handle.close()
    rows = contents.split('\n')
    out = np.array([[float(i) for i in r.split(',')] for r in rows if r])
    
    if(class_index == -1):
        # this is used for part23_data
        classes= map(int,  out[:,class_index])
        features = out[:,:class_index]
        return features, classes
    elif(class_index == 0):
        # this is used for challenge_train
        classes= map(int,  out[:, class_index])
        features = out[:, 1:]
        return features, classes
    else:
        # this is used for vectorize
        return out

def generate_k_folds(dataset, k):
    #TODO this method should return a list of folds,
    # where each fold is a tuple like (training_set, test_set)
    # where each set is a tuple like (examples, classes)
    raise NotImplemented()
    
dataset = load_csv('part23_data.csv', -1)
ten_folds = generate_k_folds(dataset, 10)

accuracies = []
precisions = []
recalls = []
confusion = []

for fold in ten_folds:
    train, test = fold
    train_features, train_classes = train
    test_features, test_classes = test
    tree = DecisionTree( )
    tree.fit( train_features, train_classes)
    output = tree.classify(test_features)
    
    accuracies.append( accuracy(output, test_classes))
    precisions.append( precision(output, test_classes))
    recalls.append( recall(output, test_classes))
    confusion.append( confusion_matrix(output, test_classes))

Part 3: Random Forests

30 pts.

File to use: part23_data.csv
Grading: average test accuracy over 10 rounds should be >= 75%

The decision boundaries drawn by decision trees are very sharp, and fitting a decision tree of unbounded depth to a list of training examples almost inevitably leads to overfitting. In an attempt to decrease the variance of our classifier we’re going to use a technique called ‘Bootstrap Aggregating’ (often abbreviated ‘bagging’).

A Random Forest is a collection of decision trees, build as follows:

1) For every tree we're going to build:

    a) Subsample the examples provided us (with replacement) in accordance with a provided example subsampling rate.

    b) From the sample in a), choose attributes at random to learn on (in accordance with a provided attribute subsampling rate)

    c) Fit a decision tree to the subsample of data we've chosen (to a certain depth)

Classification for a random forest is then done by taking a majority vote of the classifications yielded by each tree in the forest after it classifies an example.

Fill in RandomForest.fit() to fit the decision tree as we describe above, and fill in RandomForest.classify() to classify a given list of examples.

Your features and classify should be in numpy arrays where if the dataset was (m x n) then the features is (m x n-1) and classify is (n x 1).

To test, we will be using a forest with 5 trees, with a depth limit of 5, example subsample rate of 0.5 and attribute subsample rate of 0.5

How grading works: similar to 2b but with the call to Random Forest.

In [ ]:

class RandomForest():
    """Class for random forest
    classification."""

    def __init__(self, num_trees, depth_limit, example_subsample_rate, attr_subsample_rate):
        """Create a random forest with a fixed 
        number of trees, depth limit, example
        sub-sample rate and attribute sub-sample
        rate."""
        self.trees = []
        self.num_trees = num_trees
        self.depth_limit = depth_limit
        self.example_subsample_rate = example_subsample_rate
        self.attr_subsample_rate = attr_subsample_rate

    def fit(self, features, classes):
        """Build a random forest of 
        decision trees."""
        # TODO implement the above algorithm
        raise NotImplemented()

    def classify(self, features):
        """Classify a list of features based
        on the trained random forest."""
        # TODO implement classification for a random forest.
        raise NotImplemented()

Part 4: Challenge Classifier

10 points.

File to use: challenge_train.csv
Grading: average training accuracy over 10 runs should be >= 80% and average ru accuracy over 10 runs should be >= 70%

You should be implementing some sort of a tree structure, students in the past have been able to call their RandomForest with different parameters. We also encourage things like boosting.

You’ve been provided with a sample of data from a research dataset in ‘challenge_train.csv’ while we have reserved a part of the dataset for testing called challenge_test.csv (which you do not have access to).

To get full points for this part of the assignment, you’ll need to get at least an average accuracy of 80% on the training data you have (challenge_train.csv), and at least an average accuracy of 70% on the holdout/test set (challenge_test.csv). We do provide how long it takes for your training and testing to run.

In [ ]:

class ChallengeClassifier():
    
    def __init__(self):
        # initialize whatever parameters you may need here-
        # this method will be called without parameters 
        # so if you add any to make parameter sweeps easier, provide defaults
        raise NotImplemented()
        
    def fit(self, features, classes):
        # fit your model to the provided features
        raise NotImplemented()
        
    def classify(self, features):
        # classify each feature in features as either 0 or 1.
        raise NotImplemented()

Part 5: Vectorization!

7 points.

File to use: vectorize.csv

Last semester, students struggled a lot with assignment 5 not because of the assignment but of the vectorization requirement so that it can run under the time limit we have. As a result, we are adding a small section to this assignment that will hopefully introduce you to vectorization and some of the cool tricks you can use in python. We encourage you to use any numpy function out there (on good faith) to do the following functions.

For the three functions that we have, we are testing your code based on how fast it runs. It will need to beat the non-vectorized code to get full points.

As a reminder, please don’t ask the TA’s for help regarding this section, we will not be able to assist you in any way. This section was created to help get you ready for assignment_5; feel free to ask other students on Piazza or use the Internet.

How grading works: we run the non-vectorized code and your vectorized code 500 times, as long as the average time of your vectorized code is less than the average time of the non-vectorized code, you will get the points (given that your answer is correct).

In [ ]:

import time
import resource
import numpy as np

class Vectorization():
    
    def load_csv(self,data_file_path, class_index):
        handle = open(data_file_path, 'r')
        contents = handle.read()
        handle.close()
        rows = contents.split('\n')
        out = np.array([[float(i) for i in r.split(',')] for r in rows if r])

        if(class_index == -1):
            classes= map(int,  out[:,class_index])
            features = out[:,:class_index]
            return features, classes
        elif(class_index == 0):
            classes= map(int,  out[:, class_index])
            features = out[:, 1:]
            return features, classes
        else:
            return out

    # Vectorization #1: Loops!
    # This function takes one matrix, multiplies by itself and then adds to itself.
    # Output: return a numpy array
    # 1 point
    def non_vectorized_loops(self, data):
        non_vectorized = np.zeros(data.shape)
        for row in range(data.shape[0]):
            for col in range(data.shape[1]):
                non_vectorized[row][col] = data[row][col] * data[row][col] + data[row][col]
        return non_vectorized

    def vectorized_loops(self, data):
        # TODO vectorize the code from above
        # Bonnie time to beat: 0.09 seconds
        raise NotImplemented()
        
    def vectorize_1(self):
        data = self.load_csv('vectorize.csv', 1)
        start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        real_answer = self.non_vectorized_loops(data)
        end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        print 'Non-vectorized code took %s seconds' % str(end_time-start_time)

        start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        my_answer = self.vectorized_loops(data)
        end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        print 'Vectorized code took %s seconds' % str(end_time-start_time)
        
        assert np.array_equal(real_answer, my_answer), "TEST FAILED"
    
    # Vectorization #2: Slicing and summation
    # This function searches through the first 100 rows, looking for the row with the max sum 
    # (ie, add all the values in that row together)
    # Output: return the max sum as well as the row number for the max sum
    # 3 points
    def non_vectorized_slice(self, data):
        max_sum = 0
        max_sum_index = 0
        for row in range(100):
            temp_sum = 0
            for col in range(data.shape[1]):
                temp_sum += data[row][col]

            if (temp_sum > max_sum):
                max_sum = temp_sum
                max_sum_index = row

        return max_sum, max_sum_index

    def vectorized_slice(self, data):
        # TODO vectorize the code from above
        # Bonnie time to beat: 0.07 seconds
        raise NotImplemented()
        
    def vectorize_2(self):
        data = self.load_csv('vectorize.csv', 1)
        start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        real_sum, real_sum_index = self.non_vectorized_slice(data)
        end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        print 'Non-vectorized code took %s seconds' % str(end_time-start_time)

        start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        my_sum, my_sum_index = self.vectorized_slice(data)
        end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        print 'Vectorized code took %s seconds' % str(end_time-start_time)

        assert (real_sum==my_sum),"TEST FAILED"
        assert (real_sum_index==my_sum_index), "TEST FAILED"
        
    # Vectorization #3: Flattening and dictionaries 
    # This function flattens down data into a 1d array, creates a dictionary of how often a  
    # positive number appears in the data and displays that value
    # Output: list of tuples [(1203,3)] = 1203 appeared 3 times in data
    # 3 points
    def non_vectorized_flatten(self, data):
        unique_dict = {}
        flattened = np.hstack(data)
        for item in range(len(flattened)):
            if flattened[item] > 0:
                if flattened[item] in unique_dict:
                    unique_dict[flattened[item]] += 1
                else:
                    unique_dict[flattened[item]] = 1

        return unique_dict.items()

    def vectorized_flatten(self, data):
        # TODO vectorize the code from above
        # Bonnie time to beat: 15 seconds
        raise NotImplemented()

    def vectorize_3(self):
        data = self.load_csv('vectorize.csv', 1)
        start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        answer_unique = self.non_vectorized_flatten(data)
        end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        print 'Non-vectorized code took %s seconds'% str(end_time-start_time)

        start_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        my_unique = self.vectorized_flatten(data)
        end_time = resource.getrusage(resource.RUSAGE_SELF).ru_utime * 1000
        print 'Vectorized code took %s seconds'% str(end_time-start_time)

        assert np.array_equal(answer_unique, my_unique), "TEST FAILED"

In [ ]:

vectorize = Vectorization()

vectorize.vectorize_1()
vectorize.vectorize_2()
vectorize.vectorize_3()

Bonus

Note: this part will be changing. Official annoucements for this bonus will be made through Piazza

We will be having a competition using your challenge classifier and a dataset of our choice. We will provide you with a portion of the dataset as well as the testing data (but without the labels) and you will upload your solution as a csv to Kaggle. Kaggle will evaluate your scores and the classifier with the highest accuracy will win the competiton. Any ties will be broken by the submission time.

We are still figuring out all the details for this bonus so hopefully it will be out by the time the midterm period is over. We will keep the competition available for at least a few weeks.

First place: 3 bonus points on your final grade
Second place: 2 bonus points on your final grade
Third place: 1 bonus point on your final grade

In [ ]:

CS 6601 Assignment 4: Decision Tree Learning solved

Download Details:

Description

Abstract

Motivation

Objectives

Evaluation

Datasets

Imports

Introduction

Part 1a: Building a Binary Tree by Hand

Part 1b: Precision, Recall, Accuracy and Confusion Matrix

Part 2a: Decision Tree Learning

Part 2b: Decision Tree Learning

Part 2c: Validation

Part 3: Random Forests

Part 4: Challenge Classifier

Part 5: Vectorization!

Bonus

CS 6601 Assignment 4: Decision Tree Learning solved

Download Details:

Description

Abstract

Motivation

Objectives

Evaluation

Datasets

Imports

Introduction

Part 1a: Building a Binary Tree by Hand

Part 1b: Precision, Recall, Accuracy and Confusion Matrix

Part 2a: Decision Tree Learning

Part 2b: Decision Tree Learning

Part 2c: Validation

Part 3: Random Forests

Part 4: Challenge Classifier

Part 5: Vectorization!

Bonus

Related products

Solved CS 6601 Assignment 3: Probability

CS 6601 Assignment 2 Search solved

CS6601 Assignment 6 solved