Description
You can work in pairs
In this homework, you will work on your own or in pairs to complete a classification task on a dataset of handwritten digits.
The data set
The dataset is included with this assignment (train-images-idx3-ubyte, train-labels-idx1-ubyte). You can read more about it here: http://yann.lecun.com/exdb/mnist/ A paper that might help quite a bit is available here: http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf
Scikit-learn
You also have to install scikit-learn, a machine learning library for Python, to answer questions in this homework: http://scikit-learn.org/stable/index.html
You should be running the latest stable version of scikit-learn (0.18.1, as of this writing).
If you want an example of how to train and call a classifier from scikit-learn, have a look at the documentation page for the support vector machine:
http://scikit-learn.org/stable/modules/svm.html#multi-class-classification
Most classifiers have similarly good documentation and are called in similar ways.
For easy-to-use model selection, cross-validation, etc., check out this documentation: http://scikit-learn.org/stable/model_selection.html#model-selection
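As a minimal sketch of the train/predict pattern and the cross-validation helpers mentioned above, the following uses scikit-learn's small built-in 8x8 digits dataset (load_digits) as a stand-in for MNIST so that it runs without the assignment files. The kernel, gamma value, and split size are arbitrary illustrative choices, not recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled 8x8 digits, not MNIST.
digits = load_digits()
X, y = digits.data, digits.target  # X has shape (n_samples, 64)

# Hold out 25% of the data for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# scikit-learn classifiers share the same fit / predict / score pattern.
clf = SVC(kernel="rbf", gamma=0.001)  # illustrative hyperparameters
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
test_accuracy = clf.score(X_test, y_test)

# 5-fold cross-validation on the training portion only.
cv_scores = cross_val_score(
    SVC(kernel="rbf", gamma=0.001), X_train, y_train, cv=5)

print("test accuracy: %.3f" % test_accuracy)
print("mean CV accuracy: %.3f" % cv_scores.mean())
```

Swapping SVC for another classifier class usually requires changing only the constructor line.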
Reading the MNIST data
We have given a helper file “mnist.py” (thanks to http://g.sweyla.com/blog/2012/mnist-numpy/) that you can use to read in the data as follows:
In [2]:
import matplotlib.pyplot as plt
import numpy as np
from mnist import load_mnist
%matplotlib inline
images, labels = load_mnist(digits=[9], path='.')
# Display the mean image for digit 9.
plt.imshow(images.mean(axis=0), cmap='gray')
plt.show()
That is what 9, when handwritten, looks like on average. Passing a different list as the digits argument gives you all the images whose labels are in that list (e.g. digits=[0, 1, 2] gives you all the 0s, 1s, and 2s in MNIST). Setting path='.' makes load_mnist look for the MNIST data in the current directory.
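To make the label filtering and the mean image concrete, here is a small sketch on made-up data. The select_digits helper is hypothetical and only illustrates the idea; it is not the code inside mnist.py.

```python
import numpy as np

# Hypothetical stand-in data: six tiny 2x2 "images" with their labels.
all_images = np.arange(24).reshape(6, 2, 2)
all_labels = np.array([0, 9, 1, 9, 2, 0])

def select_digits(images, labels, digits):
    """Return only the images and labels whose label is in `digits`."""
    mask = np.isin(labels, list(digits))
    return images[mask], labels[mask]

nines, nine_labels = select_digits(all_images, all_labels, [9])
small, small_labels = select_digits(all_images, all_labels, [0, 1, 2])

print(nines.shape)    # (2, 2, 2): two images labeled 9
print(small_labels)   # [0 1 2 0]

# The "mean image" is the per-pixel average over the first (image) axis.
mean_nine = nines.mean(axis=0)
print(mean_nine)
```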
1) Exploring the data (1 point)
To load the entire dataset, run the following:
In [3]:
images, labels = load_mnist(digits=range(0, 10), path='.')
Here each images[i] is a single handwritten image of the digit labels[i]. For example, here is images[35] and its label:
In [4]:
i = 35
plt.imshow(images[i], cmap='gray')
plt.title('Handwritten image of the digit ' + str(labels[i]))
plt.show()
It’s a 5!
From this large dataset, you’ll want to pick training and testing sets to build classifiers. To do this carefully, you’ll need to study the dataset. Answer the following questions:
A. (0.5 pt) Look at 50 examples of one of the digits. Show us some of the cases that you think might be challenging for a classifier to recognize. Why do you think they may be challenging?
B. (0.5 pt) How many images are there in total? How many images are there of each digit? You need to pick some subset of the data for training and testing. Pick a set of training and testing data. State how you selected your training and testing sets. Think about the goals of training and testing sets – we pick good training sets so our classifier generalizes to unseen data and we pick good testing sets to see whether our classifier generalizes. Justify your method for selecting the training and testing sets in terms of these goals.
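One possible way to count examples per digit and to draw a class-balanced split is sketched below. It is only an illustration on randomly generated stand-in arrays, and it assumes the label array is one-dimensional (call labels.ravel() first if yours is not). Stratified sampling is one reasonable choice among several; you still need to justify whatever method you pick.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 1000 fake 28x28 images and labels.
rng = np.random.RandomState(0)
labels = rng.randint(0, 10, size=1000)
images = rng.rand(1000, 28, 28)

# Number of examples per digit (index i holds the count for digit i).
counts = np.bincount(labels, minlength=10)
print("total:", len(labels), "per digit:", counts)

# Stratified split: each digit keeps roughly the same share in both subsets.
train_x, test_x, train_y, test_y = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=0)

print("train per digit:", np.bincount(train_y, minlength=10))
print("test per digit: ", np.bincount(test_y, minlength=10))
```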
2) Algorithm Selection (1 point)
Each classifier in scikit-learn has associated with it many hyperparameters (e.g. number of neighbors in KNN). The goal of this assignment is to understand the effect that these hyperparameters have on performance as well as how different classifiers compare to one another.
You’ll build two classifiers on the MNIST data. You must use scikit-learn to build these classifiers (http://scikit-learn.org/stable/). A page of particular note is: http://scikit-learn.org/stable/supervised_learning.html#supervised-learning.
Two different classifiers does not mean two different variants of a classifier. For example, two KNNs with two different K values are not different. Pick two different classification algorithms (e.g. SVM and Decision Tree).
Pick two classification algorithms you can use in the scikit-learn library. For each classification algorithm, do the following:
A. (0.5 pt) At a high level (e.g. a paragraph or two), summarize how each algorithm works and how to use it for classification of images (e.g. how you would encode the data, how you would interpret the output of the classifier).
B. (0.5 pt) For each hyperparameter of each algorithm (e.g. slack in an SVM, or K in a KNN), explain what it varies about the classification algorithm.
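If it helps to see which hyperparameters a scikit-learn estimator exposes, the get_params method lists them. The sketch below uses KNN and SVM only because they are the examples named above; the chosen values are arbitrary.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# get_params() returns every constructor hyperparameter and its current value.
knn = KNeighborsClassifier(n_neighbors=3)
svm = SVC(C=10.0, kernel="rbf")

knn_params = knn.get_params()
svm_params = svm.get_params()

print(knn_params["n_neighbors"])  # 3
print(svm_params["C"])            # 10.0
print(sorted(svm_params))         # names of all SVC hyperparameters

# Hyperparameters can also be changed after construction.
knn.set_params(n_neighbors=7)
```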
3) Classification (part 1) (2 points)
Now that you’ve selected two classifiers, you will build, train and test each one. Pick one of the two classifiers chosen in problem 2 and do the following.
A. (1 pt) Build a classifier. Complete the following starter code (see below) named classifier_1.py. Read all comments in the code carefully.
You need to submit 4 files for this question:
classifier_1.py
classifier_1.p (model file)
training_set_1.p
training_labels_1.p
(*.p files are ‘pickle’ files. ‘Pickling’ is a way to convert a Python object into a byte stream so that it can be saved to disk.)
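For reference, a minimal pickling round trip with Python's standard pickle module looks like the sketch below. The dictionary and the file name classifier_example.p are placeholders; in your submission the pickled object would be your trained classifier or your training data.

```python
import os
import pickle
import tempfile

# Any picklable Python object works; a dict stands in for a trained model here.
model = {"name": "example", "weights": [0.1, 0.2, 0.3]}

# Placeholder file name in a temporary directory (assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), "classifier_example.p")

# Pickle files must be opened in binary mode ("wb" / "rb").
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```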
B. (1 pt) Design, explain, and perform experiments to find the best hyperparameters for your classifier. Show the following graphs illustrating classification performance along two dimensions:
Training set size vs performance on a testing set
Classifier parameters (e.g. number of neighbors in KNN) vs performance on a testing set
Describe and analyze your results.
Show us a confusion matrix for the data. Be sure to label your dimensions clearly on all graphs.
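The two experiments and the confusion matrix could be organized roughly as follows. This is a sketch under stated assumptions: it uses scikit-learn's small built-in digits dataset instead of MNIST, KNN as an example classifier, and arbitrary training sizes and k values.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): bundled 8x8 digits rather than MNIST.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3,
    stratify=digits.target, random_state=0)

# Experiment 1: accuracy on the fixed test set as the training set grows.
train_sizes = [50, 200, 800, len(X_train)]
size_scores = []
for n in train_sizes:
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_train[:n], y_train[:n])
    size_scores.append(clf.score(X_test, y_test))

# Experiment 2: accuracy on the same test set as the hyperparameter k varies.
k_values = [1, 3, 5, 9, 15]
k_scores = []
for k in k_values:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    k_scores.append(clf.score(X_test, y_test))

# Confusion matrix: rows are true digits, columns are predicted digits.
final = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
cm = confusion_matrix(y_test, final.predict(X_test), labels=list(range(10)))

print(list(zip(train_sizes, size_scores)))
print(list(zip(k_values, k_scores)))
print(cm)
```

The paired lists can be passed to plt.plot (for example plt.plot(train_sizes, size_scores)) with plt.xlabel and plt.ylabel to produce labeled graphs.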
4) Classification (part 2) (2 points)
Repeat the same tasks from question 3 with the other classifier you described in question 2.
For Part (A), the code you hand in must be in files with the following names:
classifier_2.py
classifier_2.p (model file)
training_set_2.p
training_labels_2.p
5) Visualizing misclassifications (1 point)
Visualizing misclassifications can sometimes help you understand the behavior of a classifier. Show a set of images that were misclassified by each of your two classifiers. Report how this affects your understanding of the behavior of the classifier or of the dataset. How does this relate to the images you thought might be challenging in the section “Exploring the data” above?
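Locating the misclassified examples comes down to comparing predictions with true labels. The sketch below uses made-up label arrays; with real data, the resulting indices can be used to pick images to display with plt.imshow.

```python
import numpy as np

# Hypothetical true labels and classifier predictions for eight test images.
true_labels = np.array([5, 0, 4, 1, 9, 2, 1, 3])
predicted = np.array([5, 0, 9, 1, 4, 2, 7, 3])

# Indices where the prediction disagrees with the true label.
wrong = np.flatnonzero(predicted != true_labels)
print(wrong)  # [2 4 6]

for i in wrong:
    print("image %d: true %d, predicted %d" % (i, true_labels[i], predicted[i]))
```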
6) Model comparison (1 point)
Compare your two classifiers. Which classifier performed better? Back up that assertion by citing results from your experiment. Why do you think this classifier has better performance?
7) Boosting (2 points)
Boosting is a way to improve classification performance by combining classifiers. Run the AdaBoost algorithm on your training and testing sets using the AdaBoostClassifier class in scikit-learn.
Answer the following questions and submit a Python script named boosting.py that includes two functions, boosting_A( ) and boosting_B( ). Each function takes a training set and a testing set with their labels, and returns the predicted labels and a confusion matrix. Since we deal with 10 classes (0 – 9), the confusion matrix should be a 10×10 array.
A. (0.5 pt) Try AdaBoost with a weak classifier (the default classifier of AdaBoostClassifier). Include a confusion matrix in your write-up. Does the boosted classifier outperform the classifiers you built in question 2?
B. (0.5 pt) Try AdaBoost with an SVM classifier. Before performing boosting, you might need to find the best hyperparameters for the SVM first. Include a confusion matrix in your write-up. Does the boosted classifier outperform the classifiers you built in question 2?
C. (1 pt) Compare two boosted classifiers, one from A and the other from B. Which one is better? How did you compare their performances? Show us data.
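The general shape of a function that boosts, predicts, and returns a 10×10 confusion matrix might look like the sketch below. The function name boost_and_evaluate and its argument order are hypothetical (the required names and signatures are those stated above), and it runs on scikit-learn's small built-in digits dataset rather than MNIST. With no base classifier specified, AdaBoostClassifier boosts depth-1 decision trees.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def boost_and_evaluate(train_x, train_y, test_x, test_y, n_estimators=50):
    """Fit AdaBoost with its default weak learner.

    Returns the predicted test labels and a 10x10 confusion matrix.
    Hypothetical helper: name and argument order are assumptions.
    """
    booster = AdaBoostClassifier(n_estimators=n_estimators, random_state=0)
    booster.fit(train_x, train_y)
    predicted = booster.predict(test_x)
    # Passing all ten labels forces a 10x10 matrix even if some digit
    # never appears among the true or predicted labels.
    matrix = confusion_matrix(test_y, predicted, labels=list(range(10)))
    return predicted, matrix

# Stand-in data (assumption): bundled 8x8 digits rather than MNIST.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3,
    stratify=digits.target, random_state=0)

predicted, matrix = boost_and_evaluate(X_train, y_train, X_test, y_test)
print(matrix.shape)
print("accuracy: %.3f" % (matrix.trace() / float(matrix.sum())))
```

Accuracy can be read off the confusion matrix as the sum of the diagonal divided by the total count, which gives one common basis for comparing two boosted classifiers on the same test set.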