Description
HW05: Practice with algorithm selection, grid search, cross validation, multiclass classification, one-class classification, imbalanced data, and model selection.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import mixture
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import svm, linear_model, datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
accuracy_score, roc_auc_score, RocCurveDisplay)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.over_sampling import RandomOverSampler
1. Algorithm selection for multiclass classification by optical recognition of handwritten digits
The digits dataset has 1797 labeled images of hand-written digits.
- 𝑋 = `digits.data` has shape (1797, 64).
- Each image 𝑥𝑖 is represented as the 𝑖th row of 64 pixel values in the 2D `digits.data` array that corresponds to an 8×8 photo of a handwritten digit.
- 𝑦 = `digits.target` has shape (1797,). Each 𝑦𝑖 is a number from 0 to 9 indicating the handwritten digit that was photographed and stored in 𝑥𝑖.
1(a) Load the digits dataset and split it into training, validation, and test sets as I did in the lecture example code 07ensemble.html.
This step does not need to display any output.
# ... your code here ...
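For concreteness, here is a minimal sketch of one way to make the three-way split. The proportions, the random_state, and the variable names (X_train, X_valid, X_test, and so on) are my assumptions; 07ensemble.html is the authoritative reference for how it was done in lecture.

digits = datasets.load_digits()
# split off a test set first, then split the remainder into training and validation sets
X_trainvalid, X_test, y_trainvalid, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainvalid, y_trainvalid, test_size=0.25, random_state=0)
print(X_train.shape, X_valid.shape, X_test.shape)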
1(b) Use algorithm selection on training and validation data to choose a best classifier.
Loop through these four classifiers and corresponding parameters, doing a grid search to find the best hyperparameter setting. Use only the training data for the grid search.
- SVM:
  - Try all values of `kernel` in 'linear', 'rbf'.
  - Try all values of `C` in 0.01, 1, 100.
- logistic regression:
  - Use `max_iter=5000` to avoid a nonconvergence warning.
  - Try all values of `C` in 0.01, 1, 100.
- ID3 decision tree:
  - Use `criterion='entropy'` to get our ID3 tree.
  - Try all values of `max_depth` in 1, 3, 5, 7.
- kNN:
  - (Use the default Euclidean distance.)
  - Try all values of `n_neighbors` in 1, 2, 3, 4.
Hint:
- Make a list of the four classifiers without setting any hyperparameters.
- Make a list of four corresponding parameter dictionaries.
- Loop through 0, 1, 2, 3:
  - Run grid search on the 𝑖th classifier with the 𝑖th parameter dictionary on the training data. (The grid search does its own cross-validation using the training data.)
  - Use the 𝑖th classifier with its best hyperparameter settings (just `clf` from `clf = GridSearchCV(...)`) to find the accuracy of the model on the validation data, i.e. find `clf.score(X_valid, y_valid)`.
  - Keep track, as your loop progresses, of:
    - the index 𝑖 of the best classifier (initialize it to -1 or some other value)
    - the best accuracy score on validation data (initialize it to `-np.Inf`)
    - the best classifier with its hyperparameter settings, that is the best `clf` from `clf = GridSearchCV(...)` (initialize it to `None` or some other value)
I needed about 30 lines of code to do this. It took a minute to run.
# ... your code here ...
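One possible sketch of that loop, assuming the X_train/X_valid/y_train/y_valid names from 1(a). The fixed settings (max_iter=5000, criterion='entropy') are passed to the constructors, and only the grids listed above are searched.

# algorithm selection: grid search each classifier on training data,
# then compare the tuned classifiers on validation data
classifiers = [svm.SVC(), linear_model.LogisticRegression(max_iter=5000),
               DecisionTreeClassifier(criterion='entropy'), KNeighborsClassifier()]
param_grids = [{'kernel': ['linear', 'rbf'], 'C': [0.01, 1, 100]},
               {'C': [0.01, 1, 100]},
               {'max_depth': [1, 3, 5, 7]},
               {'n_neighbors': [1, 2, 3, 4]}]
best_i, best_score, best_clf = -1, -np.inf, None
for i in range(4):
    clf = GridSearchCV(estimator=classifiers[i], param_grid=param_grids[i])
    clf.fit(X_train, y_train)            # grid search cross-validates on training data only
    score = clf.score(X_valid, y_valid)  # compare tuned models on validation data
    print(f'{classifiers[i].__class__.__name__}: best params={clf.best_params_}, '
          f'validation accuracy={score:.3f}')
    if score > best_score:
        best_i, best_score, best_clf = i, score, clf
print(f'best classifier index: {best_i}, validation accuracy: {best_score:.3f}')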
1(c) Use the test data to evaluate your best classifier and its hyperparameter settings from 1(b).
- Report the result of calling `.score(X_test, y_test)` on your best classifier/hyperparameters.
- Make a confusion matrix from the true `y_test` values and the corresponding 𝑦^ values predicted by your best classifier/hyperparameters on `X_test`.
- For each of the wrong predictions (where `y_test` and your 𝑦^ values disagree), show:
  - The index 𝑖 in the test data of that example 𝑥𝑖
  - The correct label 𝑦𝑖
  - Your incorrect prediction 𝑦^𝑖
  - A plot of that image
# ... your code here ...
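A sketch of one way to do this, assuming best_clf, X_test, and y_test from the earlier sketches; the figure size and color map are my own choices.

print(f'test accuracy: {best_clf.score(X_test, y_test):.3f}')
y_hat = best_clf.predict(X_test)
print(confusion_matrix(y_test, y_hat))
wrong = np.nonzero(y_hat != y_test)[0]           # test indices where prediction disagrees
for i in wrong:
    print(f'i={i}, correct y={y_test[i]}, predicted y^={y_hat[i]}')
    plt.figure(figsize=(2, 2))
    plt.imshow(X_test[i].reshape(8, 8), cmap='gray_r')  # 64 pixel values back to an 8x8 image
    plt.show()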
2. One-class classification (outlier detection)
2(a) There is an old gradebook of mine at http://pages.stat.wisc.edu/~jgillett/451/data/midtermGrades.txt.
Use `pd.read_table()` to read it into a DataFrame.
Hint: `pd.read_table()` has many parameters. Check its documentation to find three parameters to:
- Read from the given URL
- Use the separator ‘\s+’, which means ‘one or more whitespace characters’
- Skip the first 12 rows, as they are a note to students and not part of the gradebook
# ... your code here ...
df = pd.read_table(filepath_or_buffer='https://pages.stat.wisc.edu/~jgillett/451/data/midtermGrades.txt',
                   sep=r'\s+', skiprows=12)
2(b) Use `clf = mixture.GaussianMixture(n_components=1)` to make a one-class Gaussian model to decide which 𝑥 = (Exam1, Exam2) are outliers:
- Set a matrix X to the first two columns, Exam1 and Exam2.
- These exams were worth 125 points each. Transform scores to percentages in [0, 100].
  Hint: I tried the MinMaxScaler() first, but it does the wrong thing if there aren't scores of 0 and 125 in each column. So I just multiplied the whole matrix by 100 / 125.
- Fit your classifier to X.
  Hint:
  - The reference page for `mixture.GaussianMixture` includes a `fit(X, y=None)` method with the comment that y is ignored (as this is an unsupervised learning algorithm; there is no 𝑦) but present for API consistency. So we can fit with just X.
  - I got a warning about "KMeans ... memory leak". You may ignore this warning if you see it; I still got satisfactory results.
- Print the center 𝜇 and covariance matrix 𝛴 from the two-variable 𝑁2(𝜇,𝛴) distribution you estimated.
# ... your code here ...
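A sketch, under the assumption that Exam1 and Exam2 really are the first two columns of df (check df.columns to confirm).

X = df.iloc[:, 0:2].to_numpy() * 100 / 125   # Exam1, Exam2 rescaled from /125 to percentages
clf = mixture.GaussianMixture(n_components=1)
clf.fit(X)                                   # unsupervised: no y is needed
print(f'mu = {clf.means_}')
print(f'Sigma = {clf.covariances_}')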
2(c) Here I have given you code to make a contour plot of the negative log likelihood −ln 𝑓𝜇,𝛴(𝑥) for 𝑋∼𝑁2(𝜇,𝛴), provided you have set `clf`.
# make contour plot of log-likelihood of samples from clf.score_samples()
margin = 10
x = np.linspace(0 - margin, 100 + margin)
y = np.linspace(0 - margin, 100 + margin)
grid_x, grid_y = np.meshgrid(x, y)
two_column_grid_x_grid_y = np.array([grid_x.ravel(), grid_y.ravel()]).T
negative_log_pdf_values = -clf.score_samples(two_column_grid_x_grid_y)
grid_z = negative_log_pdf_values
grid_z = grid_z.reshape(grid_x.shape)
plt.contour(grid_x, grid_y, grid_z, levels=10) # X, Y, Z
plt.title('(Exam1, Exam2) pairs')
Paste my code into your code cell below and add more code:
- Add black 𝑥- and 𝑦-axes. Label them Exam1 and Exam2.
- Plot the data points in blue.
- Plot 𝜇 = `clf.means_` as a big lime dot.
- Overplot (i.e. plot again) in red the 8 outliers determined by a threshold consisting of the 0.02 quantile of the pdf values 𝑓𝜇,𝛴(𝑥) for each 𝑥 in X.

Hint: `clf.score_samples(X)` gives log likelihood, so `np.exp(clf.score_samples(X))` gives the required 𝑓𝜇,𝛴(𝑥) values.
# ... your code here ...
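One possible set of additions, meant to follow the pasted contour-plot code in the same cell. It assumes clf and X from 2(b); drawing the axes with axhline/axvline is one interpretation of "black axes", and the comparison operator may need adjusting to capture exactly 8 points.

plt.axhline(y=0, color='black')                  # black x-axis
plt.axvline(x=0, color='black')                  # black y-axis
plt.xlabel('Exam1')
plt.ylabel('Exam2')
plt.plot(X[:, 0], X[:, 1], 'b.')                 # data points in blue
plt.plot(clf.means_[0, 0], clf.means_[0, 1], 'o',
         color='lime', markersize=15)            # mu as a big lime dot
pdf_values = np.exp(clf.score_samples(X))        # f_{mu,Sigma}(x) for each x in X
threshold = np.quantile(pdf_values, 0.02)        # 0.02 quantile of the pdf values
outliers = X[pdf_values < threshold]
plt.plot(outliers[:, 0], outliers[:, 1], 'r.')   # overplot the outliers in red
plt.show()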
What characterizes 7 of these 8 outliers? Write your answer in a markdown cell.
# ... your English text in a Markdown cell here ...
2(d) Write a little code to report whether, by the 0.02 quantile criterion, 𝑥= (Exam1=50, Exam2=100) is an outlier.
Hint: Compare 𝑓𝜇,𝛴(𝑥) to your threshold.
# ... your code here ...
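A sketch, assuming clf and the threshold variable from the 2(c) sketch above.

x_new = np.array([[50, 100]])                  # (Exam1=50, Exam2=100), already in percent
f_x = np.exp(clf.score_samples(x_new))[0]      # f_{mu,Sigma}(x) for the new point
print(f'f(x)={f_x:.6g}, threshold={threshold:.6g}, outlier: {f_x < threshold}')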
3. Explore the fact that accuracy can be misleading for imbalanced data.
Here I make a fake imbalanced data set by randomly sampling 𝑦 from a distribution with 𝑃(𝑦=0)=0.980 and 𝑃(𝑦=1)=0.020.
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, weights=[0.980, 0.020],
n_clusters_per_class=1, flip_y=0.01, random_state=0)
print(f'np.bincount(y)={np.bincount(y)}; we expect about 980 zeros and 20 ones.')
print(f'np.mean(y)={np.mean(y)}; we expect the proportion of ones to be about 0.020.')
np.bincount(y)=[973 27]; we expect about 980 zeros and 20 ones.
np.mean(y)=0.027; we expect the proportion of ones to be about 0.020.
Here I split the data into 50% training and 50% testing data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
random_state=0, stratify=y)
print(f'np.bincount(y_train)={np.bincount(y_train)}')
print(f'np.mean(y_train)={np.mean(y_train)}.')
print(f'np.bincount(y_test)={np.bincount(y_test)}.')
print(f'np.mean(y_test)={np.mean(y_test)}.')
np.bincount(y_train)=[486 14]
np.mean(y_train)=0.028.
np.bincount(y_test)=[487 13].
np.mean(y_test)=0.026.
3a. Train and assess a gradient boosting model.
- Train on the training data.
- Use 100 trees of maximum depth 1 and learning rate 𝛼 = 0.25.
- Use `random_state=0` (so that teacher, TAs, and students have a chance of getting the same results).
- Display the accuracy, precision, recall, and AUC on the test data. Use a labeled print statement with 3 decimal places so the reader can easily find each metric.
- Make an ROC curve from your classifier and the test data.
# ... your code here ...
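A sketch of the train/assess block. The hyperparameters follow the bullet list above; computing AUC from predicted probabilities and passing zero_division=0 to precision_score are my own choices.

clf = GradientBoostingClassifier(n_estimators=100, max_depth=1,
                                 learning_rate=0.25, random_state=0)
clf.fit(X_train, y_train)
y_hat = clf.predict(X_test)
print(f'accuracy:  {accuracy_score(y_test, y_hat):.3f}')
print(f'precision: {precision_score(y_test, y_hat, zero_division=0):.3f}')  # avoid warning if no 1s predicted
print(f'recall:    {recall_score(y_test, y_hat):.3f}')
print(f'AUC:       {roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]):.3f}')
RocCurveDisplay.from_estimator(clf, X_test, y_test)   # ROC curve from classifier and test data
plt.show()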
Note the high accuracy but lousy precision, recall, and AUC.
Note that since the data have about 98% 𝑦=0, we could get about 98% accuracy by just always predicting 𝑦^=0. High accuracy alone is not necessarily helpful.
3b. Now oversample the data to get a balanced data set.
- Use `RandomOverSampler(random_state=0)` to oversample and get a balanced data set.
- Repeat my `train_test_split()` block from above.
- Repeat your train/assess block from above.
# ... your code here ...
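A sketch, assuming the X and y made above; the oversampled data are split and modeled exactly as in 3a.

ros = RandomOverSampler(random_state=0)
X_ros, y_ros = ros.fit_resample(X, y)            # balanced data set
print(f'np.bincount(y_ros)={np.bincount(y_ros)}')
X_train, X_test, y_train, y_test = train_test_split(X_ros, y_ros, test_size=.5,
                                                    random_state=0, stratify=y_ros)
clf = GradientBoostingClassifier(n_estimators=100, max_depth=1,
                                 learning_rate=0.25, random_state=0)
clf.fit(X_train, y_train)
y_hat = clf.predict(X_test)
print(f'accuracy:  {accuracy_score(y_test, y_hat):.3f}')
print(f'precision: {precision_score(y_test, y_hat, zero_division=0):.3f}')
print(f'recall:    {recall_score(y_test, y_hat):.3f}')
print(f'AUC:       {roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]):.3f}')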
Note that we traded a little accuracy for much improved precision, recall, and AUC.
If you do classification in your project and report accuracy, please also report the proportions of 𝑦=0 and 𝑦=1 in your test data so that we get insight into whether your model improves upon always guessing 𝑦^=0 or always guessing 𝑦^=1.