# HW05: Practice with algorithm selection, grid search, cross-validation, multiclass classification, one-class classification, imbalanced data, and model selection

## Description

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import mixture
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import svm, linear_model, datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
accuracy_score, roc_auc_score, RocCurveDisplay)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.over_sampling import RandomOverSampler
```

## 1. Algorithm selection for multiclass classification by optical recognition of handwritten digits

The digits dataset has 1797 labeled images of hand-written digits.

- 𝑋 = `digits.data` has shape (1797, 64).
- Each image 𝑥𝑖 is represented as the 𝑖th row of 64 pixel values in the 2D `digits.data` array, which corresponds to an 8×8 photo of a handwritten digit.
- 𝑦 = `digits.target` has shape (1797,). Each 𝑦𝑖 is a number from 0 to 9 indicating the handwritten digit that was photographed and stored in 𝑥𝑖.

### 1(a) Load the digits dataset and split it into training, validation, and test sets as I did in the lecture example code 07ensemble.html.

This step does not need to display any output.

```
# ... your code here ...
```
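A minimal sketch of one way to do such a split; the 60/20/20 proportions and `random_state=0` here are assumptions, so match whatever 07ensemble.html actually uses:

```
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# First hold out 20% as a test set, then split the remaining 80% into
# train and validation (0.25 of the remainder gives 20% of the whole).
# Proportions and random_state are assumptions; follow the lecture code.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)
```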

### 1(b) Use algorithm selection on training and validation data to choose a best classifier.

Loop through these four classifiers and corresponding parameters, doing a grid search to find the best hyperparameter setting. Use only the training data for the grid search.

- SVM:
  - Try all values of `kernel` in 'linear', 'rbf'.
  - Try all values of `C` in 0.01, 1, 100.
- logistic regression:
  - Use `max_iter=5000` to avoid a nonconvergence warning.
  - Try all values of `C` in 0.01, 1, 100.
- ID3 decision tree:
  - Use `criterion='entropy'` to get our ID3 tree.
  - Try all values of `max_depth` in 1, 3, 5, 7.
- kNN:
  - Use the default Euclidean distance.
  - Try all values of `n_neighbors` in 1, 2, 3, 4.

Hint:

- Make a list of the four classifiers without setting any hyperparameters.
- Make a list of four corresponding parameter dictionaries.
- Loop through 𝑖 = 0, 1, 2, 3:
  - Run grid search on the 𝑖th classifier with the 𝑖th parameter dictionary on the training data. (The grid search does its own cross-validation using the training data.)
  - Use the 𝑖th classifier with its best hyperparameter settings (just `clf` from `clf = GridSearchCV(...)`) to find the accuracy of the model on the validation data, i.e. find `clf.score(X_valid, y_valid)`.
  - Keep track, as your loop progresses, of:
    - the index 𝑖 of the best classifier (initialize it to `-1` or some other value)
    - the best accuracy score on validation data (initialize it to `-np.inf`)
    - the best classifier with its hyperparameter settings, that is, the best `clf` from `clf = GridSearchCV(...)` (initialize it to `None` or some other value)

I needed about 30 lines of code to do this. It took a minute to run.

```
# ... your code here ...
```
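A condensed sketch of that loop, assuming the 60/20/20 split and `random_state=0` from 1(a) (reuse your own split there):

```
import numpy as np
from sklearn import datasets, svm, linear_model
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_tmp, X_test, y_tmp, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)

classifiers = [svm.SVC(),
               linear_model.LogisticRegression(max_iter=5000),
               DecisionTreeClassifier(criterion='entropy'),
               KNeighborsClassifier()]
param_grids = [{'kernel': ['linear', 'rbf'], 'C': [0.01, 1, 100]},
               {'C': [0.01, 1, 100]},
               {'max_depth': [1, 3, 5, 7]},
               {'n_neighbors': [1, 2, 3, 4]}]

best_i, best_score, best_clf = -1, -np.inf, None
for i in range(4):
    clf = GridSearchCV(classifiers[i], param_grids[i])  # CV on training data only
    clf.fit(X_train, y_train)
    score = clf.score(X_valid, y_valid)  # compare models on validation data
    print(f'classifier {i}: validation accuracy {score:.3f}, '
          f'best params {clf.best_params_}')
    if score > best_score:
        best_i, best_score, best_clf = i, score, clf
print(f'best classifier index: {best_i}')
```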

### 1(c) Use the test data to evaluate your best classifier and its hyperparameter settings from 1(b).

- Report the result of calling `.score(X_test, y_test)` on your best classifier/hyperparameters.
- Make a confusion matrix from the true `y_test` values and the corresponding 𝑦^ values predicted by your best classifier/hyperparameters on `X_test`.
- For each of the wrong predictions (where `y_test` and your 𝑦^ values disagree), show:
  - the index 𝑖 in the test data of that example 𝑥𝑖
  - the correct label 𝑦𝑖
  - your incorrect prediction 𝑦^𝑖
  - a plot of that image

```
# ... your code here ...
```
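A sketch of the reporting pattern, using a plain `svm.SVC()` as a stand-in for whichever classifier wins in 1(b) (the split here is an assumption; use your own from 1(a)):

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)
clf = svm.SVC().fit(X_train, y_train)  # stand-in for your best classifier

print(f'test accuracy: {clf.score(X_test, y_test):.3f}')
y_hat = clf.predict(X_test)
cm = confusion_matrix(y_test, y_hat)
print(cm)

wrong = np.nonzero(y_test != y_hat)[0]  # indices of wrong predictions
for i in wrong:
    print(f'i={i}, correct y={y_test[i]}, predicted y_hat={y_hat[i]}')
    plt.figure()
    plt.imshow(X_test[i].reshape(8, 8), cmap='gray_r')  # 64 pixels -> 8x8 image
    plt.title(f'i={i}: true {y_test[i]}, predicted {y_hat[i]}')
```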

## 2. One-class classification (outlier detection)

### 2(a) There is an old gradebook of mine at http://pages.stat.wisc.edu/~jgillett/451/data/midtermGrades.txt.

Use `pd.read_table()` to read it into a DataFrame.

Hint: `pd.read_table()` has many parameters. Check its documentation to find three parameters to:

- read from the given URL
- use the separator `'\s+'`, which means "one or more whitespace characters"
- skip the first 12 rows, as they are a note to students and not part of the gradebook

```
# ... your code here ...
```

```
df = pd.read_table(filepath_or_buffer='https://pages.stat.wisc.edu/~jgillett/451/data/midtermGrades.txt',
sep='\s+', skiprows=12)
```

### 2(b) Use `clf = mixture.GaussianMixture(n_components=1)` to make a one-class Gaussian model to decide which 𝑥=(Exam1, Exam2) are outliers.

- Set a matrix X to the first two columns, Exam1 and Exam2.
- These exams were worth 125 points each. Transform scores to percentages in [0, 100].
  Hint: I tried `MinMaxScaler()` first, but it does the wrong thing if there aren't scores of 0 and 125 in each column. So I just multiplied the whole matrix by 100 / 125.
- Fit your classifier to X. Hint:
  - The reference page for `mixture.GaussianMixture` includes a `fit(X, y=None)` method with the comment that y is ignored (as this is an unsupervised learning algorithm, there is no 𝑦) but present for API consistency. So we can fit with just X.
  - I got a warning about "KMeans ... memory leak". You may ignore this warning if you see it; I still got satisfactory results.
- Print the center 𝜇 and covariance matrix 𝛴 from the two-variable 𝑁2(𝜇,𝛴) distribution you estimated.

```
# ... your code here ...
```
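A sketch of the fit-and-print step on synthetic stand-in data (the means, covariance, and sample size below are invented for illustration; your real `X` comes from the gradebook read in 2(a)):

```
import numpy as np
from sklearn import mixture

# Synthetic stand-in for the (Exam1, Exam2) percentage scores; the real X
# comes from the gradebook DataFrame in 2(a), scaled by 100 / 125.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[75, 70],
                            cov=[[100, 60], [60, 120]], size=90)

clf = mixture.GaussianMixture(n_components=1)
clf.fit(X)  # y is ignored; this is unsupervised

print('mu =', clf.means_[0])             # estimated center
print('Sigma =\n', clf.covariances_[0])  # estimated covariance matrix
```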

### 2(c) Here I have given you code to make a contour plot of the negative log likelihood −ln 𝑓𝜇,𝛴(𝑥) for 𝑋∼𝑁2(𝜇,𝛴), provided you have set `clf`.

```
# make contour plot of log-likelihood of samples from clf.score_samples()
margin = 10
x = np.linspace(0 - margin, 100 + margin)
y = np.linspace(0 - margin, 100 + margin)
grid_x, grid_y = np.meshgrid(x, y)
two_column_grid_x_grid_y = np.array([grid_x.ravel(), grid_y.ravel()]).T
negative_log_pdf_values = -clf.score_samples(two_column_grid_x_grid_y)
grid_z = negative_log_pdf_values
grid_z = grid_z.reshape(grid_x.shape)
plt.contour(grid_x, grid_y, grid_z, levels=10) # X, Y, Z
plt.title('(Exam1, Exam2) pairs')
```

Paste my code into your code cell below and add more code:

- Add black 𝑥- and 𝑦-axes. Label them Exam1 and Exam2.
- Plot the data points in blue.
- Plot 𝜇 = `clf.means_` as a big lime dot.
- Overplot (i.e. plot again) in red the 8 outliers determined by a threshold consisting of the 0.02 quantile of the pdf values 𝑓𝜇,𝛴(𝑥) for each 𝑥 in X.
  Hint: `clf.score_samples(X)` gives log likelihood, so `np.exp(clf.score_samples(X))` gives the required 𝑓𝜇,𝛴(𝑥) values.

```
# ... your code here ...
```

### What characterizes 7 of these 8 outliers? Write your answer in a markdown cell.

```
# ... your English text in a Markdown cell here ...
```

### 2(d) Write a little code to report whether, by the 0.02 quantile criterion, 𝑥= (Exam1=50, Exam2=100) is an outlier.

Hint: compare 𝑓𝜇,𝛴(𝑥) to your threshold from 2(c).

```
# ... your code here ...
```
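A sketch of the quantile-threshold check, again on synthetic stand-in data (the parameters below are invented; your real `X` and fitted `clf` come from 2(a)-2(b)):

```
import numpy as np
from sklearn import mixture

# Synthetic stand-in for the percentage scores; use the real X and clf
# from 2(a)-2(b) in your homework.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[75, 70],
                            cov=[[100, 60], [60, 120]], size=90)
clf = mixture.GaussianMixture(n_components=1).fit(X)

densities = np.exp(clf.score_samples(X))  # f_{mu,Sigma}(x) for each row of X
threshold = np.quantile(densities, 0.02)  # the 0.02-quantile criterion

x_new = np.array([[50.0, 100.0]])  # (Exam1=50, Exam2=100)
f_new = np.exp(clf.score_samples(x_new))[0]
print('outlier' if f_new < threshold else 'not an outlier')
```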

## 3. Explore the fact that accuracy can be misleading for imbalanced data.

Here I make a fake imbalanced data set by randomly sampling 𝑦 from a distribution with 𝑃(𝑦=0)=0.980 and 𝑃(𝑦=1)=0.020.

```
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, weights=[0.980, 0.020],
n_clusters_per_class=1, flip_y=0.01, random_state=0)
print(f'np.bincount(y)={np.bincount(y)}; we expect about 980 zeros and 20 ones.')
print(f'np.mean(y)={np.mean(y)}; we expect the proportion of ones to be about 0.020.')
```

np.bincount(y)=[973 27]; we expect about 980 zeros and 20 ones.
np.mean(y)=0.027; we expect the proportion of ones to be about 0.020.

Here I split the data into 50% training and 50% testing data.

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
random_state=0, stratify=y)
print(f'np.bincount(y_train)={np.bincount(y_train)}')
print(f'np.mean(y_train)={np.mean(y_train)}.')
print(f'np.bincount(y_test)={np.bincount(y_test)}.')
print(f'np.mean(y_test)={np.mean(y_test)}.')
```

np.bincount(y_train)=[486 14]
np.mean(y_train)=0.028.
np.bincount(y_test)=[487 13].
np.mean(y_test)=0.026.

### 3a. Train and assess a gradient boosting model.

- Train on the training data.
- Use 100 trees of maximum depth 1 and learning rate 𝛼=0.25.
- Use `random_state=0` (so that teacher, TAs, and students have a chance of getting the same results).
- Display the accuracy, precision, recall, and AUC on the test data, each in a labeled print statement with 3 decimal places so the reader can easily find each metric.
- Make an ROC curve from your classifier and the test data.

```
# ... your code here ...
```

Note the high accuracy but lousy precision, recall, and AUC.

Note that since the data have about 98% 𝑦=0, we could get about 98% accuracy by just always predicting 𝑦^=0. High accuracy alone is not necessarily helpful.

### 3b. Now oversample the data to get a balanced data set.

- Use `RandomOverSampler(random_state=0)` to oversample and get a balanced data set.
- Repeat my `train_test_split()` block from above.
- Repeat your train/assess block from above.

```
# ... your code here ...
```

Note that we traded a little accuracy for much improved precision, recall, and AUC.

If you do classification in your project and report accuracy, please also report the proportions of 𝑦=0 and 𝑦=1 in your test data so that we get insight into whether your model improves upon always guessing 𝑦^=0 or always guessing 𝑦^=1.