## Description

P1. Use the MNIST dataset and build a binary classifier to detect 3 versus 5. You can assume 3 is the positive class and 5 is the negative class. Note that, for training, you only need a subset of the original training dataset, namely the examples corresponding to 3s and 5s. Similarly, for testing, you will use the subset of the original test dataset that corresponds to 3s and 5s.

For this part, use the SGDClassifier with the default hyperparameters unless mentioned otherwise.
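As a sketch of the data setup, the subsetting might look like the following. This uses scikit-learn's small built-in `load_digits` dataset as a stand-in for MNIST so it runs quickly; for the actual assignment you would load the real data with `fetch_openml("mnist_784", version=1)`.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# Stand-in for MNIST; for the real assignment use
# fetch_openml("mnist_784", version=1) instead.
digits = load_digits()
X, y = digits.data, digits.target

# Keep only the 3s and 5s; 3 is the positive class.
mask = (y == 3) | (y == 5)
X_35 = X[mask]
y_35 = (y[mask] == 3)  # True = 3 (positive), False = 5 (negative)

clf = SGDClassifier(random_state=42)  # default hyperparameters otherwise
clf.fit(X_35, y_35)
print(X_35.shape, int(y_35.sum()))
```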

(a) Use cross_val_score() to show the accuracy of prediction under cross validation.
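A minimal sketch of part (a), again with `load_digits` standing in for the MNIST 3-vs-5 subset:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Small stand-in for the MNIST 3-vs-5 subset.
digits = load_digits()
mask = (digits.target == 3) | (digits.target == 5)
X, y = digits.data[mask], digits.target[mask] == 3  # True = positive class (3)

# 3-fold cross-validated accuracy of the default SGDClassifier.
scores = cross_val_score(SGDClassifier(random_state=42), X, y,
                         cv=3, scoring="accuracy")
print(scores, scores.mean())
```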

(b) Use cross_val_predict() to generate predictions on the training data. Then, generate the following:

• The confusion matrix

• The precision score

• The recall score

• The F1 score
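Part (b) could be sketched as follows (with `load_digits` as the stand-in dataset): `cross_val_predict()` gives out-of-fold predictions, which then feed the four metrics.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

digits = load_digits()
mask = (digits.target == 3) | (digits.target == 5)
X, y = digits.data[mask], digits.target[mask] == 3

# Out-of-fold predictions on the training data.
y_pred = cross_val_predict(SGDClassifier(random_state=42), X, y, cv=3)

cm = confusion_matrix(y, y_pred)
print(cm)
print(precision_score(y, y_pred), recall_score(y, y_pred), f1_score(y, y_pred))
```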

(c) Use cross_val_predict() to generate the prediction scores on the training set. Then, plot the precision and recall curves as functions of the threshold value.
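For part (c), passing `method="decision_function"` to `cross_val_predict()` yields scores instead of hard predictions; a sketch (with `load_digits` as the stand-in):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

digits = load_digits()
mask = (digits.target == 3) | (digits.target == 5)
X, y = digits.data[mask], digits.target[mask] == 3

# Decision-function scores rather than hard predictions.
y_scores = cross_val_predict(SGDClassifier(random_state=42), X, y, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y, y_scores)

# precisions/recalls have one more entry than thresholds; drop the last point.
plt.plot(thresholds, precisions[:-1], label="precision")
plt.plot(thresholds, recalls[:-1], label="recall")
plt.xlabel("threshold")
plt.legend()
plt.savefig("pr_vs_threshold.png")
```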

(d) Based on the curves, what would be a sensible threshold value to choose? Generate predictions under the chosen threshold value, and evaluate the precision and recall scores of those predictions.
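One way to pick a threshold for part (d) is a target-precision rule; the 90% target below is an illustrative choice, not part of the assignment, and `load_digits` again stands in for MNIST.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

digits = load_digits()
mask = (digits.target == 3) | (digits.target == 5)
X, y = digits.data[mask], digits.target[mask] == 3

y_scores = cross_val_predict(SGDClassifier(random_state=42), X, y, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y, y_scores)

# Illustrative rule: the lowest threshold that still reaches 90% precision.
idx = np.argmax(precisions[:-1] >= 0.90)
threshold = thresholds[idx]

# Predict positive whenever the score clears the chosen threshold.
y_pred = y_scores >= threshold
print(precision_score(y, y_pred), recall_score(y, y_pred))
```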

(e) Plot the ROC curve and evaluate the ROC AUC score.
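Part (e) might look like this sketch, reusing the cross-validated decision scores (`load_digits` as the stand-in dataset):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, roc_auc_score

digits = load_digits()
mask = (digits.target == 3) | (digits.target == 5)
X, y = digits.data[mask], digits.target[mask] == 3

y_scores = cross_val_predict(SGDClassifier(random_state=42), X, y, cv=3,
                             method="decision_function")
fpr, tpr, _ = roc_curve(y, y_scores)
auc = roc_auc_score(y, y_scores)

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], "k--")  # diagonal = random classifier
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.savefig("roc.png")
print(auc)
```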

(f) Try the RandomForestClassifier. Plot the ROC curve and evaluate the ROC AUC score.

(g) Repeat part (f) with feature scaling using StandardScaler().
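Parts (f) and (g) can be sketched together. RandomForestClassifier has no decision_function(), so the positive-class probability serves as the score; putting StandardScaler() in a pipeline ensures the scaler is re-fit on each training fold. `load_digits` again stands in for MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
mask = (digits.target == 3) | (digits.target == 5)
X, y = digits.data[mask], digits.target[mask] == 3

# No decision_function on forests: use the positive-class probability.
forest = RandomForestClassifier(random_state=42)
y_proba = cross_val_predict(forest, X, y, cv=3, method="predict_proba")
print("unscaled AUC:", roc_auc_score(y, y_proba[:, 1]))

# Part (g): the same model with feature scaling inside a pipeline,
# so StandardScaler is fit only on each fold's training portion.
scaled_forest = make_pipeline(StandardScaler(),
                              RandomForestClassifier(random_state=42))
y_proba_scaled = cross_val_predict(scaled_forest, X, y, cv=3,
                                   method="predict_proba")
print("scaled AUC:", roc_auc_score(y, y_proba_scaled[:, 1]))
```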

P2. Build a multiclass classifier that distinguishes three classes: 3, 5, and others (i.e., neither 3 nor 5). Do this by training three binary classifiers: one that distinguishes between 3 and 5, one that distinguishes between 3 and others, and one that distinguishes between 5 and others. Use the SGDClassifier for each of the binary classifiers.

For prediction, given the image of a digit, count the number of duels won as follows:

• Assume the digit is 3. Pass it to the 3-vs-5 classifier and the 3-vs-others classifier, and count the number of wins by 3.

• Assume the digit is 5. Pass it to the 3-vs-5 classifier and the 5-vs-others classifier, and count the number of wins by 5.

• Assume the digit is ‘others’. Pass it to the 3-vs-others classifier and the 5-vs-others classifier, and count the number of wins by ‘others’.

Whichever assumption gives the most wins becomes the predicted class for the input. If there is a tie, break the tie randomly.
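The duel scheme above can be sketched as follows; each binary classifier is trained only on the two classes it must separate, and the helper names (`fit_duel`, `predict_duels`) are made up for this sketch. `load_digits` stands in for MNIST.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

digits = load_digits()
X = digits.data
# Map the ten digit labels to the three classes of this problem.
y3 = np.where(digits.target == 3, "3",
              np.where(digits.target == 5, "5", "others"))

rng = np.random.default_rng(42)

def fit_duel(pos, neg):
    """Train one binary duel on only the two classes it separates."""
    mask = (y3 == pos) | (y3 == neg)
    return SGDClassifier(random_state=42).fit(X[mask], y3[mask])

duel_35 = fit_duel("3", "5")
duel_3o = fit_duel("3", "others")
duel_5o = fit_duel("5", "others")

def predict_duels(Xq):
    p35, p3o, p5o = duel_35.predict(Xq), duel_3o.predict(Xq), duel_5o.predict(Xq)
    preds = []
    for a, b, c in zip(p35, p3o, p5o):
        # Count wins under each of the three assumptions.
        wins = {"3": (a == "3") + (b == "3"),
                "5": (a == "5") + (c == "5"),
                "others": (b == "others") + (c == "others")}
        best = max(wins.values())
        winners = [k for k, v in wins.items() if v == best]
        preds.append(rng.choice(winners))  # break ties randomly
    return np.array(preds)

pred = predict_duels(X[:20])
print(pred)
```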

P3. Use the KNeighborsClassifier, which has built-in support for multiclass classification, to classify all 10 digits. Try to build a classifier that achieves over 97% accuracy on the test set. The KNeighborsClassifier works quite well for this task if you find the right hyperparameters. Use a grid search on the weights and n_neighbors hyperparameters.

Once you find a good set of hyperparameters, conduct error analysis on the training dataset. In particular, find the confusion matrix, display it as an image using matshow(), and discuss the kinds of errors that your model makes.

Please submit a PDF file that contains your code and results, along with the Jupyter Notebook if you use one.

Hint: You can build on the code that we discussed during the lectures, which can be downloaded from the GitHub page: https://github.com/ageron/handson-ml2. Most of the code for P3 is related to Exercise 1.

P4. See the separate PDF file ‘A4-P4.pdf’.