## Description

Code:

• For this assignment you may use the methods in sklearn.tree , sklearn.linear_model , sklearn.svm , sklearn.metrics and sklearn.neighbors .

• When using methods from sklearn.linear_model and sklearn.svm , after training them you may call them via decision_function() only. Do not use predict , score or predict_proba .

• You may also use sklearn.model_selection.KFold — but not any other methods in

sklearn.model_selection .

5/16/2021 3 Classification & Model Selection

https://www.notion.so/justindomke/3-Classification-Model-Selection-5202709ce7ca4440ba6e40df61cd455b 3/12

• If the assignment asks you to implement a particular function, you are expected to

implement it yourself. If you find that the function is implemented somewhere

within sklearn or np but not specifically banned above, your implementation

should not consist of a call to that function.


Preliminaries

Dataset

In this assignment you are given a set of 32×32 RGB images. There are four possible labels. Your goal will be to train a predictor to recognize what is in the image. Here are the first few elements of the training data, shown as images:

You are given a file data.npz (30013.1 KB).

The data can be loaded as follows:

```python
import numpy as np

stuff = np.load("data.npz")
X_trn = stuff["X_trn"]
y_trn = stuff["y_trn"]
X_tst = stuff["X_tst"]  # no y_tst!
```

There are a total of 6000 training examples and 1200 test examples, each with 3072 dimensions. Those dimensions correspond to 32×32×3 RGB images (32*32*3=3072). If you like, you can plot an example with the following code:

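```python
from matplotlib import pyplot as plt

def show(x):
    # reshape the flat vector to (3, 32, 32), then move the channel axis last for imshow
    img = x.reshape((3, 32, 32)).transpose(1, 2, 0)
    plt.imshow(img)
    plt.axis('off')
    plt.draw()
    plt.pause(0.01)

show(X_trn[7])
```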


Kaggle

Kaggle is a very popular platform for creating and running machine learning competitions. It allows the creators of competitions to evaluate submissions against a secret test set, ensuring that competitors cannot “cheat” by fine-tuning their model against the test set.

The makers of Kaggle also created a set of features for creating an “InClass” competition, perfect for classes such as 589! We created a competition for Assignment 3 in which you are required to participate. However, don’t worry! It’s not truly a “competition” as much as it is a way to automatically evaluate your submissions and to familiarize you with the Kaggle platform. Again, to make it easier for us to grade, please create a Kaggle account using your umass.edu email address.

The “competition” has two leaderboards: A public leaderboard and private leaderboard.

The test set is split into two sets: A “public” set containing about 30% of the data and a

“private” set containing the remaining 70%. In a normal competition, you can see how

well your submission is performing against the public set. In theory, one could use

brute force to find all of the correct answers. For this reason, in most competitions

submissions are scored against the “private” set. This assignment will, at various points,

ask you to report the performance of various solutions according to the public

leaderboard.

To submit solutions to Kaggle, you will be required to submit a .csv file with two columns: an “Id” column and a “Category” column containing the integer prediction for each element in X_tst . A sample solution using randomly predicted outputs can be generated as follows:

```python
import csv

import numpy as np

def write_csv(y_pred, filename):
    """Write a 1d numpy array to a Kaggle-compatible .csv file"""
    with open(filename, 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(['Id', 'Category'])
        for idx, y in enumerate(y_pred):
            csv_writer.writerow([idx, y])

data = np.load('data.npz')
X_tst = data['X_tst']
# random predictions over the four labels 0..3 (randint's upper bound is exclusive)
y_pred = np.random.randint(0, 4, size=len(X_tst))
write_csv(y_pred, 'sample_predictions.csv')
```

You can use the write_csv helper function in your code if you find it helpful to ensure

that your solution is in the correct format.


Note that the leaderboard shows accuracy, whereas the assignment in some places asks for classification error. These are related by

Classification Error = 1 − Accuracy,

so it is easy to translate between the two.

Simple Classifiers

Question 1 (5 points) Take a very small dataset with four scalar inputs:

$$
\begin{aligned}
x^{(1)} &= 1.0 \\
x^{(2)} &= 2.0 \\
x^{(3)} &= 3.0 \\
x^{(4)} &= 4.0
\end{aligned}
$$

There are two possible labels, as shown below:

$$
\begin{aligned}
y^{(1)} &= 1 \\
y^{(2)} &= 0 \\
y^{(3)} &= 1 \\
y^{(4)} &= 1
\end{aligned}
$$

For each of the following split points, what is the information gain? Show your work.

• Split at x = 0.5

• Split at x = 1.5

• Split at x = 2.5

• Split at x = 3.5

• Split at x = 4.5
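As an illustration of the quantity being asked for, here is a minimal sketch of an information-gain computation for a threshold split on one scalar feature. It assumes entropy measured in bits (log base 2) and that points with x ≤ threshold go left; the course may use a different logarithm base or impurity measure. The names `entropy`, `information_gain`, `x_toy`, and `y_toy` are hypothetical, and the toy data is deliberately not the data from Question 1.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0  # convention: an empty side contributes no entropy
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(x, y, threshold):
    """Entropy of y minus the size-weighted entropy of the two sides of the split."""
    x, y = np.asarray(x), np.asarray(y)
    left, right = y[x <= threshold], y[x > threshold]
    weighted = (left.size * entropy(left) + right.size * entropy(right)) / y.size
    return entropy(y) - weighted

# Hypothetical toy data (assumption: not the data from Question 1).
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = np.array([0, 0, 1, 1])
print(information_gain(x_toy, y_toy, 1.5))   # split that separates the classes
print(information_gain(x_toy, y_toy, -1.0))  # split with every point on one side
```

A split that leaves all points on one side changes nothing, so its gain is zero; a split that separates the classes perfectly recovers the full entropy of the labels.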


Question 2 (5 points) Consider a classification tree with a maximum depth of M, trained on data with D dimensions. What is the time complexity to evaluate that classification tree on a single new input? Give an answer (something like “order of log(M) D”) and explain in at most 3 sentences why your answer is correct.

Question 3 (5 points) Take a dataset with N elements, each with D dimensions. What is the time complexity to train a classification stump? Give an answer and explain why it’s correct in at most 3 sentences.

Question 4 (6 points) Train 6 different classification trees on the image data, with each of the following maximum depths: {1,3,6,9,12,14}. (Do not apply any other restriction when growing the tree.) Using 5-fold cross-validation, estimate the mean out-of-sample (generalization) classification error, and report this as a table. You should have one row for each possible depth and one number, which is the mean estimated error.
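One possible shape for the cross-validation loop is sketched below, using only sklearn.tree and sklearn.model_selection.KFold, which the rules above permit. The helper name `cv_error`, the shuffling with a fixed seed, and the synthetic stand-in data are assumptions made for illustration; swap in X_trn and y_trn and the depths listed in the question.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def cv_error(X, y, max_depth, n_splits=5, seed=0):
    """Mean 0-1 error of a depth-limited tree over the held-out folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_errors = []
    for trn_idx, val_idx in kf.split(X):
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=seed)
        tree.fit(X[trn_idx], y[trn_idx])
        y_hat = tree.predict(X[val_idx])
        fold_errors.append(np.mean(y_hat != y[val_idx]))
    return float(np.mean(fold_errors))

# Synthetic stand-in data (assumption): the label depends only on feature 0.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)
for depth in (1, 3):
    print(depth, cv_error(X_demo, y_demo, depth))
```

Each example is held out exactly once, so the mean over folds estimates the error on data the tree did not see during training.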

Question 5 (6 points) What depth performs best in the previous question? Using that

depth, make predictions on the test data, and upload your predictions to Kaggle. For

this question, you need to report:


• What depth you chose.

• What was your estimated generalization error using 5-fold cross-validation.

• What accuracy you observed on the public part of the leaderboard.

Question 6 (5 points) Consider a dataset with N elements, each with D dimensions. What is the time complexity to evaluate a K-nearest neighbors classifier? Give an answer and explain why it’s correct in at most 3 sentences.

Question 7 (6 points) Do nearest-neighbor prediction for each of the following possible

values of K: {1, 3, 5, 7, 9, 11}. Using 5-fold cross-validation, estimate the out of sample

classification error, and report this as a table. (Warning: This question might take a

significant amount of computational time. You may consider using the n_jobs option.)



Question 8 (6 points) What K performs best in the previous question? Using that K,

make predictions on the test data, and upload your predictions to Kaggle. Report:

• What value K you chose.

• What was your estimated generalization error using 5-fold cross validation.

• What accuracy you observed on the public part of the leaderboard.

Question 9 (10 points) For both hinge loss and logistic loss, train linear models with ridge regularization. That is, find w to minimize

$$\sum_{n=1}^{N} L\left(y^{(n)}, w^\top x^{(n)}\right) + \lambda \|w\|^2,$$

where L is the loss. For each loss and each of the regularization constants $\lambda \in \{10^{-4}, 10^{-2}, 1, 10, 100\}$, train a model and estimate the mean out-of-sample loss/error using 5-fold cross-validation. Organize your errors as a 5×2 table, with one row for each value of λ and one column for each training loss.

Give 3 tables: one where you estimate 0-1 classification error, one where you estimate

logistic loss, and one where you estimate hinge loss. (You will report a total of 30

numbers.)


(Hint: You should be aware of sklearn.svm.LinearSVC . Again, you are not permitted to

use predict() or predict_proba() . But decision_function() is OK.)

(Hint: There has been some confusion about this question. To clarify, you have 10

different training methods, corresponding to each combination of regularization

constant (5 options) and training loss (2 options). For each of these training methods,

you should estimate the generalization error using 5-fold cross-validation. But you

should estimate that generalization error in three ways, for 0-1, logistic, and hinge loss.

Since there are 10 training methods and 3 measures of generalization error, you report

a total of 30 numbers. That is all that you report for this question.)
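Since predict() is off limits for these models, class predictions have to be derived from the scores that decision_function() returns. A minimal sketch of the 0-1 error part follows. It assumes the labels are the integers 0..K−1 in the same order as the score columns (or 0/1 in the binary case, where a single signed score per example is returned); the helper names and the example scores are hypothetical, not real model output.

```python
import numpy as np

def predict_from_scores(scores):
    """Turn decision_function-style scores into integer class predictions."""
    scores = np.asarray(scores)
    if scores.ndim == 1:
        # Binary case: one signed score per example, positive means class 1.
        return (scores > 0).astype(int)
    # Multi-class case: one column per class, pick the highest score.
    return np.argmax(scores, axis=1)

def zero_one_error(scores, y):
    """Fraction of examples whose predicted class differs from the label."""
    return float(np.mean(predict_from_scores(scores) != np.asarray(y)))

# Hypothetical scores for 3 examples and 4 classes (assumption: not real output).
scores_demo = np.array([[2.0, 0.0, -1.0, 0.5],
                        [0.0, 1.0, 3.0, 0.0],
                        [0.1, 0.0, 0.0, 0.9]])
y_demo = np.array([0, 1, 3])
print(zero_one_error(scores_demo, y_demo))
```

The same held-out scores can then be fed into whichever logistic-loss and hinge-loss formulas the course specifies, so each trained model needs only one decision_function() call per fold.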

Question 10 (6 points) Choose the training loss and λ that you think will perform best on the public leaderboard. Make predictions for the test data and upload your predictions to Kaggle. Report:


1. What training loss and λ you chose

2. What was your estimated generalization error using 5-fold cross validation.

3. What accuracy you observed on the public part of the leaderboard.

Neural Networks

You will train several neural networks, each with a single hidden layer. These neural

networks can be written as

f(x) = c + Vσ(b + Wx).

Here:

• x is the input, a vector of length D

• W is a matrix of size M × D that maps input features to a hidden space

• b is the bias term for the hidden layer, a vector of length M

• σ(a) = tanh(a) is the activation function. You will need that the derivative is $\frac{d\sigma(a)}{da} = 1 - \tanh(a)^2$.

• V is a matrix of size O × M that maps the hidden space to the output space

• c is the bias term for the output space, a vector of length O

Note that $f(x) : \mathbb{R}^D \to \mathbb{R}^O$ is a function that maps a vector to a vector. We will refer to the i-th component of the output as $f(x)_i$.

For this problem, we will use the logistic loss, defined as

$$L(y, f) = -f_y + \log \sum_{i=0}^{3} \exp(f_i),$$

where $y \in \{0, 1, 2, 3\}$ is the label for the input $x$, and $f \in \mathbb{R}^O$ is the output vector. Note that $f_y$ is therefore the $y$-th component of the output vector $f$. Also be careful to note that here, we are indexing from 0 instead of 1.
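One practical point about this loss: evaluating log Σ exp(f_i) literally can overflow when an output is large. A common remedy, sketched here as an assumption rather than a requirement of the assignment, is to subtract the maximum output before exponentiating, which leaves the value unchanged. The names `log_sum_exp`, `logistic_loss`, and `f_demo` are hypothetical.

```python
import numpy as np

def log_sum_exp(f):
    """log(sum_i exp(f_i)), computed after subtracting max(f) to avoid overflow."""
    f = np.asarray(f, dtype=float)
    m = np.max(f)
    return float(m + np.log(np.sum(np.exp(f - m))))

def logistic_loss(y, f):
    """L(y, f) = -f_y + log sum_i exp(f_i), with the label y indexed from 0."""
    return float(-f[y] + log_sum_exp(f))

f_demo = np.array([1000.0, 0.0, -5.0, 2.0])  # hypothetical output vector
print(logistic_loss(0, f_demo))  # stays finite even though exp(1000.0) would overflow
```

The identity used is log Σ exp(f_i) = m + log Σ exp(f_i − m) for any constant m; choosing m = max(f) keeps every exponent at or below zero.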

Question 11 (5 points) Write a function to evaluate the neural network and loss. Your

function should have the following signature:


```python
def prediction_loss(x, y, W, V, b, c):
    # do stuff here
    return L
```

This should return a scalar. Give your function directly in your report.

Question 12 (10 points) Write a function to evaluate the gradient of the neural network.

Your function should have the following signature. Do not use any packages outside of

numpy.

```python
def prediction_grad(x, y, W, V, b, c):
    # do stuff here
    return dLdW, dLdV, dLdb, dLdc
```

Each returned array should be the same size as the corresponding input, and contain the corresponding gradient. So, for example, dLdW is the derivative $\nabla_W L(y, f(x))$. Give your function directly in your report.

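Question 13 (10 points) Take the following inputs, where there are 3 hidden units and 2 outputs (y = 0 or y = 1):

$$
\begin{aligned}
x &= [1, 2] \\
y &= 1 \\
W &= \begin{pmatrix} 0.5 & -1 \\ -0.5 & 1 \\ 1 & 0.5 \end{pmatrix} \\
V &= \begin{pmatrix} -1 & -1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \\
b &= [0, 0, 0] \\
c &= [0, 0]
\end{aligned}
$$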

Run the function from the previous question to compute the gradient with respect to

W, V , b, and c. Give the results directly in your report, organized as you see above.


Autograd

The next several questions use the autograd toolbox, which you can install via pip

install autograd . A short demo of autograd can be found here:


Autograd Demo

Question 14 (5 points) Write a function to evaluate the same gradient as in Question 12

using the autograd toolbox (Hint: You will need to import the NumPy wrapper, import

autograd.numpy as np , and the grad high-order function, from autograd import grad ).


```python
def prediction_grad_autograd(x, y, W, V, b, c):
    # do stuff here
    return dLdW, dLdV, dLdb, dLdc
```

Give your function directly in your report. (You do not need to give any outputs from

your function, but it is suggested to check the results against Q13 since if they are

different, one must be wrong!)
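If the two gradients disagree and it is unclear which is wrong, a third independent check is a finite-difference approximation. The sketch below is a generic central-difference helper in plain NumPy, demonstrated on a toy function with a known gradient; the name `numerical_grad`, the step size `eps`, and the toy inputs are assumptions for illustration, not part of the assignment.

```python
import numpy as np

def numerical_grad(fun, A, eps=1e-6):
    """Central-difference estimate of d fun / dA, with the same shape as A."""
    A = np.array(A, dtype=float)  # work on a float copy of the input
    G = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        up = fun(A)
        A[idx] = old - eps
        down = fun(A)
        A[idx] = old  # restore the entry before moving on
        G[idx] = (up - down) / (2 * eps)
    return G

# Toy check with a known answer: d/dA sum(tanh(A)) = 1 - tanh(A)^2.
A0 = np.array([[0.5, -1.0], [2.0, 0.0]])
approx = numerical_grad(lambda A: np.sum(np.tanh(A)), A0)
exact = 1.0 - np.tanh(A0) ** 2
print(np.max(np.abs(approx - exact)))
```

To check dLdW, for instance, one could call `numerical_grad(lambda W_: prediction_loss(x, y, W_, V, b, c), W)` and compare the result entry by entry; this is slow, so it is only practical on small inputs such as those in Question 13.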

Question 15 (5 points) Update your function from Question 11. Instead of taking a single input x and a single output y, take a 2D array of inputs X (where the first dimension indexes the different examples) and a 1D array of outputs Y. Also, take a regularization constant λ and apply squared regularization to W and V. Do not regularize b or c. Your function should return the sum of the logistic losses for each example in the dataset, plus the regularizer loss applied to W and V. (To be explicit, the regularizer could be written as $\lambda \left( \sum_{m,i} W_{mi}^2 + \sum_{v,m} V_{vm}^2 \right)$.)

```python
def prediction_loss_full(X, Y, W, V, b, c, λ):
    # do stuff here
    return L  # include regularization
```

Question 16 (5 points) Update your gradient function to work on a full dataset and

include regularization, as in the previous question. Again, you should use autograd.


```python
def prediction_grad_full(X, Y, W, V, b, c, λ):
    # do stuff here
    return dLdW, dLdV, dLdb, dLdc
```

Question 17 (15 points) Here is pseudo-code to optimize a function h(w) by gradient descent with momentum.

```
ave_grad = 0
for iter = 1, 2, …, max_iters:
    ave_grad = (1 - momentum) * ave_grad + momentum * ∇h(w)
    w = w - stepsize * ave_grad
```
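A direct NumPy transcription of this update rule, run on a toy quadratic rather than on the neural network, might look like the sketch below. The function name `gd_momentum`, the toy objective, and the step size used in the demo are assumptions chosen so the example converges quickly; they are not the settings required by this question.

```python
import numpy as np

def gd_momentum(grad_h, w0, stepsize, momentum, max_iters):
    """Gradient descent with momentum, following the pseudo-code's update exactly."""
    w = np.array(w0, dtype=float)
    ave_grad = np.zeros_like(w)
    for _ in range(max_iters):
        # running average of gradients, weighted as in the pseudo-code
        ave_grad = (1 - momentum) * ave_grad + momentum * grad_h(w)
        w = w - stepsize * ave_grad
    return w

# Toy objective (assumption): h(w) = 0.5 * ||w - target||^2, so grad h(w) = w - target.
target = np.array([1.0, -2.0, 3.0])
w_final = gd_momentum(lambda w: w - target, np.zeros(3),
                      stepsize=0.1, momentum=0.1, max_iters=500)
print(w_final)
```

For the network, w would stand for all of W, V, b, and c together, so one running average per parameter array is needed.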

For each size of the hidden layer, M ∈ {5, 40, 70}, train your neural network on the main data for this homework. Weights for layers W and V should be initialized by sampling from $\mathcal{N}(0,1)/\sqrt{D}$, where $\mathcal{N}(0,1)$ is the standard normal distribution and D is the number of input dimensions for that layer. Weights for b and c should be initialized as zeros.
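A minimal sketch of this initialization is given below. It assumes the intended scaling divides standard normal samples by the square root of each layer's input dimension (a common convention; treat it as an assumption and check it against the course materials). The function name `init_params` and the fixed seed are hypothetical.

```python
import numpy as np

def init_params(D, M, O, seed=0):
    """Scaled-normal weights and zero biases for a one-hidden-layer network."""
    rng = np.random.default_rng(seed)
    # Assumption: scale by 1/sqrt(number of inputs to the layer).
    W = rng.standard_normal((M, D)) / np.sqrt(D)  # W takes D inputs
    V = rng.standard_normal((O, M)) / np.sqrt(M)  # V takes M inputs
    b = np.zeros(M)
    c = np.zeros(O)
    return W, V, b, c

W, V, b, c = init_params(D=3072, M=5, O=4)
print(W.shape, V.shape, b.shape, c.shape)
```

Fixing the seed makes it possible to reuse exactly the same initial weights across runs, which is convenient when comparing training curves for different M.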

Use gradient descent with momentum, with 1000 iterations, a step size of 0.0001, a momentum of 0.1, and λ = 1.

Report the following:

1. For each value of M, what is the total training time (in ms) for all iterations? (Give a table with 3 entries.)

2. Make a plot of the training objective (regularized loss) as a function of iterations. This should be a single plot with 3 curves, one for each value of M. Include the plot in your report.

Question 18 (10 points) Make a single train-validation split of the data with 50% used for training and 50% for validation. Train your neural network using the parameters above for each value of M and give the estimated generalization error. Again, use the same initial weights, generated using the scheme above. Then, retrain your network on all the data, make predictions for the Kaggle data, and upload to Kaggle. Report your accuracy on the public leaderboard. Report:


• What value of M you chose.

• What accuracy you expected.

• What accuracy you observed on the leaderboard.