## Description

## Problem 1 High-level descriptions

1.1 Dataset (Same as in Homework 1.)

We will use mnist subset (images of handwritten digits from 0 to 9). The dataset is stored in a JSON-formatted file mnist subset.json. You can access its training, validation, and test splits using the keys 'train', 'valid', and 'test', respectively. For example, suppose we load mnist subset.json into the variable x. Then x['train'] refers to the training set of mnist subset. This set is a list with two elements: x['train'][0] contains the features, of size N (samples) × D (dimension of features), and x['train'][1] contains the corresponding labels, of size N.
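The access pattern above can be sketched as follows. This is only an illustration: the helper name `load_split` is hypothetical, and data loading is already handled by the provided scripts.

```python
import json
import numpy as np


def load_split(path, split):
    """Return (features, labels) for one split of the JSON dataset.

    `split` is one of the keys 'train', 'valid', or 'test'.
    `load_split` is a hypothetical helper, not part of the template code.
    """
    with open(path, "r") as f:
        x = json.load(f)
    features = np.asarray(x[split][0], dtype=float)  # shape (N, D)
    labels = np.asarray(x[split][1])                 # shape (N,)
    return features, labels
```

Calling `load_split(path, 'train')` with the path of the dataset file would return an N × D feature array and a length-N label array.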

Besides, for logistic regression in Sect. 2, you will be using synthetic datasets with two, three and five

classes.

1.2 Tasks

You will be asked to implement binary and multiclass classification (Sect. 2) and neural networks (Sect. 3).

Specifically, you will

• finish the implementation of all Python functions in our template code.

• run your code by calling the specified scripts to generate output files.

• add, commit, and push (1) all *.py files, and (2) all *.json and *.out files that you have amended

or created.

In the next two subsections, we will provide a high-level checklist of what you need to do. You are not

responsible for loading/pre-processing data; we have done that for you. For specific instructions, please

refer to text in Sect. 2 and Sect. 3, as well as corresponding python scripts.

1.2.1 Logistic regression

Coding In logistic.py, finish implementing the following functions: binary train, binary predict,

multinomial train, multinomial predict, ovr train and ovr predict. Refer to logistic.py

and Sect. 2 for more information.

Running your code Run the scripts logistic binary.sh and logistic multiclass.sh after you

finish your implementation. This will output:

• logistic binary.out

• logistic multiclass.out

What to submit Submit logistic.py, logistic binary.out, logistic multiclass.out.

1.2.2 Neural networks

Preparation Read Sect. 3 as well as dnn mlp.py and dnn cnn.py.

Coding First, in dnn misc.py, finish implementing

• forward and backward functions in class linear layer

• forward and backward functions in class relu

• backward function in class dropout (before that, please read forward function).

Refer to dnn misc.py and Sect. 3 for more information.

Second, in dnn cnn 2.py, finish implementing the main function. There are five TODO items. Refer to

dnn cnn 2.py and Sect. 3 for more information.

Running your code Run the scripts q33.sh, q34.sh, q35.sh, q36.sh, q37.sh, q38.sh, q310.sh after

you finish your implementation. This will generate, respectively,

MLP lr0.01 m0.0 w0.0 d0.0.json

MLP lr0.01 m0.0 w0.0 d0.5.json

MLP lr0.01 m0.0 w0.0 d0.95.json

LR lr0.01 m0.0 w0.0 d0.0.json

CNN lr0.01 m0.0 w0.0 d0.5.json

CNN lr0.01 m0.9 w0.0 d0.5.json

CNN2 lr0.001 m0.9 w0.0 d0.5.json

What to submit Submit dnn misc.py, dnn cnn 2.py, and the above seven .json files.

1.3 Cautions

• Do not import packages that are not listed above (See Python Packages section).

• Follow the instructions in each section strictly to code up your solutions.

• DO NOT CHANGE THE OUTPUT FORMAT.

• DO NOT MODIFY THE CODE UNLESS WE INSTRUCT YOU TO DO SO.

• A homework solution that mismatches the provided setup, such as format, name, initializations, etc.,

will not be graded.

• It is your responsibility to make sure that your code runs with Python 3.5.2 in the VM.

1.4 Advice

We use the softmax and sigmoid functions extensively in this homework. To avoid numerical issues such as overflow and underflow caused by numpy.exp() and numpy.log(), please use the following implementations:

• Let x be an input vector to the softmax function. Use $\tilde{x} = x - \max(x)$ instead of using x directly for the softmax function f. That is, if you want to compute $f(x)_i$, compute

$$f(\tilde{x})_i = \frac{\exp(\tilde{x}_i)}{\sum_{j=1}^{D} \exp(\tilde{x}_j)}$$

instead, which is mathematically equivalent but numerically more stable.

• If you are using numpy.log(), make sure the input to the log function is positive. Also, there may be chances that one of the outputs of softmax, e.g. $f(\tilde{x})_i$, is extremely small but you need the value $\ln(f(\tilde{x})_i)$. In this case you should convert the computation equivalently into $\tilde{x}_i - \ln\left(\sum_{j=1}^{D} \exp(\tilde{x}_j)\right)$.

We have implemented and run the code ourselves without problems, so if you follow the instructions

and settings provided in the python files, you should not encounter overflow or underflow.
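The two tricks above can be sketched in NumPy as follows. The function names `softmax` and `log_softmax` are illustrative and are not taken from the template code.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax along the last axis."""
    x_tilde = x - np.max(x, axis=-1, keepdims=True)  # largest entry becomes 0
    e = np.exp(x_tilde)
    return e / np.sum(e, axis=-1, keepdims=True)


def log_softmax(x):
    """ln(softmax(x)) computed as x_tilde - ln(sum_j exp(x_tilde_j))."""
    x_tilde = x - np.max(x, axis=-1, keepdims=True)
    return x_tilde - np.log(np.sum(np.exp(x_tilde), axis=-1, keepdims=True))


x = np.array([1000.0, 1001.0, 1002.0])  # a naive np.exp(x) would overflow here
p = softmax(x)
log_p = log_softmax(x)
```

Because the largest shifted entry is 0, the sum inside the logarithm is at least 1, so the argument of `np.log` is always positive.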

## Problem 2 Logistic Regression (20 Points)

For this assignment you are asked to implement Logistic Regression for binary and multiclass classification.

Q2.1 (6 Points)

In lecture 3 we discussed logistic regression for binary classification. In this problem, you are given a training set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $y_n \in \{0, 1\}$ for all $n = 1, \dots, N$. Important: note that here the binary labels are not −1 or +1 as used in the lecture, so be very careful about applying formulas from the lecture notes.

Your task is to learn the linear model specified by $w^T x + b$ that minimizes the logistic loss. Note that we do not explicitly append the feature 1 to the data, so you need to explicitly learn the bias/intercept term b too.

Specifically you need to implement function binary train in logistic.py which uses gradient

descent (not stochastic gradient descent) to find the optimal parameters (recall logistic regression does not

admit a closed-form solution).

In addition you need to implement function binary predict in logistic.py. We discuss two ways

of making predictions in logistic regression in lecture 4: deterministic prediction or randomized prediction.

Here you need to use the deterministic prediction.
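As a rough illustration of full-batch gradient descent with labels in {0, 1} and an explicit bias, consider the sketch below. It makes several assumptions that may not match the template: the function names, the `step_size` and `max_iterations` arguments, zero initialization, averaging the loss over the N samples, and predicting label 1 when the probability is at least 0.5. Follow logistic.py where it differs.

```python
import numpy as np


def sigmoid(z):
    """Stable sigmoid: never calls np.exp on a large positive number."""
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out


def binary_train_sketch(X, y, step_size=0.5, max_iterations=1000):
    """Gradient descent on the average logistic loss, labels in {0, 1}."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(max_iterations):
        err = sigmoid(X.dot(w) + b) - y      # shape (N,), uses every sample
        w -= step_size * X.T.dot(err) / N    # gradient w.r.t. w
        b -= step_size * np.sum(err) / N     # gradient w.r.t. the bias b
    return w, b


def binary_predict_sketch(X, w, b):
    """Deterministic prediction: label 1 iff sigmoid(w^T x + b) >= 0.5."""
    return (X.dot(w) + b >= 0).astype(int)
```

With labels in {0, 1}, the gradient of the per-sample loss with respect to the score is simply sigmoid(score) minus the label, which is why no ±1 label conversion appears.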

After finishing implementation, please run logistic binary.sh which generates logistic binary.out.

What to submit:

• logistic.py

• logistic binary.out

Q2.2 (7 Points) In the lectures you learned several methods to perform multiclass classification. One of

them was one-versus-rest or one-versus-all approach.

For one-versus-rest classification in a problem with K classes, we need to train K classifiers using a

black-box. Classifier k is trained on a binary problem, where the two labels correspond to belonging or

not belonging to class k. After that, the multiclass prediction is made based on the combination of all

predictions from K binary classifiers.

In this problem you will implement one-versus-rest using binary logistic regression (that you have implemented in Q2.1) as the black-box. Important: the way to predict discussed in the lecture is to randomize

over the classifiers that say “yes”; however, here since binary logistic regression naturally predicts a probability for each class (recall the sigmoid model), we will simply predict the class with the highest probability

(using numpy argmax).
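A minimal sketch of the one-versus-rest scheme described above, assuming zero-based class labels and a black-box `binary_train` callable that returns a weight vector and a bias. The function names and signatures here are hypothetical and may differ from those in logistic.py.

```python
import numpy as np


def ovr_train_sketch(X, y, K, binary_train):
    """Train K binary classifiers; classifier k sees label 1 for class k, else 0."""
    W = np.zeros((K, X.shape[1]))
    b = np.zeros(K)
    for k in range(K):
        W[k], b[k] = binary_train(X, (y == k).astype(float))
    return W, b


def ovr_predict_sketch(X, W, b):
    """Pick the class whose binary classifier reports the highest probability.

    The sigmoid is strictly increasing, so the argmax over the raw scores
    equals the argmax over the probabilities and avoids calling np.exp.
    """
    scores = X.dot(W.T) + b          # shape (N, K)
    return np.argmax(scores, axis=1)
```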

To sum up, you need to complete functions OVR train and OVR predict to perform one-versus-rest

classification. After you finish the implementation, please run the logistic multiclass.sh script, which

will produce logistic multiclass.out.

What to submit: logistic.py and logistic multiclass.out.

Q2.3 (7 Points) Yet another multiclass classification method you learned was multinomial logistic regression.

Complete the functions multinomial train and multinomial predict to perform multinomial

logistic regression, following the same notes as in Q2.1, that is, 1) explicitly learn the bias term; 2) perform

gradient descent instead of stochastic gradient descent; 3) make deterministic predictions.
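The three notes above can be illustrated with the following sketch of multinomial logistic regression. The function names, the `step_size` and `max_iterations` arguments, zero initialization, zero-based integer labels, and averaging the loss over samples are all assumptions for illustration. The template in logistic.py is authoritative.

```python
import numpy as np


def multinomial_train_sketch(X, y, K, step_size=0.5, max_iterations=1000):
    """Full-batch gradient descent on the average multiclass cross-entropy loss."""
    N, D = X.shape
    W, b = np.zeros((K, D)), np.zeros(K)
    Y = np.eye(K)[y]                                  # one-hot labels, shape (N, K)
    for _ in range(max_iterations):
        A = X.dot(W.T) + b                            # scores, shape (N, K)
        A = A - np.max(A, axis=1, keepdims=True)      # stabilise the softmax
        P = np.exp(A)
        P /= np.sum(P, axis=1, keepdims=True)
        G = (P - Y) / N                               # gradient w.r.t. the scores
        W -= step_size * G.T.dot(X)                   # gradient w.r.t. W
        b -= step_size * np.sum(G, axis=0)            # gradient w.r.t. the bias
    return W, b


def multinomial_predict_sketch(X, W, b):
    """Deterministic prediction: the class with the largest score."""
    return np.argmax(X.dot(W.T) + b, axis=1)
```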

After you finish the implementation, please run the logistic multiclass.sh script, which will produce

logistic multiclass.out.

What to submit: logistic.py and logistic multiclass.out.

x (input features) → linear(1) → u → relu → h → linear(2) → a → softmax → z → ŷ (predicted label)

Figure 1: A diagram of a multi-layer perceptron (MLP). The edges mean mathematical operations (modules), and the circles mean variables. The term relu stands for rectified linear units.

## Problem 3 Neural networks: multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) (30 Points)

Background

In recent years, neural networks have been one of the most powerful machine learning models. Many toolboxes/platforms (e.g., TensorFlow, PyTorch, Torch, Theano, MXNet, Caffe, CNTK) are publicly available

for efficiently constructing and training neural networks. The core idea of these toolboxes is to treat a neural

network as a combination of data transformation (or mathematical operation) modules.

For example, in Fig. 1 we provide a diagram of a multi-layer perceptron (MLP, just another term for

fully connected feedforward networks we discussed in the lecture) for a K-class classification problem.

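The edges correspond to modules and the circles correspond to variables. Let $(x \in \mathbb{R}^{D}, y \in \{1, 2, \cdots, K\})$ be a labeled instance. Such an MLP performs the following computations:

$$\text{input features}: \quad x \in \mathbb{R}^{D} \tag{1}$$

$$\text{linear}^{(1)}: \quad u = W^{(1)} x + b^{(1)}, \quad W^{(1)} \in \mathbb{R}^{M \times D} \text{ and } b^{(1)} \in \mathbb{R}^{M} \tag{2}$$

$$\text{relu}: \quad h = \max\{0, u\} = \begin{bmatrix} \max\{0, u_1\} \\ \vdots \\ \max\{0, u_M\} \end{bmatrix} \tag{3}$$

$$\text{linear}^{(2)}: \quad a = W^{(2)} h + b^{(2)}, \quad W^{(2)} \in \mathbb{R}^{K \times M} \text{ and } b^{(2)} \in \mathbb{R}^{K} \tag{4}$$

$$\text{softmax}: \quad z = \begin{bmatrix} \dfrac{e^{a_1}}{\sum_k e^{a_k}} \\ \vdots \\ \dfrac{e^{a_K}}{\sum_k e^{a_k}} \end{bmatrix} \tag{5}$$

$$\text{predicted label}: \quad \hat{y} = \arg\max_k z_k. \tag{6}$$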
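Eqs. (1) to (6) can be traced step by step for a single input vector, as in the sketch below. The function name `mlp_forward`, the random parameters, and the sizes D, M, K are illustrative. Note that the code returns a zero-based class index, whereas the text labels classes from 1 to K.

```python
import numpy as np


def mlp_forward(x, W1, b1, W2, b2):
    """Eqs. (1)-(6) for a single input vector x of dimension D."""
    u = W1.dot(x) + b1                 # linear(1): shape (M,)
    h = np.maximum(0.0, u)             # relu, applied elementwise
    a = W2.dot(h) + b2                 # linear(2): shape (K,)
    e = np.exp(a - np.max(a))          # softmax with the max-shift trick
    z = e / np.sum(e)
    y_hat = int(np.argmax(z))          # zero-based index in {0, ..., K-1}
    return u, h, a, z, y_hat


rng = np.random.RandomState(0)
D, M, K = 4, 3, 5
x = rng.randn(D)
u, h, a, z, y_hat = mlp_forward(
    x, rng.randn(M, D), rng.randn(M), rng.randn(K, M), rng.randn(K)
)
```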

For a K-class classification problem, one popular loss function for training (i.e., to learn $W^{(1)}$, $W^{(2)}$, $b^{(1)}$, $b^{(2)}$) is the cross-entropy loss. Specifically we denote the cross-entropy loss with respect to the training example (x, y) by l:

$$l = -\log(z_y) = \log\left(1 + \sum_{k \neq y} e^{a_k - a_y}\right)$$

Note that one should look at l as a function of the parameters of the network, that is, $W^{(1)}$, $b^{(1)}$, $W^{(2)}$ and $b^{(2)}$. For ease of notation, let us define the one-hot (i.e., 1-of-K) encoding of a class y as

$$y \in \mathbb{R}^{K} \quad \text{and} \quad y_k = \begin{cases} 1, & \text{if } y = k, \\ 0, & \text{otherwise,} \end{cases} \tag{7}$$

so that

$$l = -\sum_k y_k \log z_k = -y^T \begin{bmatrix} \log z_1 \\ \vdots \\ \log z_K \end{bmatrix} = -y^T \log z. \tag{8}$$
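The three forms of the cross-entropy loss given above (the negative log of the true-class probability, the log-sum form in terms of the scores a, and the one-hot form of Eq. (8)) can be checked numerically. The sketch below uses an arbitrary score vector and a zero-based class index. The name `cross_entropy` is illustrative.

```python
import numpy as np


def cross_entropy(a, y):
    """l = -log(z_y) computed from the scores a, with y a zero-based class index."""
    a_tilde = a - np.max(a)
    log_z = a_tilde - np.log(np.sum(np.exp(a_tilde)))  # log-softmax
    return -log_z[y]


a = np.array([2.0, 0.5, -1.0])
y = 0
one_hot = np.eye(len(a))[y]                            # Eq. (7)

l_index = cross_entropy(a, y)                          # -log z_y
# log(1 + sum over k != y of exp(a_k - a_y))
l_lse = np.log(1.0 + np.sum(np.exp(np.delete(a, y) - a[y])))
z = np.exp(a - np.max(a)) / np.sum(np.exp(a - np.max(a)))
l_onehot = -one_hot.dot(np.log(z))                     # -y^T log z, Eq. (8)
```

All three values agree up to floating-point rounding.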

We can then perform error-backpropagation, a way to compute partial derivatives (or gradients) w.r.t.

the parameters of a neural network, and use gradient-based optimization to learn the parameters.

Modules

Now we will provide more information on modules for this assignment. Each module has its own parameters (but note that a module may have no parameters). Moreover, each module can perform a forward

pass and a backward pass.

The forward pass performs the computation of the module, given the input to

the module. The backward pass computes the partial derivatives of the loss function w.r.t. the input and

parameters, given the partial derivatives of the loss function w.r.t. the output of the module. Consider a

module ⟨module name⟩. Let ⟨module name⟩.forward and ⟨module name⟩.backward be its forward and backward passes, respectively.

For example, the linear module may be defined as follows.

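forward pass:

$$u = \text{linear}^{(1)}.\text{forward}(x) = W^{(1)} x + b^{(1)}, \tag{9}$$

where $W^{(1)}$ and $b^{(1)}$ are its parameters.

backward pass:

$$\left[\frac{\partial l}{\partial x}, \frac{\partial l}{\partial W^{(1)}}, \frac{\partial l}{\partial b^{(1)}}\right] = \text{linear}^{(1)}.\text{backward}\left(x, \frac{\partial l}{\partial u}\right). \tag{10}$$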
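A linear module in the spirit of Eqs. (9) and (10) might look like the sketch below. The class name `LinearSketch`, the constructor arguments, the initialization, and the choice to store W with shape M × D while processing a mini-batch as rows of X are assumptions for illustration. The class in dnn misc.py may store parameters and gradients differently.

```python
import numpy as np


class LinearSketch:
    """u = W x + b, applied to each row of a mini-batch X of shape (N, D)."""

    def __init__(self, input_D, output_D, seed=0):
        rng = np.random.RandomState(seed)
        self.W = 0.1 * rng.randn(output_D, input_D)   # W in R^{M x D}
        self.b = np.zeros(output_D)                   # b in R^M

    def forward(self, X):
        return X.dot(self.W.T) + self.b               # shape (N, M)

    def backward(self, X, grad_U):
        """grad_U holds dl/du per row; returns dl/dX, dl/dW, dl/db."""
        grad_X = grad_U.dot(self.W)                   # shape (N, D)
        grad_W = grad_U.T.dot(X)                      # shape (M, D)
        grad_b = np.sum(grad_U, axis=0)               # shape (M,)
        return grad_X, grad_W, grad_b
```

Each returned gradient has the same shape as the quantity it differentiates with respect to, which is a quick sanity check when deriving the backward pass.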

Let us assume that we have implemented all the desired modules. Then, getting $\hat{y}$ for x is equivalent to running the forward pass of each module in order, given x. All the intermediate variables (i.e., u, h, etc.) are computed along the forward pass. Similarly, getting the partial derivatives of the loss function w.r.t. the parameters is equivalent to running the backward pass of each module in reverse order, given $\frac{\partial l}{\partial z}$.
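The ordering described above (forward passes in order, backward passes in reverse) can be sketched with two toy modules. `Scale` is a made-up one-parameter module used only to keep the example short, and the convention that the relu gradient is zero at an input of exactly 0 is an assumption.

```python
import numpy as np


class Scale:
    """Toy module with one parameter c: forward(x) = c * x."""

    def __init__(self, c):
        self.c = c
        self.grad_c = None

    def forward(self, x):
        return self.c * x

    def backward(self, x, grad_out):
        self.grad_c = np.sum(grad_out * x)   # dl/dc
        return self.c * grad_out             # dl/dx


class Relu:
    """Parameter-free module: forward(x) = max(0, x) elementwise."""

    def forward(self, x):
        return np.maximum(0.0, x)

    def backward(self, x, grad_out):
        return grad_out * (x > 0)            # gradient passes only where x > 0


modules = [Scale(2.0), Relu(), Scale(-3.0)]
x = np.array([1.0, -1.0])

# Forward: run the modules in order, remembering the input of each one.
inputs = []
out = x
for m in modules:
    inputs.append(out)
    out = m.forward(out)

# Backward: run the modules in reverse order, starting from dl/d(out).
grad = np.ones_like(out)                     # here the loss is l = sum(out)
for m, m_in in zip(reversed(modules), reversed(inputs)):
    grad = m.backward(m_in, grad)
```

After the loop, `grad` holds the derivative of the loss with respect to x, and each `Scale` module holds the derivative with respect to its own parameter.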

In this question, we provide a Python environment based on the idea of modules. Every module is

defined as a class, so you can create multiple modules of the same functionality by creating multiple object

instances of the same class.

Your work is to finish the implementation of several modules, where these

modules are elements of a multi-layer perceptron (MLP) or a convolutional neural network (CNN). We

will apply these models to the same 10-class classification problem introduced in Sect. 2. We will train

the models using stochastic gradient descent with mini-batch, and explore how different hyperparameters

of optimizers and regularization techniques affect training and validation accuracies over training epochs.

For deeper understanding, check out, e.g., the seminal work of Yann LeCun et al. “Gradient-based learning

applied to document recognition,” written in 1998.

x → linear(1) → relu → dropout → linear(2) → softmax → ŷ

Figure 2: The diagram of the MLP implemented in dnn mlp.py. The circles mean variables and edges mean modules.

We give a specific example below. Suppose that, at iteration t, you sample a mini-batch of N examples $\{(x_i \in \mathbb{R}^{D}, y_i \in \mathbb{R}^{K})\}_{i=1}^{N}$ from the training set (K = 10). Then, the loss of such a mini-batch given by Fig. 1 is

$$
\begin{aligned}
l_{mb} &= \frac{1}{N} \sum_{i=1}^{N} l\big(\text{softmax.forward}(\text{linear}^{(2)}.\text{forward}(\text{relu.forward}(\text{linear}^{(1)}.\text{forward}(x_i)))),\, y_i\big) && (11) \\
&= \frac{1}{N} \sum_{i=1}^{N} l\big(\text{softmax.forward}(\text{linear}^{(2)}.\text{forward}(\text{relu.forward}(u_i))),\, y_i\big) && (12) \\
&= \cdots && (13) \\
&= \frac{1}{N} \sum_{i=1}^{N} l\big(\text{softmax.forward}(a_i),\, y_i\big) && (14) \\
&= -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log z_{ik}. && (15)
\end{aligned}
$$

That is, in the forward pass, we can perform the computation of a certain module to all the N input examples, and then pass the N output examples to the next module. This is the same case for the backward pass.

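For example, according to Fig. 1, given the partial derivatives of the loss w.r.t. $\{a_i\}_{i=1}^{N}$,

$$\frac{\partial l_{mb}}{\partial \{a_i\}_{i=1}^{N}} = \begin{bmatrix} \left(\dfrac{\partial l_{mb}}{\partial a_1}\right)^T \\ \left(\dfrac{\partial l_{mb}}{\partial a_2}\right)^T \\ \vdots \\ \left(\dfrac{\partial l_{mb}}{\partial a_{N-1}}\right)^T \\ \left(\dfrac{\partial l_{mb}}{\partial a_N}\right)^T \end{bmatrix}, \tag{16}$$

$\text{linear}^{(2)}.\text{backward}$ will compute $\frac{\partial l_{mb}}{\partial \{h_i\}_{i=1}^{N}}$ and pass it back to relu.backward.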
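To make the batched view concrete, the sketch below computes the mini-batch cross-entropy loss from a matrix of scores (one row per example) together with the row-stacked gradient of Eq. (16). It relies on the standard result that the gradient of the per-example cross-entropy with respect to the scores is the softmax output minus the one-hot label. The function name is illustrative.

```python
import numpy as np


def minibatch_loss_and_grad(A, Y):
    """A: scores of shape (N, K); Y: one-hot labels of shape (N, K).

    Returns the mean cross-entropy over the mini-batch and its gradient
    w.r.t. A, whose i-th row is (dl_mb / da_i)^T as in Eq. (16).
    """
    N = A.shape[0]
    A_tilde = A - np.max(A, axis=1, keepdims=True)
    log_Z = A_tilde - np.log(np.sum(np.exp(A_tilde), axis=1, keepdims=True))
    loss = -np.sum(Y * log_Z) / N
    grad_A = (np.exp(log_Z) - Y) / N         # (softmax(a_i) - y_i) / N per row
    return loss, grad_A
```

The resulting `grad_A` is what a linear module's backward pass would receive in order to produce the gradients with respect to its input and parameters.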

Preparation

Q3.1 Please read through dnn mlp.py and dnn cnn.py. Both files will use modules defined in dnn misc.py

(which you will modify). Your work is to understand how modules are created, how they are linked to

perform the forward and backward passes, and how parameters are updated based on gradients (and momentum). The architectures of the MLP and CNN defined in dnn mlp.py and dnn cnn.py are shown in

Fig. 2 and Fig. 3, respectively.

What to submit: Nothing.

Coding: Modules

x → convolution → relu → max pooling → flatten → dropout → linear → softmax → ŷ

Figure 3: The diagram of the CNN implemented in dnn cnn.py. The circles correspond to variables and edges correspond to modules. Note that the input to CNN may not be a vector (e.g., in dnn cnn.py it is an image, which can be represented as a 3-dimensional tensor). The flatten layer reshapes its input into a vector.

Q3.2 (14 Points) You will modify dnn misc.py. This script defines all modules that you will need to

construct the MLP and CNN in dnn mlp.py and dnn cnn.py, respectively. You have three tasks. First,

finish the implementation of forward and backward functions in class linear layer. Please follow

Eqn. (2) for the forward pass and derive the partial derivatives accordingly.

Second, finish the implementation of forward and backward functions in class relu. Please follow Eqn. (3) for the forward pass and derive the partial derivatives accordingly. Third, finish the implementation of the backward function in class dropout. We define the forward and the backward passes as follows.

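forward pass:

$$s = \text{dropout.forward}(q \in \mathbb{R}^{J}) = \frac{1}{1 - r} \times \begin{bmatrix} \mathbb{1}[p_1 \ge r] \times q_1 \\ \vdots \\ \mathbb{1}[p_J \ge r] \times q_J \end{bmatrix}, \tag{17}$$

where $p_j$ is sampled uniformly from $[0, 1)$, $\forall j \in \{1, \cdots, J\}$, and $r \in [0, 1)$ is a pre-defined scalar named dropout rate. (18)

backward pass:

$$\frac{\partial l}{\partial q} = \text{dropout.backward}\left(q, \frac{\partial l}{\partial s}\right) = \frac{1}{1 - r} \times \begin{bmatrix} \mathbb{1}[p_1 \ge r] \times \dfrac{\partial l}{\partial s_1} \\ \vdots \\ \mathbb{1}[p_J \ge r] \times \dfrac{\partial l}{\partial s_J} \end{bmatrix}. \tag{19}$$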

Note that $p_j$, $j \in \{1, \cdots, J\}$, and r are not learned, so we do not need to compute the derivatives w.r.t. them. Moreover, $p_j$, $j \in \{1, \cdots, J\}$, are re-sampled every forward pass, and are kept for the following backward pass. The dropout rate r is set to 0 during testing.
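Eqs. (17) to (19) might be sketched as follows. The class name `DropoutSketch`, the `is_train` flag, and the seeded random generator are assumptions for illustration. The dropout class in dnn misc.py defines its own interface, and its forward function is already provided.

```python
import numpy as np


class DropoutSketch:
    """Dropout with 1 / (1 - r) rescaling, following Eqs. (17)-(19)."""

    def __init__(self, r, seed=0):
        self.r = r
        self.rng = np.random.RandomState(seed)
        self.mask = None

    def forward(self, q, is_train=True):
        r = self.r if is_train else 0.0               # rate is 0 at test time
        p = self.rng.uniform(0.0, 1.0, size=q.shape)  # p_j in [0, 1), re-sampled each call
        self.mask = (p >= r) / (1.0 - r)              # kept for the backward pass
        return self.mask * q

    def backward(self, q, grad_s):
        return self.mask * grad_s                     # reuse the forward mask
```

With r = 0 every p_j satisfies the keep condition and the scaling factor is 1, so the module reduces to the identity, which matches the test-time behaviour described above.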

Detailed descriptions/instructions about each pass (i.e., what to compute and what to return) are included in dnn misc.py. Please do read carefully.

Note that in this script we do import numpy as np. Thus, to call a function XX from numpy, please

use np.XX.

What to do and submit: Finish the implementation of 5 functions specified above in dnn misc.py. Submit your completed dnn misc.py. We do provide a checking code hw2 dnn check.py to check your

implementation.

Testing dnn misc.py with multi-layer perceptron (MLP)

Q3.3 (2 Points) What to do and submit: run script q33.sh. It will output MLP lr0.01 m0.0 w0.0 d0.0.json.

Add, commit, and push this file before the due date.

What it does: q33.sh will run python3 dnn mlp.py with learning rate 0.01, no momentum, no weight

decay, and dropout rate 0.0. The output file stores the training and validation accuracies over 30 training

epochs.

Q3.4 (2 Points) What to do and submit: run script q34.sh. It will output MLP lr0.01 m0.0 w0.0 d0.5.json.

Add, commit, and push this file before the due date.

What it does: q34.sh will run python3 dnn mlp.py –dropout rate 0.5 with learning rate 0.01,

no momentum, no weight decay, and dropout rate 0.5. The output file stores the training and validation

accuracies over 30 training epochs.

Q3.5 (2 Points) What to do and submit: run script q35.sh. It will output MLP lr0.01 m0.0 w0.0 d0.95.json.

Add, commit, and push this file before the due date.

What it does: q35.sh will run python3 dnn mlp.py –dropout rate 0.95 with learning rate 0.01,

no momentum, no weight decay, and dropout rate 0.95. The output file stores the training and validation

accuracies over 30 training epochs.

You will observe that the model in Q3.4 will give better validation accuracy (at epoch 30) compared to

Q3.3. Specifically, dropout is widely used to prevent over-fitting. However, if we use too large a dropout rate (like the one in Q3.5), the validation accuracy (together with the training accuracy) will be relatively lower, essentially under-fitting the training data.

Q3.6 (2 Points) What to do and submit: run script q36.sh. It will output LR lr0.01 m0.0 w0.0 d0.0.json.

Add, commit, and push this file before the due date.

What it does: q36.sh will run python3 dnn mlp nononlinear.py with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.0. The output file stores the training and validation accuracies

over 30 training epochs.

The network has the same structure as the one in Q3.3, except that we remove the relu (nonlinear) layer.

You will see that the validation accuracies drop significantly (the gap is around 0.03). Essentially, without

the nonlinear layer, the model is learning multinomial logistic regression similar to Q2.3.
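The reason removing relu collapses the network can be checked numerically: two stacked linear layers with nothing nonlinear in between are equivalent to one linear layer with merged parameters. The sizes and random parameters below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
D, M, K = 6, 4, 3
W1, b1 = rng.randn(M, D), rng.randn(M)
W2, b2 = rng.randn(K, M), rng.randn(K)
x = rng.randn(D)

# Two stacked linear layers with no relu in between ...
a_stacked = W2.dot(W1.dot(x) + b1) + b2

# ... equal a single linear layer with merged parameters.
W, b = W2.dot(W1), W2.dot(b1) + b2
a_single = W.dot(x) + b
```

Feeding such scores into softmax gives a model of the same form as multinomial logistic regression, although the merged weight matrix has rank at most M.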

Testing dnn misc.py with convolutional neural networks (CNN)

Q3.7 (2 Points) What to do and submit: run script q37.sh. It will output CNN lr0.01 m0.0 w0.0 d0.5.json.

Add, commit, and push this file before the due date.

What it does: q37.sh will run python3 dnn cnn.py with learning rate 0.01, no momentum, no weight

decay, and dropout rate 0.5. The output file stores the training and validation accuracies over 30 training

epochs.

Q3.8 (2 Points) What to do and submit: run script q38.sh. It will output CNN lr0.01 m0.9 w0.0 d0.5.json.

Add, commit, and push this file before the due date.

What it does: q38.sh will run python3 dnn cnn.py –alpha 0.9 with learning rate 0.01, momentum

0.9, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies over

30 training epochs.

You will see that Q3.8 will lead to faster convergence than Q3.7 (i.e., the training/validation accuracies

will be higher than 0.94 after 1 epoch). That is, using momentum will lead to more stable updates of the

parameters.
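The handout does not spell out the momentum update, so the sketch below assumes the classical form (a velocity that accumulates past gradients, scaled by a coefficient alpha) and applies it to a toy quadratic rather than to the CNN. The function names and the test problem are hypothetical, and the update in the provided scripts may differ in detail.

```python
import numpy as np


def sgd_momentum_step(w, v, grad, learning_rate=0.01, alpha=0.9):
    """One classical-momentum update: v <- alpha * v - lr * grad; w <- w + v."""
    v = alpha * v - learning_rate * grad
    return w + v, v


def run(alpha, steps=200):
    """Minimise f(w) = 0.5 * (w_0^2 + 100 * w_1^2) starting from (1, 1)."""
    h = np.array([1.0, 100.0])               # curvature along each coordinate
    w, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        w, v = sgd_momentum_step(w, v, h * w, alpha=alpha)
    return 0.5 * np.sum(h * w ** 2)


loss_plain = run(alpha=0.0)      # alpha = 0 recovers plain gradient descent
loss_momentum = run(alpha=0.9)
```

On this toy problem the run with alpha = 0.9 ends at a lower loss than the run with alpha = 0.0 after the same number of steps, because the accumulated velocity speeds up progress along the low-curvature direction.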

## Coding: Building a deeper architecture

Q3.9 (2 Points) The CNN architecture in dnn cnn.py has only one convolutional layer. In this question,

you are going to construct a two-convolutional-layer CNN (see Fig. 4) using the modules you implemented

in Q3.2. Please modify the main function in dnn cnn 2.py. The code in dnn cnn 2.py is similar to that

in dnn cnn.py, except that there are a few parts marked as TODO. You need to fill in your code so as to

construct the CNN in Fig. 4.

x → conv → relu → max-p → conv → relu → max-p → flatten → dropout → linear → softmax → ŷ

Figure 4: The diagram of the CNN you are going to implement in dnn cnn 2.py. The term conv stands for convolution; max-p stands for max pooling. The circles correspond to variables and edges correspond to modules. Note that the input to CNN may not be a vector (e.g., in dnn cnn 2.py it is an image, which can be represented as a 3-dimensional tensor). The flatten layer reshapes its input into a vector.

What to do and submit: Finish the implementation of the main function in dnn cnn 2.py (search for TODO

in main). Submit your completed dnn cnn 2.py.

Testing dnn cnn 2.py

Q3.10 (2 Points) What to do and submit: run script q310.sh. It will output CNN2 lr0.001 m0.9 w0.0 d0.5.json.

Add, commit, and push this file before the due date.

What it does: q310.sh will run python3 dnn cnn 2.py –alpha 0.9 with learning rate 0.01, momentum 0.9, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies

over 30 training epochs.

You will see that you can achieve slightly higher validation accuracies than those in Q3.8.
