# CSCI 567 Homework #2 Programming Assignments solution


## Problem 1 High-level descriptions

1.1 Dataset (Same as in Homework 1)

We will use mnist subset (images of handwritten digits from 0 to 9). The dataset is stored in a JSON-formatted file mnist subset.json. You can access its training, validation, and test splits using the keys 'train', 'valid', and 'test', respectively. For example, suppose we load mnist subset.json to the variable x. Then, x['train'] refers to the training set of mnist subset. This set is a list with two elements: x['train'][0] containing the features of size N (samples) × D (dimension of features), and x['train'][1] containing the corresponding labels of size N.
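The layout described above can be illustrated with a tiny in-memory stand-in. This is only a sketch: the miniature dictionary below is made up for illustration and merely mimics the described structure (three splits, each a list of features and labels). With the real file you would read the JSON from disk instead of building it by hand.

```python
import json

import numpy as np

# Hypothetical miniature stand-in for the dataset file: each split is a list
# [features, labels], with features of size N x D and labels of size N.
raw = json.dumps({
    "train": [[[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]], [0, 1, 1]],
    "valid": [[[0.2, 0.8]], [0]],
    "test": [[[0.9, 0.1]], [1]],
})

x = json.loads(raw)
X_train = np.array(x["train"][0])  # features, shape (N, D)
y_train = np.array(x["train"][1])  # labels, shape (N,)
```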

Besides, for logistic regression in Sect. 2, you will be using synthetic datasets with two, three and five
classes.

You will be asked to implement binary and multiclass classification (Sect. 2) and neural networks (Sect. 3).

Specifically, you will
• finish the implementation of all python functions in our template codes.
• run your code by calling the specified scripts to generate output files.
• add, commit, and push (1) all *.py files, and (2) all *.json and *.out files that you have amended
or created.

In the next two subsections, we will provide a high-level checklist of what you need to do. For details, refer to the text in Sect. 2 and Sect. 3, as well as the corresponding python scripts.

1.2.1 Logistic regression

Coding: In logistic.py, finish implementing the following functions: binary train, binary predict, multinomial train, multinomial predict, ovr train and ovr predict. Refer to logistic.py.

Running your code: Run the scripts logistic binary.sh and logistic multiclass.sh after you finish your implementation. This will output:
• logistic binary.out
• logistic multiclass.out

What to submit: Submit logistic.py, logistic binary.out, logistic multiclass.out.

1.2.2 Neural networks

Preparation: Read Sect. 3 as well as dnn mlp.py and dnn cnn.py.

Coding: First, in dnn misc.py, finish implementing
• forward and backward functions in class linear layer
• forward and backward functions in class relu
• backward function in class dropout (before that, please read forward function).

Second, in dnn cnn 2.py, finish implementing the main function. There are five TODO items.

Running your code: Run the scripts q33.sh, q34.sh, q35.sh, q36.sh, q37.sh, q38.sh, q310.sh after you finish your implementation. This will generate, respectively,
• MLP lr0.01 m0.0 w0.0 d0.0.json
• MLP lr0.01 m0.0 w0.0 d0.5.json
• MLP lr0.01 m0.0 w0.0 d0.95.json
• LR lr0.01 m0.0 w0.0 d0.0.json
• CNN lr0.01 m0.0 w0.0 d0.5.json
• CNN lr0.01 m0.9 w0.0 d0.5.json
• CNN2 lr0.001 m0.9 w0.0 d0.5.json

What to submit: Submit dnn misc.py, dnn cnn 2.py, and the above seven .json files.

1.3 Cautions
• Do not import packages that are not listed above (See Python Packages section).
• Follow the instructions in each section strictly to code up your solutions.
• DO NOT CHANGE THE OUTPUT FORMAT.
• DO NOT MODIFY THE CODE UNLESS WE INSTRUCT YOU TO DO SO.

• A homework solution that mismatches the provided setup, such as format, name, initializations, etc.,
• It is your responsibility to make sure that your code runs with Python 3.5.2 in the VM.
1.4 Advice

We are extensively using the softmax and sigmoid functions in this homework. To avoid numerical issues such as overflow and underflow caused by numpy.exp() and numpy.log(), please use the following implementations:

• Let $x$ be an input vector to the softmax function. Use $\tilde{x} = x - \max(x)$ instead of using $x$ directly for the softmax function $f$. That is, if you want to compute $f(x)_i$, compute

$$f(\tilde{x})_i = \frac{\exp(\tilde{x}_i)}{\sum_{j=1}^{D} \exp(\tilde{x}_j)},$$

which is clearly mathematically equivalent but numerically more stable.

• If you are using numpy.log(), make sure the input to the log function is positive. Also, there is a chance that one of the outputs of softmax, e.g. $f(\tilde{x})_i$, is extremely small but you need the value $\ln(f(\tilde{x})_i)$. In this case you should convert the computation equivalently into $\tilde{x}_i - \ln\left(\sum_{j=1}^{D} \exp(\tilde{x}_j)\right)$.

We have implemented and run the code ourselves without problems, so if you follow the instructions
and settings provided in the python files, you should not encounter overflow or underflow.
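
The two pieces of advice above can be sketched as follows. This is a minimal illustration, not the template's code: the function names `stable_softmax` and `stable_log_softmax` are hypothetical, and the input values are chosen only to show a case where a naive `np.exp(x)` would overflow.

```python
import numpy as np

def stable_softmax(x):
    # Shift by the maximum so that np.exp only sees non-positive values.
    x_tilde = x - np.max(x)
    e = np.exp(x_tilde)
    return e / np.sum(e)

def stable_log_softmax(x):
    # ln f(x)_i computed as x_tilde_i - ln(sum_j exp(x_tilde_j)),
    # which avoids taking the log of a value that underflowed to zero.
    x_tilde = x - np.max(x)
    return x_tilde - np.log(np.sum(np.exp(x_tilde)))

x_big = np.array([1000.0, 1001.0, 1002.0])  # np.exp(x_big) would overflow
p = stable_softmax(x_big)
log_p = stable_log_softmax(x_big)
```

The sum inside the logarithm is at least 1 after the shift (the largest entry contributes exp(0)), so the argument of `np.log` stays positive.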

## Problem 2 Logistic Regression (20 Points)

For this assignment you are asked to implement Logistic Regression for binary and multiclass classification.

Q2.1 (6 Points)
In lecture 3 we discussed logistic regression for binary classification. In this problem, you are given a training set $D = \{(x_n, y_n)\}_{n=1}^{N}$, where $y_i \in \{0, 1\}\ \forall i = 1 \ldots N$. Important: note that here the binary labels are not −1 or +1 as used in the lecture, so be very careful about applying formulas from the lecture notes.

Your task is to learn a model $w^T x + b$ that minimizes the logistic loss. Note that we do not explicitly append the feature 1 to the data, so you need to explicitly learn the bias/intercept term b too.

Specifically you need to implement function binary train in logistic.py which uses gradient descent (not stochastic gradient descent) to find the optimal parameters (recall logistic regression does not admit a closed-form solution).

In addition you need to implement function binary predict in logistic.py. We discuss two ways
of making predictions in logistic regression in lecture 4: deterministic prediction or randomized prediction.
Here you need to use the deterministic prediction.

After finishing implementation, please run logistic binary.sh which generates logistic binary.out.
What to submit:
• logistic.py
• logistic binary.out
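
As an illustration of Q2.1, here is a minimal sketch of full-batch gradient descent for binary logistic regression with labels in {0, 1} and an explicit bias. The function names, the default step size and iteration count, the zero initialization, and the averaging of the loss over the N samples are all assumptions made for this sketch; follow the signatures and settings in logistic.py for the actual assignment.

```python
import numpy as np

def sigmoid(z):
    # Same value as 1 / (1 + exp(-z)), written with tanh to avoid overflow.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def binary_train_sketch(X, y, step_size=0.1, max_iterations=1000):
    # Full-batch gradient descent on the average logistic loss (assumed form).
    N, D = X.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(max_iterations):
        err = sigmoid(X.dot(w) + b) - y      # shape (N,), uses labels in {0, 1}
        w -= step_size * X.T.dot(err) / N    # gradient w.r.t. w
        b -= step_size * err.mean()          # gradient w.r.t. the bias b
    return w, b

def binary_predict_sketch(X, w, b):
    # Deterministic prediction: class 1 iff sigmoid(w^T x + b) >= 0.5,
    # i.e. iff w^T x + b >= 0.
    return (X.dot(w) + b >= 0).astype(int)

# Toy, linearly separable 1-D data (made up for illustration).
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([0, 0, 1, 1])
w, b = binary_train_sketch(X_toy, y_toy)
pred = binary_predict_sketch(X_toy, w, b)
```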

Q2.2 (7 Points) In the lectures you learned several methods to perform multiclass classification. One of
them was one-versus-rest or one-versus-all approach.

For one-versus-rest classification in a problem with K classes, we need to train K classifiers using a black-box. Classifier k is trained on a binary problem, where the two labels correspond to belonging or not belonging to class k. After that, the multiclass prediction is made based on the combination of all predictions from the K binary classifiers.

In this problem you will implement one-versus-rest using binary logistic regression (that you have implemented in Q2.1) as the black-box. Important: the way to predict discussed in the lecture is to randomize over the classifiers that say "yes"; however, here since binary logistic regression naturally predicts a probability for each class (recall the sigmoid model), we will simply predict the class with the highest probability (using numpy argmax).

To sum up, you need to complete functions OVR train and OVR predict to perform one-versus-rest classification. After you finish the implementation, please run the logistic multiclass.sh script, which will produce logistic multiclass.out.
What to submit: logistic.py and logistic multiclass.out.
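
The one-versus-rest scheme described in Q2.2 might be sketched as below. Everything named here is hypothetical (`train_binary`, `ovr_train_sketch`, `ovr_predict_sketch`, the step size, the toy data); the binary trainer is a compact stand-in for whatever you implemented in Q2.1, and classes are assumed to be numbered 0 to K−1.

```python
import numpy as np

def sigmoid(z):
    # Overflow-free form of 1 / (1 + exp(-z)).
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def train_binary(X, y, step_size=0.1, max_iterations=1000):
    # Compact stand-in for a binary logistic regression trainer (labels in {0, 1}).
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_iterations):
        err = sigmoid(X.dot(w) + b) - y
        w -= step_size * X.T.dot(err) / X.shape[0]
        b -= step_size * err.mean()
    return w, b

def ovr_train_sketch(X, y, K):
    # Train one "class k versus the rest" classifier per class.
    W = np.zeros((K, X.shape[1]))
    b = np.zeros(K)
    for k in range(K):
        W[k], b[k] = train_binary(X, (y == k).astype(float))
    return W, b

def ovr_predict_sketch(X, W, b):
    # Pick the class whose binary classifier reports the highest probability.
    return np.argmax(sigmoid(X.dot(W.T) + b), axis=1)

# Toy data: three well-separated groups in 2-D (made up for illustration).
X_toy = np.array([[0.0, 4.0], [0.5, 4.5], [-4.0, -3.0],
                  [-4.5, -2.5], [4.0, -3.0], [4.5, -2.5]])
y_toy = np.array([0, 0, 1, 1, 2, 2])
W, b = ovr_train_sketch(X_toy, y_toy, K=3)
pred = ovr_predict_sketch(X_toy, W, b)
```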

Q2.3 (7 Points) Yet another multiclass classification method you learned was multinomial logistic regression.

Complete the functions multinomial train and multinomial predict to perform multinomial logistic regression, following the same notes as in Q2.1, that is, 1) explicitly learn the bias term; 2) perform gradient descent instead of stochastic gradient descent. After you finish the implementation, please run the logistic multiclass.sh script, which will produce logistic multiclass.out.

What to submit: logistic.py and logistic multiclass.out.
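
For comparison with one-versus-rest, a minimal sketch of multinomial logistic regression trained by full-batch gradient descent is given below. The function names, the zero initialization, the step size, the averaging of the loss, and the toy data are assumptions for illustration only; classes are assumed to be numbered 0 to K−1.

```python
import numpy as np

def softmax_rows(A):
    # Row-wise softmax with the max-shift trick for numerical stability.
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def multinomial_train_sketch(X, y, K, step_size=0.05, max_iterations=1000):
    N, D = X.shape
    W = np.zeros((K, D))
    b = np.zeros(K)
    Y = np.eye(K)[y]                            # one-hot labels, shape (N, K)
    for _ in range(max_iterations):
        G = softmax_rows(X.dot(W.T) + b) - Y    # gradient w.r.t. the scores
        W -= step_size * G.T.dot(X) / N
        b -= step_size * G.mean(axis=0)         # bias is learned explicitly
    return W, b

def multinomial_predict_sketch(X, W, b):
    # The class with the largest score also has the largest softmax probability.
    return np.argmax(X.dot(W.T) + b, axis=1)

# Toy data: three well-separated groups in 2-D (made up for illustration).
X_toy = np.array([[0.0, 4.0], [0.5, 4.5], [-4.0, -3.0],
                  [-4.5, -2.5], [4.0, -3.0], [4.5, -2.5]])
y_toy = np.array([0, 0, 1, 1, 2, 2])
W, b = multinomial_train_sketch(X_toy, y_toy, K=3)
pred = multinomial_predict_sketch(X_toy, W, b)
```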
[Figure 1 diagram: x (input features) → linear(1) → u → relu → h → linear(2) → a → softmax → z → ŷ (predicted label)]

Figure 1: A diagram of a multi-layer perceptron (MLP). The edges mean mathematical operations (modules), and the circles mean variables. The term relu stands for rectified linear units.

## Problem 3 Neural networks: multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs) (30 Points)

Background
In recent years, neural networks have been one of the most powerful machine learning models. Many toolboxes/platforms (e.g., TensorFlow, PyTorch, Torch, Theano, MXNet, Caffe, CNTK) are publicly available
for efficiently constructing and training neural networks. The core idea of these toolboxes is to treat a neural
network as a combination of data transformation (or mathematical operation) modules.

For example, in Fig. 1 we provide a diagram of a multi-layer perceptron (MLP, just another term for
fully connected feedforward networks we discussed in the lecture) for a K-class classification problem.

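The edges correspond to modules and the circles correspond to variables. Let $(x \in \mathbb{R}^D, y \in \{1, 2, \cdots, K\})$ be a labeled instance. Such an MLP performs the following computations:

$$\text{input features}: \quad x \in \mathbb{R}^D \tag{1}$$

$$\text{linear}^{(1)}: \quad u = W^{(1)} x + b^{(1)}, \quad W^{(1)} \in \mathbb{R}^{M \times D} \text{ and } b^{(1)} \in \mathbb{R}^{M} \tag{2}$$

$$\text{relu}: \quad h = \max\{0, u\} = \begin{bmatrix} \max\{0, u_1\} \\ \vdots \\ \max\{0, u_M\} \end{bmatrix} \tag{3}$$

$$\text{linear}^{(2)}: \quad a = W^{(2)} h + b^{(2)}, \quad W^{(2)} \in \mathbb{R}^{K \times M} \text{ and } b^{(2)} \in \mathbb{R}^{K} \tag{4}$$

$$\text{softmax}: \quad z = \begin{bmatrix} \dfrac{e^{a_1}}{\sum_k e^{a_k}} \\ \vdots \\ \dfrac{e^{a_K}}{\sum_k e^{a_k}} \end{bmatrix} \tag{5}$$

$$\text{predicted label}: \quad \hat{y} = \arg\max_k z_k. \tag{6}$$
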
For a K-class classification problem, one popular loss function for training (i.e., to learn $W^{(1)}$, $W^{(2)}$, $b^{(1)}$, $b^{(2)}$) is the cross-entropy loss. Specifically we denote the cross-entropy loss with respect to the training example $(x, y)$ by $l$:

$$l = -\log(z_y) = \log\left(1 + \sum_{k \neq y} e^{a_k - a_y}\right)$$

Note that one should look at $l$ as a function of the parameters of the network, that is, $W^{(1)}$, $b^{(1)}$, $W^{(2)}$ and $b^{(2)}$. For ease of notation, let us define the one-hot (i.e., 1-of-K) encoding of a class $y$ as

$$\mathbf{y} \in \mathbb{R}^{K} \quad \text{and} \quad y_k = \begin{cases} 1, & \text{if } y = k, \\ 0, & \text{otherwise,} \end{cases} \tag{7}$$

so that

$$l = -\sum_k y_k \log z_k = -\mathbf{y}^T \begin{bmatrix} \log z_1 \\ \vdots \\ \log z_K \end{bmatrix} = -\mathbf{y}^T \log z. \tag{8}$$

We can then perform error-backpropagation, a way to compute partial derivatives (or gradients) w.r.t
the parameters of a neural network, and use gradient-based optimization to learn the parameters.
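
To make Eqns. (1) to (8) concrete, the following sketch runs the forward computation for a single example and evaluates the cross-entropy loss in its three equivalent forms. The dimensions and the randomly drawn parameters are arbitrary choices for illustration, and classes are indexed from 0 here (the text uses 1 to K).

```python
import numpy as np

rng = np.random.RandomState(0)
D, M, K = 4, 5, 3                  # arbitrary sizes for illustration
x = rng.randn(D)
y = 2                              # true class, 0-based in this sketch
W1, b1 = rng.randn(M, D), rng.randn(M)
W2, b2 = rng.randn(K, M), rng.randn(K)

u = W1.dot(x) + b1                 # Eqn. (2)
h = np.maximum(0, u)               # Eqn. (3)
a = W2.dot(h) + b2                 # Eqn. (4)
z = np.exp(a - a.max())            # Eqn. (5), with the max-shift trick
z /= z.sum()
y_hat = int(np.argmax(z))          # Eqn. (6)

# Three equivalent ways of writing the cross-entropy loss.
loss_a = -np.log(z[y])
loss_b = np.log(1.0 + sum(np.exp(a[k] - a[y]) for k in range(K) if k != y))
one_hot = np.eye(K)[y]             # Eqn. (7)
loss_c = -one_hot.dot(np.log(z))   # Eqn. (8)
```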

Modules
Now we will provide more information on modules for this assignment. Each module has its own parameters (but note that a module may have no parameters). Moreover, each module can perform a forward
pass and a backward pass.

The forward pass performs the computation of the module, given the input to
the module. The backward pass computes the partial derivatives of the loss function w.r.t. the input and
parameters, given the partial derivatives of the loss function w.r.t. the output of the module. Consider a module ⟨module name⟩. Let ⟨module name⟩.forward and ⟨module name⟩.backward be its forward and backward passes, respectively.

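For example, the linear module may be defined as follows.

forward pass:

$$u = \text{linear}^{(1)}.\text{forward}(x) = W^{(1)} x + b^{(1)}, \tag{9}$$

where $W^{(1)}$ and $b^{(1)}$ are its parameters.

backward pass:

$$\left[\frac{\partial l}{\partial x}, \frac{\partial l}{\partial W^{(1)}}, \frac{\partial l}{\partial b^{(1)}}\right] = \text{linear}^{(1)}.\text{backward}\left(x, \frac{\partial l}{\partial u}\right). \tag{10}$$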
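
A linear module in the spirit of Eqns. (9) and (10) might look like the sketch below. The class name `LinearSketch`, its attribute names, and the initialization are hypothetical and do not reflect the interface in dnn misc.py. The sketch stores W with shape M × D as in Eqn. (2) and processes a mini-batch whose rows are examples, so the forward pass is written as X Wᵀ + b.

```python
import numpy as np

class LinearSketch:
    def __init__(self, input_dim, output_dim, seed=0):
        rng = np.random.RandomState(seed)
        self.W = 0.1 * rng.randn(output_dim, input_dim)  # shape (M, D)
        self.b = np.zeros(output_dim)                    # shape (M,)
        self.grad_W = np.zeros_like(self.W)
        self.grad_b = np.zeros_like(self.b)

    def forward(self, X):
        # X has shape (N, D); each row u_i = W x_i + b.
        return X.dot(self.W.T) + self.b

    def backward(self, X, grad_U):
        # grad_U holds dl/du for every example, shape (N, M).
        self.grad_W = grad_U.T.dot(X)      # dl/dW, shape (M, D)
        self.grad_b = grad_U.sum(axis=0)   # dl/db, shape (M,)
        return grad_U.dot(self.W)          # dl/dx, shape (N, D)

layer = LinearSketch(input_dim=3, output_dim=2)
X = np.arange(12, dtype=float).reshape(4, 3)
U = layer.forward(X)
grad_X = layer.backward(X, np.ones_like(U))
```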

Let us assume that we have implemented all the desired modules. Then, getting $\hat{y}$ for $x$ is equivalent to running the forward pass of each module in order, given $x$. All the intermediate variables (i.e., u, h, etc.) will be computed along the forward pass. Similarly, getting the partial derivatives of the loss function w.r.t. the parameters is equivalent to running the backward pass of each module in reverse order, given $\frac{\partial l}{\partial z}$.

In this question, we provide a Python environment based on the idea of modules. Every module is
defined as a class, so you can create multiple modules of the same functionality by creating multiple object
instances of the same class.

Your work is to finish the implementation of several modules, where these
modules are elements of a multi-layer perceptron (MLP) or a convolutional neural network (CNN). We
will apply these models to the same 10-class classification problem introduced in Sect. 2. We will train
the models using stochastic gradient descent with mini-batch, and explore how different hyperparameters
of optimizers and regularization techniques affect training and validation accuracies over training epochs.

For deeper understanding, check out, e.g., the seminal work of Yann LeCun et al. “Gradient-based learning
applied to document recognition,” written in 1998.

We give a specific example below. Suppose that, at iteration t, you sample a mini-batch of N examples $\{(x_i \in \mathbb{R}^D, y_i \in \mathbb{R}^K)\}_{i=1}^{N}$ from the training set (K = 10). Then, the loss of such a mini-batch given by Fig. 1 is

$$
\begin{aligned}
l_{mb} &= \frac{1}{N} \sum_{i=1}^{N} l(\text{softmax.forward}(\text{linear}^{(2)}.\text{forward}(\text{relu.forward}(\text{linear}^{(1)}.\text{forward}(x_i)))), y_i) && (11) \\
&= \frac{1}{N} \sum_{i=1}^{N} l(\text{softmax.forward}(\text{linear}^{(2)}.\text{forward}(\text{relu.forward}(u_i))), y_i) && (12) \\
&= \cdots && (13) \\
&= \frac{1}{N} \sum_{i=1}^{N} l(\text{softmax.forward}(a_i), y_i) && (14) \\
&= -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log z_{ik}. && (15)
\end{aligned}
$$

[Figure 2 diagram: x → linear(1) → relu → dropout → linear(2) → softmax → ŷ]

Figure 2: The diagram of the MLP implemented in dnn mlp.py. The circles mean variables and edges mean modules.

That is, in the forward pass, we can perform the computation of a certain module to all the N input examples, and then pass the N output examples to the next module. This is the same case for the backward pass.
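For example, according to Fig. 1, given the partial derivatives of the loss w.r.t. $\{a_i\}_{i=1}^{N}$,

$$\frac{\partial l_{mb}}{\partial \{a_i\}_{i=1}^{N}} = \begin{bmatrix} \left(\dfrac{\partial l_{mb}}{\partial a_1}\right)^T \\ \left(\dfrac{\partial l_{mb}}{\partial a_2}\right)^T \\ \vdots \\ \left(\dfrac{\partial l_{mb}}{\partial a_{N-1}}\right)^T \\ \left(\dfrac{\partial l_{mb}}{\partial a_N}\right)^T \end{bmatrix}, \tag{16}$$

$\text{linear}^{(2)}$.backward will compute $\dfrac{\partial l_{mb}}{\partial \{h_i\}_{i=1}^{N}}$ and pass it back to relu.backward.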
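
As a sketch of how the stacked gradient in Eqn. (16) can be obtained, the function below computes the mini-batch cross-entropy loss from the scores and the gradient with respect to every row of scores at once. It relies on the standard result that, for softmax followed by cross-entropy, the gradient with respect to $a_i$ is $(z_i - y_i)/N$. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(A, Y):
    # A: scores, shape (N, K); Y: one-hot labels, shape (N, K).
    N = A.shape[0]
    A_shift = A - A.max(axis=1, keepdims=True)
    log_Z = A_shift - np.log(np.exp(A_shift).sum(axis=1, keepdims=True))
    loss = -np.sum(Y * log_Z) / N          # mini-batch cross-entropy loss
    grad_A = (np.exp(log_Z) - Y) / N       # row i is (dl_mb / da_i)^T
    return loss, grad_A

rng = np.random.RandomState(0)
A = rng.randn(4, 3)
Y = np.eye(3)[[0, 2, 1, 2]]
loss, grad_A = softmax_cross_entropy(A, Y)
```

The returned `grad_A` has one row per example, matching the row layout of Eqn. (16), and would be the input to the backward pass of the last linear module.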

Preparation
Q3.1 Please read through dnn mlp.py and dnn cnn.py. Both files will use modules defined in dnn misc.py
(which you will modify). Your work is to understand how modules are created, how they are linked to
perform the forward and backward passes, and how parameters are updated based on gradients (and momentum). The architectures of the MLP and CNN defined in dnn mlp.py and dnn cnn.py are shown in
Fig. 2 and Fig. 3, respectively.
What to submit: Nothing.
Coding: Modules
[Figure 3 diagram: x → convolution → relu → max pooling → flatten → dropout → linear → softmax → ŷ]

Figure 3: The diagram of the CNN implemented in dnn cnn.py. The circles correspond to variables and edges correspond to modules. Note that the input to CNN may not be a vector (e.g., in dnn cnn.py it is an image, which can be represented as a 3-dimensional tensor). The flatten layer is to reshape its input into a vector.

Q3.2 (14 Points) You will modify dnn misc.py. This script defines all modules that you will need to
construct the MLP and CNN in dnn mlp.py and dnn cnn.py, respectively. You have three tasks. First,
finish the implementation of forward and backward functions in class linear layer. Please follow
Eqn. (2) for the forward pass and derive the partial derivatives accordingly.

Second, finish the implementation of forward and backward functions in class relu. Please follow Eqn. (3) for the forward pass and derive the partial derivatives accordingly. Third, finish the implementation of backward function in class dropout. We define the forward and the backward passes as follows.

forward pass:

$$s = \text{dropout.forward}(q \in \mathbb{R}^{J}) = \frac{1}{1 - r} \times \begin{bmatrix} \mathbf{1}[p_1 \geq r] \times q_1 \\ \vdots \\ \mathbf{1}[p_J \geq r] \times q_J \end{bmatrix}, \tag{17}$$

$$\text{where } p_j \text{ is sampled uniformly from } [0, 1),\ \forall j \in \{1, \cdots, J\}, \text{ and } r \in [0, 1) \text{ is a pre-defined scalar named dropout rate.} \tag{18}$$

backward pass:

$$\frac{\partial l}{\partial q} = \text{dropout.backward}\left(q, \frac{\partial l}{\partial s}\right) = \frac{1}{1 - r} \times \begin{bmatrix} \mathbf{1}[p_1 \geq r] \times \dfrac{\partial l}{\partial s_1} \\ \vdots \\ \mathbf{1}[p_J \geq r] \times \dfrac{\partial l}{\partial s_J} \end{bmatrix}. \tag{19}$$

Note that $p_j$, $j \in \{1, \cdots, J\}$, and $r$ are not learned, so we do not need to compute the derivatives w.r.t. them. Moreover, $p_j$, $j \in \{1, \cdots, J\}$, are re-sampled every forward pass, and are kept for the following backward pass. The dropout rate $r$ is set to 0 during testing.

Detailed descriptions/instructions about each pass (i.e., what to compute and what to return) are included in dnn misc.py. Please do read carefully.
Note that in this script we do import numpy as np. Thus, to call a function XX from numpy, please
use np.XX.
What to do and submit: Finish the implementation of 5 functions specified above in dnn misc.py. Submit your completed dnn misc.py. We do provide a checking code hw2 dnn check.py to check your
implementation.
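
The relu and dropout passes in Eqns. (3), (17) and (19) can be sketched as below. The class names, the `is_train` flag, and the seeding are hypothetical and are not the interface of dnn misc.py; the subgradient of relu at exactly 0 is taken to be 0 here, which is a common convention rather than something the text specifies.

```python
import numpy as np

class ReluSketch:
    def forward(self, X):
        return np.maximum(0, X)                 # Eqn. (3), element-wise

    def backward(self, X, grad):
        # The gradient passes through only where the input was positive.
        return grad * (X > 0)

class DropoutSketch:
    def __init__(self, r, seed=0):
        self.r = r                              # dropout rate in [0, 1)
        self.rng = np.random.RandomState(seed)
        self.mask = None

    def forward(self, Q, is_train=True):
        r = self.r if is_train else 0.0         # rate is 0 during testing
        p = self.rng.uniform(0.0, 1.0, Q.shape) # re-sampled every forward pass
        self.mask = (p >= r) / (1.0 - r)        # kept for the backward pass
        return Q * self.mask                    # Eqn. (17)

    def backward(self, Q, grad_S):
        return grad_S * self.mask               # Eqn. (19)

relu = ReluSketch()
X = np.array([-1.0, 0.0, 2.0])
H = relu.forward(X)
grad_relu = relu.backward(X, np.ones_like(X))

drop = DropoutSketch(r=0.5)
Q = np.ones(1000)
S = drop.forward(Q)
grad_Q = drop.backward(Q, np.ones_like(Q))
S_test = drop.forward(Q, is_train=False)
```

Scaling the kept entries by 1/(1 − r) keeps the expected value of each output equal to its input, which is why no rescaling is needed when the rate is set to 0 at test time.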
Testing dnn misc.py with multi-layer perceptron (MLP)

Q3.3 (2 Points) What to do and submit: run script q33.sh. It will output MLP lr0.01 m0.0 w0.0 d0.0.json.
Add, commit, and push this file before the due date.
What it does: q33.sh will run python3 dnn mlp.py with learning rate 0.01, no momentum, no weight
decay, and dropout rate 0.0. The output file stores the training and validation accuracies over 30 training
epochs.

Q3.4 (2 Points) What to do and submit: run script q34.sh. It will output MLP lr0.01 m0.0 w0.0 d0.5.json.
Add, commit, and push this file before the due date.
What it does: q34.sh will run python3 dnn mlp.py –dropout rate 0.5 with learning rate 0.01,
no momentum, no weight decay, and dropout rate 0.5. The output file stores the training and validation
accuracies over 30 training epochs.

Q3.5 (2 Points) What to do and submit: run script q35.sh. It will output MLP lr0.01 m0.0 w0.0 d0.95.json.
Add, commit, and push this file before the due date.
What it does: q35.sh will run python3 dnn mlp.py –dropout rate 0.95 with learning rate 0.01,
no momentum, no weight decay, and dropout rate 0.95. The output file stores the training and validation
accuracies over 30 training epochs.

You will observe that the model in Q3.4 gives better validation accuracy (at epoch 30) than the one in Q3.3. Specifically, dropout is widely used to prevent over-fitting. However, if we use too large a dropout rate (like the one in Q3.5), the validation accuracy (together with the training accuracy) will be relatively lower, essentially under-fitting the training data.

Q3.6 (2 Points) What to do and submit: run script q36.sh. It will output LR lr0.01 m0.0 w0.0 d0.0.json.
Add, commit, and push this file before the due date.

What it does: q36.sh will run python3 dnn mlp nononlinear.py with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.0. The output file stores the training and validation accuracies
over 30 training epochs.

The network has the same structure as the one in Q3.3, except that we remove the relu (nonlinear) layer.
You will see that the validation accuracies drop significantly (the gap is around 0.03). Essentially, without
the nonlinear layer, the model is learning multinomial logistic regression similar to Q2.3.
Testing dnn misc.py with convolutional neural networks (CNN)

Q3.7 (2 Points) What to do and submit: run script q37.sh. It will output CNN lr0.01 m0.0 w0.0 d0.5.json.
Add, commit, and push this file before the due date.

What it does: q37.sh will run python3 dnn cnn.py with learning rate 0.01, no momentum, no weight
decay, and dropout rate 0.5. The output file stores the training and validation accuracies over 30 training
epochs.

Q3.8 (2 Points) What to do and submit: run script q38.sh. It will output CNN lr0.01 m0.9 w0.0 d0.5.json.
Add, commit, and push this file before the due date.

What it does: q38.sh will run python3 dnn cnn.py –alpha 0.9 with learning rate 0.01, momentum 0.9, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies over 30 training epochs.

You will see that Q3.8 will lead to faster convergence than Q3.7 (i.e., the training/validation accuracies
will be higher than 0.94 after 1 epoch). That is, using momentum will lead to more stable updates of the
parameters.
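
To give some intuition for the effect of momentum, here is a toy sketch of one common form of the momentum update, applied to the one-dimensional objective 0.5·w². The update rule shown is an assumption (the provided scripts may use a different but related formula), and the function names and the toy objective are made up for illustration.

```python
def sgd_momentum_step(w, v, grad, learning_rate, alpha):
    # One common momentum update (assumed form): the velocity v accumulates
    # an exponentially decaying sum of past gradients.
    v = alpha * v - learning_rate * grad
    return w + v, v

def run(alpha, steps=50, learning_rate=0.01):
    # Minimize 0.5 * w**2 starting from w = 1; its gradient is simply w.
    w, v = 1.0, 0.0
    for _ in range(steps):
        w, v = sgd_momentum_step(w, v, w, learning_rate, alpha)
    return w

w_plain = run(alpha=0.0)      # no momentum
w_momentum = run(alpha=0.9)   # momentum 0.9
```

With the same learning rate and number of steps, the run with momentum ends closer to the minimizer w = 0 than the run without it on this toy problem.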

## Coding: Building a deeper architecture

Q3.9 (2 Points) The CNN architecture in dnn cnn.py has only one convolutional layer. In this question, you are going to construct a two-convolutional-layer CNN (see Fig. 4) using the modules you implemented in Q3.2. Please modify the main function in dnn cnn 2.py. The code in dnn cnn 2.py is similar to that in dnn cnn.py, except that there are a few parts marked as TODO. You need to fill in your code so as to construct the CNN in Fig. 4.

[Figure 4 diagram: x → conv → relu → max-p → conv → relu → max-p → flatten → dropout → linear → softmax → ŷ]

Figure 4: The diagram of the CNN you are going to implement in dnn cnn 2.py. The term conv stands for convolution; max-p stands for max pooling. The circles correspond to variables and edges correspond to modules. Note that the input to CNN may not be a vector (e.g., in dnn cnn 2.py it is an image, which can be represented as a 3-dimensional tensor). The flatten layer is to reshape its input into a vector.

What to do and submit: Finish the implementation of the main function in dnn cnn 2.py (search for TODO
in main). Submit your completed dnn cnn 2.py.
Testing dnn cnn 2.py

Q3.10 (2 Points) What to do and submit: run script q310.sh. It will output CNN2 lr0.001 m0.9 w0.0 d0.5.json.
Add, commit, and push this file before the due date.

What it does: q310.sh will run python3 dnn cnn 2.py –alpha 0.9 with learning rate 0.01, momentum 0.9, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies
over 30 training epochs.
You will see that you can achieve slightly higher validation accuracies than those in Q3.8.