## Description

Q1 (10 points): Let $f'(x)$ denote the derivative of a function $f(x)$ w.r.t. the variable $x$.

(a) 2 pts: What does $f'(x)$ intend to measure?

(b) 2 pts: Let $h(x) = f(g(x))$. What is $h'(x)$?

(c) 2 pts: Let $h(x) = f(x)g(x)$. What is $h'(x)$?

(d) 2 pts: Let $f(x) = a^x$, where $a > 0$. What is $f'(x)$?

(e) 2 pts: Let $f(x) = x^{10} - 2x^{8} + \frac{4}{x^{2}} + 10$. What is $f'(x)$?
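Hand-derived answers to questions like these can be sanity-checked numerically. The sketch below is an illustration only: `numeric_derivative` is a hypothetical helper (not part of the assignment), and `cube` is an assumed example function rather than one of the functions above.

```python
def numeric_derivative(f, x, eps=1e-6):
    """Approximate f'(x) with a central difference (hypothetical helper)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def cube(x):
    # Assumed example function; by the power rule its derivative is 3x^2.
    return x ** 3


if __name__ == "__main__":
    x0 = 2.0
    approx = numeric_derivative(cube, x0)
    exact = 3 * x0 ** 2
    # The two values should agree up to a small numerical error.
    print(approx, exact)
```

A close match does not prove a derivative is correct, but a large gap is a reliable sign of an algebra mistake.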

Q2 (15 points): The logistic function is $f(x) = \frac{1}{1+e^{-x}}$. The tanh function is $g(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$.

(a) 5 pts: Prove that $f'(x) = f(x)(1 - f(x))$.

(b) 5 pts: Prove that $g'(x) = 1 - g^{2}(x)$.

(c) 5 pts: Prove that $g(x) = 2f(2x) - 1$.
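Before writing the proofs, it can help to confirm numerically that the three identities hold at a few points. The following is a minimal sketch and not a substitute for a proof; `logistic`, `numeric_derivative`, and `max_identity_gaps` are hypothetical names introduced for illustration, and the sample points are arbitrary.

```python
import math


def logistic(x):
    """The logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def numeric_derivative(f, x, eps=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def max_identity_gaps(points):
    """Largest absolute gap, over the given points, for identities (a), (b), (c)."""
    gap_a = max(abs(numeric_derivative(logistic, x)
                    - logistic(x) * (1 - logistic(x))) for x in points)
    gap_b = max(abs(numeric_derivative(math.tanh, x)
                    - (1 - math.tanh(x) ** 2)) for x in points)
    gap_c = max(abs(math.tanh(x) - (2 * logistic(2 * x) - 1)) for x in points)
    return gap_a, gap_b, gap_c


if __name__ == "__main__":
    # Arbitrary sample points; all three gaps should be close to zero.
    print(max_identity_gaps([-2.0, -0.5, 0.0, 0.7, 3.0]))
```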

Q3 (15 points): Let us denote the partial derivative of a multi-variate function $f$ w.r.t. one of its variables $x$ by $f'_x$ or $\frac{df}{dx}$.

(a) 2 pts: What is $f'_x$ trying to measure?

(b) 2 pts: Let $f(x, y) = x^{3} + 3x^{2}y + y^{3} + 2x$. What is $f'_x$? What is $f'_y$?

(c) 2 pts: Let $z = \sum_{i=1}^{n} w_i x_i$. What is $\frac{dz}{dw_i}$?

(d) 4 pts: Let $f(z) = \frac{1}{1+e^{-z}}$ and $z = \sum_{i=1}^{n} w_i x_i$.

What is $\frac{df}{dz}$?

What is $\frac{df}{dw_i}$?

Hint: Use the answers that contain f(z).

(e) 5 pts: Let $E(z) = \frac{1}{2}(t - f(z))^{2}$, $f(z) = \frac{1}{1+e^{-z}}$ and $z = \sum_{i=1}^{n} w_i x_i$. What is $\frac{dE}{dw_i}$? Hint: the answer should contain $f(z)$.

Q4 (10 points): The softmax function:

(a) 5 pts: In general where in NNs is the softmax function used and why?

(b) 5 pts: If a vector x is [1, 2, 3, -1, -4, 0], what is the value of softmax(x)?
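For computing softmax values by machine, a minimal sketch is given below. It assumes the standard definition, softmax(x)_i = e^{x_i} / Σ_j e^{x_j}; the function name `softmax` and the example scores are illustrative assumptions and are not taken from the assignment. Subtracting the maximum before exponentiating is a common numerical-stability choice that leaves the result unchanged.

```python
import math


def softmax(values):
    """Map a list of real scores to probabilities that sum to 1."""
    shift = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    scores = [0.0, 1.0, 2.0]  # assumed example, not the vector in (b)
    probs = softmax(scores)
    print(probs, sum(probs))
```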

Q5 (15 points): Suppose a feedforward neural network has $m$ layers: the input layer is the 1st layer, the output layer is the last layer, and there are $m-2$ hidden layers in between. The number of neurons in the $i$-th layer is $n_i$. Each neuron in one layer is connected to every neuron in the next layer and there is no other connection.

(a) 5 pts: How many connections (i.e., weights) are there in this network?

(b) 10 pts: Let $x$ be a column vector that denotes the values of the input layer. Let $M_k$ denote the weight matrix between layer $k$ and $k+1$; that is, the cell $a_{i,j}$ in $M_k$ stores the weight on the arc from the $j$-th neuron in layer $k$ to the $i$-th neuron in layer $k+1$. Let $g$ be the activation function used in each layer.

• Given the input x, what is the formula for calculating the output of the first hidden layer?

• Given the input x, what is the formula for calculating the output of the output layer?

• Hint: In class, we showed the formula for calculating the $z$ and $y$ values for a neuron, where $z = b + \sum_j w_j x_j$ and $y = g(z)$. Now there are $n_2$ neurons in the 2nd layer. The output of this layer, $y$, is going to be a column vector, not a real number. The weights between the two layers are no longer a vector, but an $n_2 \times n_1$ matrix denoted by $M_1$. So the answer to the 1st question should be a simple formula that uses matrix operations. For the sake of simplicity, let’s assume the bias $b$ is always zero.

• Terminology: A row vector is a $1 \times n$ matrix (e.g., $[a_1, a_2, \ldots, a_n]$); a column vector is an $n \times 1$ matrix. If you transpose a row vector, you get a column vector.
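The single-neuron formula from the hint and the row/column terminology can be sketched in code as follows. This is an illustration under stated assumptions: the assignment leaves the activation $g$ unspecified, so the sigmoid here is an assumed choice, and `neuron_output` and `transpose_row` are hypothetical helper names.

```python
import math


def sigmoid(z):
    # Assumed activation; the assignment does not fix a particular g.
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, inputs, bias=0.0, g=sigmoid):
    """Compute z = b + sum_j w_j * x_j, then y = g(z), for a single neuron."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return g(z)


def transpose_row(row):
    """Turn a row vector [a1, ..., an] into an n x 1 column vector."""
    return [[a] for a in row]


if __name__ == "__main__":
    # Illustrative weights and inputs for one neuron with zero bias.
    print(neuron_output([1.0, -1.0], [2.0, 2.0]))
    print(transpose_row([1, 2, 3]))
```

Each neuron in a layer has its own weight vector, which is why the weights between two layers are collected into a matrix.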

Q6 (40 points): Read Chapter 1 of the NN book, and answer the following based on that chapter:

(a) 5 pts: What’s the loss function used in the digit recognition task? Why do they choose to minimize this function instead of maximizing classification accuracy?

(b) 10 pts: In gradient descent, what’s the formula for updating the weight matrix (or vector)? And

why is that a good formula?

(c) 15 pts: What are the main idea and benefit of stochastic gradient descent?

What is a training epoch?

Let $T$ be the size of the training data, $m$ be the size of a mini-batch, and suppose your training process contains $E$ training epochs. How many times is each weight in the NN updated?

(d) 10 pts: How can one choose the learning rate? What’s the risk if the rate is too big? What’s

the risk if the rate is too small?

Q7 (25 points): Go over the source code in Nielsen’s package stored under /dropbox/18-19/572/hw9/nielsennn/ on patas and understand the part explained in Chapter 1 of the NN book.

• Run the code (following the instructions in chapter 1) and fill out Table 1. For this exercise, use

only one hidden layer.

• It seems that the code works with Python 2.*, not with Python 3.*. If you run the default Python version on patas, which is 2.7.5, the code should work.

• Note that as the package uses random functions a few times, your results will not be the same

when running it multiple times.

Table 1: Results on digit recognition

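| Expt id | # of hidden neurons | # of epochs | Mini-batch size | Learning rate | Accuracy |
|---|---|---|---|---|---|
| 1 | 30 | 30 | 10 | 3.0 | |
| 2 | 10 | 30 | 10 | 3.0 | |
| 3 | 30 | 30 | 10 | 0.5 | |
| 4 | 30 | 30 | 10 | 10 | |
| 5 | 30 | 30 | 100 | 3.0 | |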

Submission: Submit the following to Canvas:

• Since hw9 has no coding part, you only need to submit your readme.pdf, which includes answers to all the questions, plus anything you want the TA to know. No need to submit anything else.