## Description

Problem 1. (15 points). Consider one layer of a ReLU network. The feature vector $\vec{x}$ is $d$-dimensional. The linear transformation is an $m \times d$ matrix $W$. The output of the ReLU network is an $m$-dimensional vector $y$ given by $y = \max\{0, W\vec{x}\}$, where the max is applied component-wise.

• Suppose $\vec{x}$ is fixed, and all its entries are non-zero.

• Suppose the entries of $W$ are all independent and distributed according to a Gaussian distribution with mean 0 and standard deviation 1 (an $N(0, 1)$ distribution).

1. Show that the expected number of non-zero entries in the output is m/2.

2. Suppose $\|\vec{x}\|_2^2 = \sigma^2$. What is the distribution of each entry in $W\vec{x}$ (the output before applying the ReLU function)?

3. What is the mean of each entry in $y$ (after applying the ReLU function)?
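A quick simulation can serve as a sanity check on the first claim before attempting the proof. The sketch below is only illustrative: the input vector, the value of m, the number of trials, and the helper name `relu_layer_stats` are all assumptions chosen for the example, not part of the assignment. It draws W with independent N(0, 1) entries many times and reports the average number of non-zero outputs, together with the empirical mean of a single output entry, which can be compared against a hand-derived answer.

```python
import random

def relu_layer_stats(x, m, trials, seed=0):
    """Monte Carlo estimate of (avg. number of non-zero outputs, mean output entry)."""
    rng = random.Random(seed)
    total_nonzero = 0
    total_output = 0.0
    for _ in range(trials):
        for _ in range(m):
            # One row of W: d independent N(0, 1) entries, dotted with x.
            pre_activation = sum(rng.gauss(0.0, 1.0) * xi for xi in x)
            out = max(0.0, pre_activation)  # component-wise ReLU
            if out > 0.0:
                total_nonzero += 1
            total_output += out
    return total_nonzero / trials, total_output / (trials * m)

# Hypothetical fixed input with all entries non-zero (chosen for illustration).
x = [1.0, -2.0, 0.5]
m = 10
avg_nonzero, avg_entry = relu_layer_stats(x, m, trials=20000)
print("average number of non-zero outputs:", avg_nonzero)  # expected to be near m / 2
print("empirical mean of one output entry:", avg_entry)
```

A simulation like this does not replace the proof; it only suggests that the symmetry of the Gaussian around zero is the property to exploit.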

Problem 2. (10 points). Consider the setting of the previous problem, with $m = 2$ and $d = 2$. Let

$$W = \begin{pmatrix} 1 & 2 \\ -2 & 3 \end{pmatrix}, \qquad \vec{x} = \begin{pmatrix} 2 \\ -3 \end{pmatrix}.$$

Consider the function $L = \max\{\sigma(W^{(1)}\vec{x}),\ \sigma(W^{(2)}\vec{x})\}$, where $\sigma$ is the sigmoid function and $W^{(i)}$ denotes the $i$th row of $W$. Please draw the computational graph for this function, and compute the gradients (which will be Jacobians at some nodes!).
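One way to check hand-computed gradients for this problem is to compare them against central finite differences. The following is a minimal sketch under that assumption; the helper names `loss` and `numerical_grad_x` are hypothetical and introduced only for this example. It evaluates the forward pass for the given W and x and numerically approximates the gradient of L with respect to x.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss(W, x):
    # L = max over rows i of sigmoid(<i-th row of W, x>)
    return max(sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W)

def numerical_grad_x(W, x, eps=1e-6):
    """Central finite-difference approximation of dL/dx."""
    grads = []
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        grads.append((loss(W, x_plus) - loss(W, x_minus)) / (2 * eps))
    return grads

W = [[1.0, 2.0], [-2.0, 3.0]]
x = [2.0, -3.0]

# Pre-activations: one inner product per row of W.
pre = [sum(w * xi for w, xi in zip(row, x)) for row in W]
print("W x =", pre)  # [-4.0, -13.0]
print("L =", loss(W, x))
print("numerical dL/dx =", numerical_grad_x(W, x))
```

The same finite-difference idea extends to the entries of W. Note that the max node passes gradient only to the branch that attains the maximum, which is worth keeping in mind when drawing the graph.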

Problem 3. (10 points). Given inputs $z_1, z_2 \in \mathbb{R}$, the softmax function is the following:

$$\hat{y} = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}.$$


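Let $y \in \{0, 1\}$, and define the cross-entropy loss between $y$ and $\hat{y}$ to be

$$L(y, \hat{y}) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}).$$

Prove that:

$$\frac{\partial L(y, \hat{y})}{\partial z_1} = \hat{y} - y, \qquad \frac{\partial L(y, \hat{y})}{\partial z_2} = y - \hat{y}.$$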
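Before writing the proof, the two identities can be checked numerically. The sketch below is one possible check, with hypothetical helper names (`softmax_first`, `cross_entropy`, `finite_diff`) and arbitrarily chosen test values for z1 and z2. It compares central finite differences of the loss against the claimed closed forms for both values of y.

```python
import math

def softmax_first(z1, z2):
    # Subtract the max before exponentiating for numerical stability.
    shift = max(z1, z2)
    e1, e2 = math.exp(z1 - shift), math.exp(z2 - shift)
    return e1 / (e1 + e2)

def cross_entropy(y, z1, z2):
    y_hat = softmax_first(z1, z2)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def finite_diff(y, z1, z2, eps=1e-6):
    """Central finite-difference estimates of dL/dz1 and dL/dz2."""
    d1 = (cross_entropy(y, z1 + eps, z2) - cross_entropy(y, z1 - eps, z2)) / (2 * eps)
    d2 = (cross_entropy(y, z1, z2 + eps) - cross_entropy(y, z1, z2 - eps)) / (2 * eps)
    return d1, d2

z1, z2 = 0.3, -1.2  # arbitrary test point (an assumption for illustration)
y_hat = softmax_first(z1, z2)
for y in (0, 1):
    d1, d2 = finite_diff(y, z1, z2)
    print(f"y={y}: dL/dz1={d1:.6f} vs y_hat - y={y_hat - y:.6f}; "
          f"dL/dz2={d2:.6f} vs y - y_hat={y - y_hat:.6f}")
```

Agreement at a few test points is evidence, not a proof; the derivation itself follows from the chain rule applied through the softmax.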

Problem 4. (15 points). Consider the datapoints in Figure 1: $(-2, 0)$, $(2, 0)$ are crosses, and $(0, 2)$, $(0, -2)$ are circles. Let the crosses be labeled $+1$, and the circles be labeled $-1$. In this problem the goal is to design a neural network with no error on this dataset.

[Figure 1: Neural Networks. The four datapoints plotted on axes ranging from −4 to 4.]

To make things simple, consider the following generalization. We first append a $+1$ to each input and form a new dataset as follows: $(-2, 0, 1)$, $(2, 0, 1)$ are labeled $+1$, and $(0, 2, 1)$, $(0, -2, 1)$ are labeled $-1$. Note that the last feature is redundant.

We consider the following basic units for our neural networks: a linear transformation followed by hard thresholding. Each unit has three parameters $w_1, w_2, w_3$. The output of the unit is the sign of the inner product of the parameters with the input.

1. Design a neural network with these units that makes no error on the datapoints above. (Hint: You can take two units in the first layer and one in the output layer, for a total of three units.)


2. Show that if you design a neural network with ONLY one such unit, then the points cannot all be classified correctly.
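A small harness makes it easy to test candidate designs against the four datapoints. The sketch below rests on several assumptions that the problem statement does not fix: `sign(0)` is taken to be +1, the output unit receives the two hidden outputs with a constant 1 appended (mirroring how the inputs are augmented), and the weights shown are just one illustrative choice rather than an official solution. It also runs a random search over single units, which finds no perfect classifier; that is consistent with part 2 but is not a proof.

```python
import random

def sign(t):
    # Assumed convention: sign(0) = +1 (not specified in the problem).
    return 1 if t >= 0 else -1

def unit(w, inputs):
    """Basic unit: hard threshold of the inner product of weights and inputs."""
    return sign(sum(wi * vi for wi, vi in zip(w, inputs)))

# Augmented dataset from the problem statement: (point, label).
DATA = [((-2, 0, 1), 1), ((2, 0, 1), 1), ((0, 2, 1), -1), ((0, -2, 1), -1)]

def two_layer(point, w_a, w_b, w_out):
    # Hidden outputs with a constant 1 appended (an assumption of this sketch).
    hidden = (unit(w_a, point), unit(w_b, point), 1)
    return unit(w_out, hidden)

# One illustrative choice of weights (hypothetical, not the official answer).
w_a, w_b, w_out = (1, 0, -1), (-1, 0, -1), (1, 1, 1)
predictions = [two_layer(p, w_a, w_b, w_out) for p, _ in DATA]
print(predictions)  # [1, 1, -1, -1]

# Random search over single units: none classifies all four points.
rng = random.Random(0)
found_single_unit = False
for _ in range(10000):
    w = [rng.uniform(-5, 5) for _ in range(3)]
    if all(unit(w, p) == label for p, label in DATA):
        found_single_unit = True
print(found_single_unit)  # False
```

For the proof in part 2, it may help to compare the sum of the two cross points with the sum of the two circle points in the augmented dataset.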

Problem 5. (40 points). See attached notebook for details.
