## Description

## 1 Softmax

Prove that softmax is invariant to constant offsets in the input; that is, for any input vector $x$ and any constant $c$,

$$\mathrm{softmax}(x) = \mathrm{softmax}(x + c)$$

where $x + c$ means adding the constant $c$ to every dimension of $x$.

Note: In practice, we make use of this property and choose $c = -\max_i x_i$ when computing softmax probabilities for numerical stability (i.e., subtracting its maximum element from all elements of $x$).
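The max-subtraction trick described in the note can be sketched as follows (a minimal NumPy version; the function name and 1-D restriction are my choices, not from the handout):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax for a 1-D input vector.

    Subtracting max(x) uses the offset invariance proved above:
    the result is unchanged, but np.exp never overflows.
    """
    shifted = x - np.max(x)       # choose c = -max_i x_i
    exps = np.exp(shifted)
    return exps / np.sum(exps)
```

With this shift, `softmax(np.array([1000.0, 1001.0, 1002.0]))` returns finite probabilities, whereas a naive `np.exp(1000.0)` would overflow.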

## 2 Neural Network Basics

(a) Derive the gradient of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., as some expression in which only $\sigma(x)$, but not $x$, is present). Assume that the input $x$ is a scalar for this question.
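Once you have a candidate expression, a central-difference check is a quick way to validate it numerically. This sketch verifies the textbook identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ at one point (the test point and step size are arbitrary choices of mine):

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

# Central-difference approximation of sigma'(x) at a single scalar input.
x = 0.7
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# Candidate closed form, written purely in terms of the function value.
analytic = sigmoid(x) * (1.0 - sigmoid(x))
```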

(b) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e., find the gradients with respect to the softmax input vector $\theta$, when the prediction is made by $\hat{y} = \mathrm{softmax}(\theta)$. Remember that the cross entropy function is

$$CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)$$

where $y$ is the one-hot label vector and $\hat{y}$ is the predicted probability vector for all classes. (Hint: you might want to consider the fact that many elements of $y$ are zeros, and assume that only the $k$-th dimension of $y$ is one.)
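Whatever form your derivation produces, a finite-difference gradient is a useful cross-check. This sketch (function and variable names are mine, not from the handout) computes the numeric gradient of $CE(y, \mathrm{softmax}(\theta))$; one property you can verify immediately is that its components sum to zero, since both $\hat{y}$ and $y$ sum to one:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def ce_loss(theta, y):
    """Cross entropy between a one-hot label y and softmax(theta)."""
    return -np.sum(y * np.log(softmax(theta)))

# Central-difference gradient of the loss with respect to theta.
theta = np.array([0.2, -1.3, 0.5, 0.1])
y = np.array([0.0, 0.0, 1.0, 0.0])   # k = 2 is the true class
eps = 1e-6
grad = np.zeros_like(theta)
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = eps
    grad[i] = (ce_loss(theta + d, y) - ce_loss(theta - d, y)) / (2 * eps)
```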

(c) Derive the gradients with respect to the inputs $x$ to a one-hidden-layer neural network (that is, find $\frac{\partial J}{\partial x}$, where $J$ is the cost function for the neural network). The neural network employs the sigmoid activation function for the hidden layer and softmax for the output layer. Assume the one-hot label vector is $y$, and cross entropy cost is used. (Feel free to use $\sigma'(x)$ as shorthand for the sigmoid gradient, and feel free to define any variables whenever you see fit.)


CS 224d: Assignment #1

[Figure: network architecture, $x \to h \to \hat{y}$]

Recall that the forward propagation is as follows:

$$h = \mathrm{sigmoid}(xW_1 + b_1) \qquad \hat{y} = \mathrm{softmax}(hW_2 + b_2)$$

Note that here we're assuming that the input vector (and thus the hidden variables and output probabilities) is a row vector, to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we apply it to each element of that vector. $W_i$ and $b_i$ ($i = 1, 2$) are the weights and biases, respectively, of the two layers.
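The forward propagation above, with the row-vector convention, can be sketched directly in NumPy (the toy dimensions $D_x = 3$, $H = 4$, $D_y = 2$ are mine, chosen only for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, W1, b1, W2, b2):
    """Forward propagation exactly as in the text, using row vectors."""
    h = sigmoid(x @ W1 + b1)        # hidden layer, shape (1, H)
    y_hat = softmax(h @ W2 + b2)    # output probabilities, shape (1, Dy)
    return h, y_hat

# Toy dimensions: Dx = 3, H = 4, Dy = 2.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 3))
W1, b1 = rng.standard_normal((3, 4)), np.zeros((1, 4))
W2, b2 = rng.standard_normal((4, 2)), np.zeros((1, 2))
h, y_hat = forward(x, W1, b1, W2, b2)
```

Shapes carry through as $(1, D_x)(D_x, H) \to (1, H)$ and $(1, H)(H, D_y) \to (1, D_y)$, which is a handy sanity check while deriving the backward pass.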

(d) How many parameters are there in this neural network, assuming the input is $D_x$-dimensional, the output is $D_y$-dimensional, and there are $H$ hidden units?

## 3 word2vec

(a) Assume you are given a predicted word vector $\hat{r}$ ($v_{w_I}$ in the lecture notes, in the case of skip-gram), and word prediction is made with the softmax function found in word2vec models:

$$\Pr(\mathrm{word}_i \mid \hat{r}, w) = \frac{\exp(w_i^\top \hat{r})}{\sum_{j=1}^{|V|} \exp(w_j^\top \hat{r})}$$

where $w_j$ ($j = 1, \ldots, |V|$) are the "output" word vectors for all words in the vocabulary ($v'_w$ in the lecture notes). Assume cross entropy cost is applied to this prediction, and word $i$ is the expected word (the $i$-th element of the one-hot label vector is one); derive the gradients with respect to $\hat{r}$.
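The prediction formula above can be sketched as a stable softmax over all vocabulary scores (a toy setup with hypothetical dimensions $|V| = 5$, $d = 3$; the matrix layout, one output vector per row, is my convention):

```python
import numpy as np

def word_probs(r_hat, W):
    """Softmax word prediction; row j of W is the 'output' vector w_j (shape |V| x d)."""
    scores = W @ r_hat                    # w_j^T r_hat for every word j
    e = np.exp(scores - np.max(scores))   # shift scores for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 3))   # toy vocabulary: |V| = 5 output vectors, d = 3
r_hat = rng.standard_normal(3)
p = word_probs(r_hat, W)
```

Note that evaluating this distribution touches every one of the $|V|$ output vectors, which is exactly the cost that negative sampling in part (c) avoids.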

(b) In the previous problem, derive the gradients for the "output" word vectors $w_j$ (including $w_i$).

(c) Repeat parts (a) and (b), assuming we are using the negative sampling loss for the predicted vector $\hat{r}$, and the expected output word is $w_i$. Assume that $K$ negative samples are drawn and that they are $w_1, \ldots, w_K$, respectively, for simplicity of notation ($i \notin \{1, \ldots, K\}$). Recall that the negative sampling loss function in this case is

$$J(\hat{r}, w_i, w_{1:K}) = -\log(\sigma(w_i^\top \hat{r})) - \sum_{k=1}^{K} \log(\sigma(-w_k^\top \hat{r}))$$

where $\sigma(\cdot)$ is the sigmoid function.

After you've done this, describe in one sentence why this cost function is much more efficient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e., the runtime of the softmax-CE loss divided by the runtime of the negative sampling loss).

Note: the cost function here is the negative of what Mikolov et al. had in their original paper, because we are doing a minimization instead of a maximization in our code.
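The negative sampling loss above translates directly into code; note that it touches only the $K + 1$ vectors involved, rather than all $|V|$ output vectors as softmax-CE does (the helper name and the toy sizes $d = 4$, $K = 5$ are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(r_hat, w_i, w_neg):
    """Negative sampling loss as written above; w_neg stacks the K sampled vectors row-wise."""
    pos = -np.log(sigmoid(w_i @ r_hat))                 # expected output word term
    neg = -np.sum(np.log(sigmoid(-(w_neg @ r_hat))))    # sum over the K negative samples
    return pos + neg

rng = np.random.default_rng(3)
d, K = 4, 5
r_hat = rng.standard_normal(d)
w_i = rng.standard_normal(d)
w_neg = rng.standard_normal((K, d))
loss = neg_sampling_loss(r_hat, w_i, w_neg)
```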

(d) Derive gradients for all of the word vectors for skip-gram and CBOW (optional), given the previous parts and a set of context words $[\mathrm{word}_{i-C}, \ldots, \mathrm{word}_{i-1}, \mathrm{word}_i, \mathrm{word}_{i+1}, \ldots, \mathrm{word}_{i+C}]$, where $C$ is the context size. You can denote the "input" and "output" word vectors for $\mathrm{word}_k$ as $v_{w_k}$ and $v'_{w_k}$, respectively, for convenience. (Hint: feel free to use $F(v_{w_O} \mid \hat{r})$ as a placeholder for the softmax-CE or negative sampling cost in this part; you'll see that this is a useful abstraction for the coding part.)

Recall that for skip-gram, the cost for a context is

$$J_{\text{skip-gram}}(\mathrm{word}_{i-C \ldots i+C}) = \sum_{-C \le j \le C,\ j \ne 0} F(v'_{w_{i+j}} \mid v_{w_i})$$

For (a simpler variant of) CBOW, we sum up the input word vectors in the context:

$$\hat{r} = \sum_{-C \le j \le C,\ j \ne 0} v_{w_{i+j}}$$

Then the CBOW cost is

$$J_{\text{CBOW}}(\mathrm{word}_{i-C \ldots i+C}) = F(v'_{w_i} \mid \hat{r})$$
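Forming the CBOW predicted vector $\hat{r}$ is a single sum over the context rows of the input-vector matrix (a toy sketch; the vocabulary size, dimension, and word indices here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
V_in = rng.standard_normal((10, 4))   # 'input' vectors v_w, one row per vocabulary word

# Context word indices word_{i-C}..word_{i+C}, skipping the center word i = 4.
context = [2, 3, 5, 6]
r_hat = V_in[context].sum(axis=0)     # CBOW predicted vector
```

Because every context word contributes through the same sum, the gradient of $F$ with respect to $\hat{r}$ is shared by all of the context input vectors, which is worth noticing before deriving part (d).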

## 4 Sentiment Analysis

(a) Explain in fewer than three sentences why we want to introduce regularization when doing classification (in fact, in most machine learning tasks).

(b) Plot the classification accuracy on the dev set with respect to the regularization value, using a logarithmic scale on the x-axis. Briefly explain, in fewer than three sentences, what you see in the plot.
