Prove that softmax is invariant to constant offsets in the input, that is, for any input vector x and any constant c,

softmax(x) = softmax(x + c)

where x + c means adding the constant c to every dimension of x.
Note: In practice, we make use of this property and choose c = −max_i x_i when computing softmax probabilities for numerical stability (i.e. subtracting its maximum element from all elements of x).
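This invariance is what makes the max-subtraction trick safe; a minimal NumPy sketch (the input values here are illustrative) demonstrates both the invariance and the overflow protection:

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; by the invariance
    # property this does not change the result.
    shifted = x - np.max(x)
    e = np.exp(shifted)
    return e / e.sum()

x = np.array([1001.0, 1002.0, 1003.0])
# Naive np.exp(x) would overflow here; the shifted version is fine,
# and softmax(x) == softmax(x + c) for any constant c.
p = softmax(x)
q = softmax(x + 100.0)
print(np.allclose(p, q))         # True
print(np.isclose(p.sum(), 1.0))  # True
```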
2 Neural Network Basics
(a) Derive the gradient of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., in some expression where only σ(x), but not x, is present). Assume that the input x is a scalar for this question.
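Once derived, the result can be sanity-checked numerically; the sketch below assumes the standard identity σ′(x) = σ(x)(1 − σ(x)) and compares it against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    # Takes the function value s = sigmoid(x), not x itself:
    # sigma'(x) = sigma(x) * (1 - sigma(x)).
    return s * (1.0 - s)

x = 0.7
s = sigmoid(x)
analytic = sigmoid_grad(s)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(abs(analytic - numeric) < 1e-8)  # True
```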
(b) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e. find the gradients with respect to the softmax input vector θ, when the prediction is made by ŷ = softmax(θ). Remember the cross entropy function is

CE(y, ŷ) = −∑_i y_i log(ŷ_i)

where y is the one-hot label vector, and ŷ is the predicted probability vector for all classes. (Hint: you might want to consider the fact that many elements of y are zeros, and assume that only the k-th dimension of y is one.)
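The derived gradient can likewise be verified with a finite-difference check; the sketch below assumes the standard result ∂CE/∂θ = ŷ − y (the input values are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def ce(y, y_hat):
    return -np.sum(y * np.log(y_hat))

theta = np.array([0.2, -1.3, 0.5, 0.05])
y = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot, k = 2

# Analytic gradient (the result the derivation should arrive at):
analytic = softmax(theta) - y

# Finite-difference check of each component.
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = eps
    numeric[i] = (ce(y, softmax(theta + d)) - ce(y, softmax(theta - d))) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```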
(c) Derive the gradients with respect to the inputs x to a one-hidden-layer neural network (that is, find ∂J/∂x where J is the cost function for the neural network). The neural network employs the sigmoid activation function for the hidden layer, and softmax for the output layer. Assume the one-hot label vector is y, and cross entropy cost is used. (Feel free to use σ′(x) as the shorthand for the sigmoid gradient, and feel free to define any variables whenever you see fit.)
CS 224d: Assignment #1
Recall that the forward propagation is as follows:

h = sigmoid(xW1 + b1)
ŷ = softmax(hW2 + b2)

Note that here we're assuming that the input vector (thus the hidden variables and output probabilities) is a row vector, to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we are applying it to each of the elements of that vector. Wi and bi (i = 1, 2) are the weights and biases, respectively, of the two layers.
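A minimal NumPy sketch of this forward pass, with made-up dimensions (Dx = 4, H = 5, Dy = 3) and everything kept as row vectors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Illustrative dimensions and randomly initialized parameters.
rng = np.random.default_rng(0)
Dx, H, Dy = 4, 5, 3
x = rng.standard_normal((1, Dx))                      # row-vector input
W1, b1 = rng.standard_normal((Dx, H)), np.zeros((1, H))
W2, b2 = rng.standard_normal((H, Dy)), np.zeros((1, Dy))

h = sigmoid(x @ W1 + b1)       # hidden layer, shape (1, H)
y_hat = softmax(h @ W2 + b2)   # output probabilities, shape (1, Dy)
print(y_hat.shape == (1, Dy), np.isclose(y_hat.sum(), 1.0))  # True True
```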
(d) How many parameters are there in this neural network, assuming the input is Dx-dimensional, the
output is Dy-dimensional, and there are H hidden units?
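For checking your count once you have worked it out by hand: W1 contributes Dx·H parameters, b1 contributes H, W2 contributes H·Dy, and b2 contributes Dy. A one-liner with illustrative sizes:

```python
# Total parameters of the one-hidden-layer network above:
# W1 (Dx x H) + b1 (H) + W2 (H x Dy) + b2 (Dy)
# = (Dx + 1) * H + (H + 1) * Dy.
def num_params(Dx, H, Dy):
    return (Dx + 1) * H + (H + 1) * Dy

# Illustrative (made-up) sizes: Dx = 10, H = 5, Dy = 3.
print(num_params(10, 5, 3))  # 73
```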
3 word2vec

(a) Assume you are given a predicted word vector r̂ (v_{w_I} in the lecture notes, in the case of skip-gram), and word prediction is made with the softmax function found in word2vec models

Pr(word_i | r̂, w) = exp(w_i^T r̂) / ∑_{j=1}^{|V|} exp(w_j^T r̂)

where w_j (j = 1, …, |V|) are the "output" word vectors for all words in the vocabulary (v′_w in the lecture notes). Assume cross entropy cost is applied to this prediction and word i is the expected word (the i-th element of the one-hot label vector is one); derive the gradients with respect to r̂.
(b) In the previous problem, derive the gradients for the "output" word vectors w_j's (including w_i).
(c) Repeat parts (a) and (b) assuming we are using the negative sampling loss for the predicted vector r̂, and the expected output word is w_i. Assume that K negative samples are drawn, and they are w_1, …, w_K, respectively, for simplicity of notation (i ∉ {1, …, K}). Recall that the negative sampling loss function in this case is

J(r̂, w_i, w_{1…K}) = −log(σ(w_i^T r̂)) − ∑_{k=1}^{K} log(σ(−w_k^T r̂))

where σ(·) is the sigmoid function.
After you've done this, describe in one sentence why this cost function is much more efficient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e. the runtime of the softmax-CE loss divided by the runtime of the negative sampling loss).
Note: the cost function here is the negative of what Mikolov et al. had in their original paper, because we are doing a minimization instead of a maximization in our code.
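A minimal sketch of the loss above (the vectors here are random and purely illustrative); note that it touches only K + 1 output vectors rather than all |V| of them, which is the source of the speed-up the question asks about:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(r_hat, w_target, w_negs):
    # J = -log(sigmoid(w_i . r_hat)) - sum_k log(sigmoid(-w_k . r_hat))
    pos = -np.log(sigmoid(w_target @ r_hat))
    neg = -np.sum(np.log(sigmoid(-(w_negs @ r_hat))))
    return pos + neg

# Illustrative vectors: one target word and K = 3 negative samples.
rng = np.random.default_rng(1)
r_hat = rng.standard_normal(5)
w_target = rng.standard_normal(5)
w_negs = rng.standard_normal((3, 5))
loss = neg_sampling_loss(r_hat, w_target, w_negs)
print(loss > 0)  # True: each -log(sigmoid(.)) term is nonnegative
```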
(d) Derive gradients for all of the word vectors for skip-gram and CBOW (optional) given the previous parts, given a set of context words [word_{i−C}, …, word_{i−1}, word_i, word_{i+1}, …, word_{i+C}], where C is the context size. You can denote the "input" and "output" word vectors for word_k as v_{w_k} and v′_{w_k} respectively, for convenience. (Hint: feel free to use F(v′_{w_O} | r̂) as a placeholder for the softmax-CE or negative sampling cost in this part; you'll see that this is a useful abstraction for the coding part.)
Recall that for skip-gram, the cost for a context is

J_skip-gram(word_{i−C…i+C}) = ∑_{−C ≤ j ≤ C, j ≠ 0} F(v′_{w_{i+j}} | v_{w_i})

For (a simpler variant of) CBOW, we sum up the input word vectors in the context

r̂ = ∑_{−C ≤ j ≤ C, j ≠ 0} v_{w_{i+j}}

then the CBOW cost is

J_CBOW(word_{i−C…i+C}) = F(v′_{w_i} | r̂)
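The hint about F as an abstraction can be sketched in code; here cost_fn is a stand-in for F (a toy softmax-CE is used, and all sizes are made up for illustration):

```python
import numpy as np

def make_softmax_ce(v_out):
    # Returns a cost_fn(target, r_hat) playing the role of F above:
    # softmax-CE over the "output" vectors v_out.
    def cost_fn(target, r_hat):
        scores = v_out @ r_hat
        e = np.exp(scores - scores.max())
        return -np.log(e[target] / e.sum())
    return cost_fn

def skipgram_cost(center, context, v_in, cost_fn):
    # Skip-gram: predict each context word from the center word's vector.
    r_hat = v_in[center]
    return sum(cost_fn(j, r_hat) for j in context)

def cbow_cost(center, context, v_in, cost_fn):
    # CBOW (simple variant): sum the context input vectors, predict center.
    r_hat = v_in[context].sum(axis=0)
    return cost_fn(center, r_hat)

rng = np.random.default_rng(2)
V, d = 8, 4                        # toy vocabulary size and dimension
v_in = rng.standard_normal((V, d))
v_out = rng.standard_normal((V, d))
F = make_softmax_ce(v_out)
print(skipgram_cost(3, [1, 2, 4, 5], v_in, F) > 0)  # True
print(cbow_cost(3, [1, 2, 4, 5], v_in, F) > 0)      # True
```

The same skipgram_cost/cbow_cost code works unchanged if cost_fn is swapped for a negative sampling loss, which is why the placeholder is useful in the programming part.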
4 Sentiment Analysis
(a) Explain in less than three sentences why we want to introduce regularization when doing classification (in fact, in most machine learning tasks).
(b) Plot the classification accuracy on the dev set with respect to the regularization value, using a logarithmic scale on the x-axis. Briefly explain, in less than three sentences, what you see in the plot.