Description
0 Warmup: Boolean Logic
A famous result from the early days of AI (see https://en.wikipedia.org/wiki/Perceptrons_(book)) proved that a linear classifier, such as a single layer of a neural network, cannot compute even simple functions such as binary XOR. We will show by example that a two-layer neural network, however, is perfectly capable of learning this pattern.
(a) Let $x \in \{0, 1\}$ and $y \in \{0, 1\}$. Express the XOR ($\oplus$) function as the composition of other boolean logic operations (NOT, AND, OR, implication, etc.). Argue graphically (i.e. plot or sketch $(x, y, \text{class} = f(x, y))$ for $x \in \{0, 1\}$ and $y \in \{0, 1\}$) that the operations you use in your composition are possible to compute with an ordinary (single-layer) linear classifier.
(b) Assume that each neuron has a step-function activation, i.e. $h_i(x, y) = \Theta(w_{i1} x + w_{i2} y + b_i)$, where:

$$\Theta(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$

Find weights that allow $h_i(x, y)$ to compute the single-layer functions from (a). Your classifier can have real-valued weights; only the data need be binary. (Hint: which operation is given by $\Theta(x + y - 0.5)$?)
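For concreteness, here is a tiny Python sketch of such a step-activation neuron; the usage line simply evaluates the hint's example $\Theta(x + y - 0.5)$ on all four binary inputs so you can read off which operation it computes (the weights for your answers to (b) are yours to derive).

```python
def step(z):
    """Step-function activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def neuron(w1, w2, b):
    """Single neuron h(x, y) = step(w1*x + w2*y + b) with real-valued weights."""
    return lambda x, y: step(w1 * x + w2 * y + b)

# The hint's neuron, step(x + y - 0.5), over all binary inputs:
h = neuron(1.0, 1.0, -0.5)
for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, "->", h(x, y))
```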
(c) Open the IPython notebook part0-XOR.ipynb and follow the instructions, using what you derived in
(a) and (b) to implement a very simple 2-layer network that can compute (i.e. classify) binary XOR.
1 Deep Networks for Named Entity Recognition
This section is based on an assignment originally created for CS 224N. A more detailed writeup can be found
at http://nlp.stanford.edu/~socherr/pa4_ner.pdf, and may be of use as a refresher.
In this section, we'll get to practice backpropagation and training deep networks to attack the task of Named Entity Recognition: predicting whether a given word, in context, represents one of four categories:

- Person (PER)
- Organization (ORG)
- Location (LOC)
- Miscellaneous (MISC)

We formulate this as a 5-class classification problem, using the four above classes and a null class (O) for words that do not represent a named entity (most words fall into this category).
The model is a 2-layer neural network, with an additional representation layer similar to what you saw with word2vec. Rather than averaging or sampling, here we explicitly represent context as a "window" consisting of a word concatenated with its immediate neighbors:

$$x^{(t)} = [L x_{t-1},\; L x_t,\; L x_{t+1}] \in \mathbb{R}^{3d} \tag{1}$$

where the inputs $x_{t-1}, x_t, x_{t+1}$ are one-hot vectors (really, just indices) into a word-representation matrix $L \in \mathbb{R}^{d \times |V|}$,[1] with each column $L_i$ as the vector for a particular word $i = x_t$. We then compute our prediction as:
$$h = \tanh(W x^{(t)} + b_1) \tag{2}$$

$$\hat{y} = \mathrm{softmax}(U h + b_2) \tag{3}$$

and evaluate with the cross-entropy loss

$$J(\theta) = -\sum_{k=1}^{5} y_k \log \hat{y}_k \tag{4}$$

where $y \in \mathbb{R}^5$ is a one-hot label vector. To compute the loss for the training set, we sum (or average) this $J(\theta)$ as computed with respect to each training example.
For this problem, we let $d = 50$ be the length of our word vectors, which are concatenated into a window of width $3 \times 50 = 150$. The hidden layer has a dimension of 100, and the output layer $\hat{y}$ has a dimension of 5.
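To make Equations (1)-(4) concrete, here is a minimal NumPy sketch of the forward pass for one window. The variable names (L, W, b1, U, b2) and the column-vector layout mirror the equations above; the actual starter code in nerwindow.py stores word vectors as rows, so treat this as an illustration rather than the required implementation.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ner_forward(L, W, b1, U, b2, window):
    """Forward pass for one context window (Equations (1)-(3)).

    L: (d, |V|) word-representation matrix (columns are word vectors)
    W: (100, 3d),  b1: (100,),  U: (5, 100),  b2: (5,)
    window: indices (x_{t-1}, x_t, x_{t+1})
    """
    x = np.concatenate([L[:, i] for i in window])   # Eq. (1): x^(t) in R^{3d}
    h = np.tanh(W.dot(x) + b1)                      # Eq. (2)
    y_hat = softmax(U.dot(h) + b2)                  # Eq. (3)
    return y_hat

def ce_loss(y_hat, label):
    """Cross-entropy loss (Eq. (4)) with the one-hot label given by its index."""
    return -np.log(y_hat[label])
```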
(a) Compute the gradients of $J(\theta)$ with respect to all the model parameters:

$$\frac{\partial J}{\partial U} \quad\quad \frac{\partial J}{\partial b_2} \quad\quad \frac{\partial J}{\partial W} \quad\quad \frac{\partial J}{\partial b_1} \quad\quad \frac{\partial J}{\partial L_i}$$

where

$$U \in \mathbb{R}^{5 \times 100} \quad b_2 \in \mathbb{R}^{5} \quad W \in \mathbb{R}^{100 \times 150} \quad b_1 \in \mathbb{R}^{100} \quad L_i \in \mathbb{R}^{50}$$
[1] In the code, we'll implement this with word vectors as rows for efficiency, and so you can access $L_i$ as L[i]. See nerwindow.py for more details.
In the spirit of backpropagation, you should express the derivative of activation functions (tanh, softmax) in terms of their function values (as with sigmoid in Assignment 1). This identity may be helpful:

$$\tanh(z) = 2\,\mathrm{sigmoid}(2z) - 1$$

Furthermore, you should express the gradients by using an "error vector" propagated back to each layer; this just amounts to putting parentheses around factors in the chain rule, and will greatly simplify your analysis. All resulting gradients should have simple, closed-form expressions in terms of matrix operations. (Hint: you've already done most of the work here as part of Assignment 1.)
(b) To keep parameters from exploding or becoming highly correlated, it is helpful to augment our cost function with a Gaussian prior: this tends to push parameter weights closer to zero, without constraining their direction, and often leads to classifiers with better generalization ability.

If we maximize log-likelihood (as with the cross-entropy loss above), then the Gaussian prior becomes a quadratic term[2] (L2 regularization):

$$J_{reg}(\theta) = \frac{\lambda}{2}\left[\sum_{i,j} W_{ij}^2 + \sum_{i',j'} U_{i'j'}^2\right] \tag{5}$$

and we optimize the combined loss function

$$J_{full}(\theta) = J(\theta) + J_{reg}(\theta) \tag{6}$$

Update your gradients from part (a) to include the additional term in this loss function (i.e. compute $\frac{\partial J_{full}}{\partial W}$, etc.).
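Continuing the illustrative sketch above (NumPy is already imported there; ner_forward and ce_loss are the hypothetical helpers defined earlier, and reg stands in for $\lambda$), the regularized cost of Equations (5)-(6) can be computed as:

```python
def full_cost(L, W, b1, U, b2, reg, window, label):
    """J_full = J + J_reg, with J_reg = (reg / 2) * (sum_ij W_ij^2 + sum_ij U_ij^2)
    as in Equations (5)-(6). Note that L, b1, and b2 are not regularized."""
    y_hat = ner_forward(L, W, b1, U, b2, window)
    j = ce_loss(y_hat, label)
    j_reg = 0.5 * reg * (np.sum(W ** 2) + np.sum(U ** 2))
    return j + j_reg
```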
(c) In order to avoid neurons becoming too correlated and ending up in poor local minima, it is often helpful to randomly initialize parameters. Empirically, the following has been found to work well:

Given a matrix $A$ of dimension $m \times n$, select values $A_{ij}$ uniformly from $[-\epsilon, \epsilon]$, where

$$\epsilon = \frac{\sqrt{6}}{\sqrt{m + n}} \tag{7}$$

Implement the function random_weight_matrix(m,n) in misc.py to perform this initialization. A cell is provided to test this code in part1-NER.ipynb.
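A minimal sketch of this initializer is shown below, assuming NumPy's global random state; the starter code in misc.py may expect a particular seed or random generator, so check its docstring rather than copying this verbatim.

```python
import numpy as np

def random_weight_matrix(m, n):
    """Return an (m, n) matrix with entries drawn uniformly from [-eps, eps],
    where eps = sqrt(6) / sqrt(m + n), as in Equation (7)."""
    eps = np.sqrt(6.0) / np.sqrt(m + n)
    A = np.random.uniform(low=-eps, high=eps, size=(m, n))
    assert A.shape == (m, n)
    return A
```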
(d) Open the notebook part1-NER.ipynb and follow the instructions to implement the NER window model, using the gradients you derived in (a) and (b). You'll also want to take a look at the example classifier in softmax_example.py for a guide on how to implement your model using our starter code.

Deliverables:

- Working implementation of the NER window model, in nerwindow.py. (We'll look at, and possibly run, this code for grading.)
- In your writeup (i.e. where you're writing the answers to the written problems), briefly state the optimal hyperparameters you found for your model: regularization, dimensions, learning rate (including time-varying schedules, such as annealing), SGD batch size, etc.
[2] Optional (not graded): The interested reader should prove that this is indeed the maximum-likelihood objective when we let $W_{ij} \sim \mathcal{N}(0, 1/\lambda)$ for all $i, j$.
(e) In the notebook, follow the instructions to plot learning curves for your best model, and for a comparison of learning rates.

Deliverables:

- Plot of the learning curve for your best model, in ner.learningcurve.best.png.
- Plot comparing learning rates $\alpha = 0.01$ and $\alpha = 0.1$, in ner.learningcurve.comparison.png.
(f) In the notebook, follow the instructions to evaluate your model's performance on the dev set, and compute predictions on the test data. Note that the test set has only dummy labels; we'll compare your predictions against the ground truth after you submit.

Note that you should compute F1 scores as a weighted average across all classes except "O", since this null class is not of interest for practical applications. The function eval_performance() in nerwindow.py will do this for you.

Deliverables:

- Report, in your writeup, the performance of your model on the dev set (as output by eval_performance()).
- List of predicted labels for the test set, one per line, in the file test.predicted.
1.1 Deep Networks: Probing Neuron Responses
Still in the part1-NER.ipynb notebook, follow the instructions to "probe" the responses of the hidden and output neurons in your network. You should report the following in your writeup:

(a) Top-10 word lists for the center word, on 5 hidden-layer neurons of your choice.

(b) Top-10 word lists for the center word, on model output for PER, ORG, LOC, and MISC.

(c) Top-10 word lists for the first word (preceding the center word), on model output for PER, ORG, LOC, and MISC.

For each, give a brief (no more than 2 sentences) comment on what the model appears to learn.
2 Recurrent Neural Networks: Language Modeling
In this section, you'll implement your first recurrent neural network (RNN[3]) and use it to build a language model.

Language modeling is a central task in NLP, and language models can be found at the heart of speech recognition, machine translation, and many other systems. Given words $x_1, \ldots, x_t$, a language model predicts the following word $x_{t+1}$ by modeling:

$$P(x_{t+1} = v_j \mid x_t, \ldots, x_1)$$

where $v_j$ is a word in the vocabulary.

Your job is to implement a recurrent neural network language model, which uses feedback information in the hidden layer to model the "history" $x_t, x_{t-1}, \ldots, x_1$.

[3] We'll start talking about recursive neural networks soon and also call these RNNs - but it turns out that recurrent nets are just a special case of recursive nets, so there's actually nothing ambiguous!

[4] This model is adapted from a paper by Tomáš Mikolov et al. from 2010: http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf

Formally, the model[4] is, for $t = 1, \ldots, n-1$:
$$h^{(t)} = \mathrm{sigmoid}\!\left(H h^{(t-1)} + L x^{(t)}\right) \tag{8}$$

$$\hat{y}^{(t)} = \mathrm{softmax}\!\left(U h^{(t)}\right) \tag{9}$$

$$P(x_{t+1} = v_j \mid x_t, \ldots, x_1) = \hat{y}^{(t)}_j \tag{10}$$

where $h^{(0)} = h_0 \in \mathbb{R}^{D_h}$ is some initialization vector for the hidden layer and $L x^{(t)}$ is the product of $L$ with the one-hot vector $x^{(t)}$ representing the index of the current word.[5] The parameters are:

$$H \in \mathbb{R}^{D_h \times D_h} \quad L \in \mathbb{R}^{D_h \times |V|} \quad U \in \mathbb{R}^{|V| \times D_h} \tag{11}$$

where $L$ is the input word representation matrix, $U$ is the output word representation matrix, and $D_h$ is the dimension of the hidden layer.
The output vector $\hat{y}^{(t)} \in \mathbb{R}^{|V|}$ is a probability distribution over the vocabulary, and we optimize the (unregularized) cross-entropy loss:

$$J^{(t)}(\theta) = -\sum_{j=1}^{|V|} y^{(t)}_j \log \hat{y}^{(t)}_j \tag{12}$$

where $y^{(t)}$ is the one-hot vector corresponding to the target word (which here is equal to $x_{t+1}$). As in Part 1, this is a point-wise loss, and we sum (or average) the cross-entropy loss across all examples in a sequence, and across all sequences[6] in the dataset, in order to evaluate model performance.
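As a concrete reference, here is a minimal NumPy sketch of Equations (8)-(10) and the loss (12) for a single timestep. The function name and argument layout are made up for illustration (columns of L as word vectors); the starter code in rnnlm.py stores L with rows as word vectors and organizes the computation differently.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)             # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step_loss(H, L, U, h_prev, word_idx, target_idx):
    """One timestep of the RNNLM: Equations (8)-(10) plus the loss (12).

    H: (Dh, Dh) hidden-to-hidden matrix
    L: (Dh, |V|) input word representations (columns are word vectors)
    U: (|V|, Dh) output word representations
    h_prev: (Dh,) previous hidden state h^(t-1)
    word_idx: index of the current word x^(t); target_idx: index of x_{t+1}
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(H.dot(h_prev) + L[:, word_idx])   # Eq. (8): L x^(t) is a column lookup
    y_hat = softmax(U.dot(h))                     # Eq. (9)
    loss = -np.log(y_hat[target_idx])             # Eq. (12) with one-hot y^(t)
    return h, y_hat, loss
```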
(a) Conventionally, when reporting performance of a language model, we evaluate on perplexity, which is defined as:

$$PP^{(t)}\!\left(\hat{y}^{(t)}, y^{(t)}\right) = \frac{1}{P(x^{pred}_{t+1} = x_{t+1} \mid x_t, \ldots, x_1)} = \frac{1}{\sum_{j=1}^{|V|} y^{(t)}_j \cdot \hat{y}^{(t)}_j} \tag{13}$$

i.e. the inverse probability of the correct word, according to the model distribution $P$. Show how you can derive perplexity from the cross-entropy loss (Hint: remember that $y^{(t)}$ is one-hot!), and thus argue that minimizing the (arithmetic) mean cross-entropy loss will also minimize the (geometric) mean perplexity across the training set. This should be a very short problem - not too perplexing!

For a vocabulary of $|V|$ words, what would you expect perplexity to be if your model predictions were completely random? Compute the corresponding cross-entropy loss for $|V| = 2000$ and $|V| = 10000$, and keep this in mind as a baseline.
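To make Equation (13) concrete (this is not the derivation you are asked to write), the per-timestep perplexity can be read directly off the prediction vector; the helper below is hypothetical and simply restates the formula in code.

```python
import numpy as np

def timestep_perplexity(y_hat, target_idx):
    """Perplexity at one timestep, per Equation (13): since y^(t) is one-hot,
    the denominator reduces to the predicted probability of the correct word."""
    return 1.0 / y_hat[target_idx]

# Toy usage with a made-up 4-word distribution:
y_hat = np.array([0.1, 0.6, 0.2, 0.1])
print(timestep_perplexity(y_hat, target_idx=1))   # 1 / 0.6 ≈ 1.67
```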
(b) As you did in Part 1, compute the gradients for all the model parameters at a single point in time $t$:

$$\frac{\partial J^{(t)}}{\partial U} \quad\quad \frac{\partial J^{(t)}}{\partial L_{x^{(t)}}} \quad\quad \left.\frac{\partial J^{(t)}}{\partial H}\right|_{(t)}$$

where $L_{x^{(t)}}$ is the column of $L$ corresponding to the current word $x^{(t)}$, and $\big|_{(t)}$ denotes the gradient for the appearance of that parameter at time $t$. (Equivalently, $h^{(t-1)}$ is taken to be fixed, and you need not backpropagate to earlier timesteps just yet - you'll do that in part (c).)

Additionally, compute the derivative with respect to the previous hidden layer value:

$$\frac{\partial J^{(t)}}{\partial h^{(t-1)}}$$

[5] As in Part 1, in the code it is more convenient to represent $L$ as a "tall" matrix and access rows as L[x[t]].

[6] We implement this for you in compute_mean_loss in rnnlm.py.
(c) Below is a sketch of the network at a single timestep:
[Figure: the network at a single timestep - the input $x^{(t)}$ and previous hidden state $h^{(t-1)}$ feed into $h^{(t)}$, which produces the output $\hat{y}^{(t)}$ and is passed on to the next timestep.]
Draw the "unrolled" network for 3 timesteps, and compute the backpropagation-through-time gradients:

$$\frac{\partial J^{(t)}}{\partial L_{x^{(t-1)}}} \quad\quad \left.\frac{\partial J^{(t)}}{\partial H}\right|_{(t-1)}$$

where $\big|_{(t-1)}$ denotes the gradient for the appearance of that parameter at time $(t-1)$. Because parameters are used multiple times in the feed-forward computation, we need to compute the gradient for each time they appear.

You should use the backpropagation rules from Lecture 5[7] to express these derivatives in terms of an error term $\delta^{(t)}$, such that you can re-use expressions for $t-2$, $t-3$, and so on.

Note that the true gradient with respect to a training example requires us to run backpropagation all the way back to $t = 0$. In practice, however, we generally truncate this and only backpropagate for a fixed number of timesteps (around 3-5).
(d) Given $h^{(t-1)}$, how many operations are required to perform one step of forward propagation to compute $J^{(t)}(\theta)$? How about backpropagation for a single step in time? For $\tau$ steps in time? Express your answer in big-O notation in terms of the dimensions $D_h$ and $|V|$ of the matrices $L$, $H$, and $U$ (Equation 11). What is the slow step?

Bonus: Given your knowledge of similar models (i.e. word2vec), suggest a way to speed up this part of the computation. Your approach can be an approximation, but you should argue why it's a good one. The paper "Extensions of recurrent neural network language model" (Mikolov et al., 2011) may be of interest here.
(e) Implement the above model in rnnlm.py. You'll need to implement just three functions, for now:

- __init__() (not much to do here)
- acc_grads()
- compute_seq_loss()

Data loaders and other starter code are provided in the part2-RNNLM.ipynb notebook, and you should use this to verify your implementation.
[7] http://cs224d.stanford.edu/lectures/CS224d-Lecture5.pdf
Be sure to read the instructions in the starter code carefully! They describe the data format and how to run your model over a sequence. In particular, you should sum the pointwise costs $J^{(t)}(\theta)$ over a sequence. When accumulating gradients, you should also add up all the gradients you compute for $J^{(t)}(\theta)$ for each target word in the sequence. (This is basically minibatch SGD.)
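As a rough sketch of this summing convention (the real signatures of compute_seq_loss and acc_grads in rnnlm.py will differ), the sequence loss just accumulates the per-timestep losses while carrying the hidden state forward, reusing the hypothetical rnn_step_loss helper sketched after Equation (12):

```python
def sequence_loss(H, L, U, h0, words, targets):
    """Sum the pointwise costs J^(t)(theta) over one sequence,
    propagating the hidden state forward through time."""
    h, total = h0, 0.0
    for word_idx, target_idx in zip(words, targets):
        h, _, loss = rnn_step_loss(H, L, U, h, word_idx, target_idx)
        total += loss
    return total
```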
(f) Train a model on the ptb-train data, consisting of the first 20 sections of the WSJ corpus of the Penn Treebank. For speed, we recommend using a small vocabulary of 2000-5000 words.

As in Part 1, you should tune your model to maximize generalization performance (minimize cross-entropy loss) on the dev set. We'll evaluate your model on an unseen, but similar, set of sentences.

Deliverables:

- In your writeup, include the best hyperparameters you used (training schedule, number of iterations, learning rate, backprop timesteps), and your perplexity score on the ptb-dev set.
- Model parameters saved as rnnlm.U.npy, rnnlm.H.npy, and rnnlm.L.npy; we'll use these to test your model.
(g) The networks that you've seen in Assignment 1 and in Part 1 of this assignment are discriminative models: they take data and make a prediction. The RNNLM model you've just implemented is a generative model, in that it actually models the distribution of the data sequence $x_1, \ldots, x_n$. This means that not only can we use it to evaluate the likelihood of a sentence, but we can actually use it to generate one!

In rnnlm.py, implement the generate_sequence() function. This should run the RNN forward in time, beginning with the index for the start token