CS 224d: Assignment #3 solution


1 RNNs (Recursive Neural Networks)
Welcome to SAIL (Stanford Artificial Intelligence Lab): Congrats! You have just been given a Research Assistantship in SAIL. Your task is to discover the power of Recursive Neural Networks (RNNs). So you plan an experiment to show their effectiveness on Positive/Negative Sentiment Analysis. In this part, you will derive the forward and backpropagation equations, implement them, and test the results.
We will assume the RNN has one ReLU layer and one softmax layer, and uses Cross Entropy loss as its cost function. We follow the parse tree given from the leaf nodes up to the top of the tree and evaluate the cost at each node. During backprop, we follow the exact opposite path. Figure 1 shows an example of such an RNN applied to a simple sentence, “I love this assignment”. These equations are sufficient to explain our model:
CE(y, ŷ) = −∑_i y_i log(ŷ_i)
where y is the one-hot label vector, and ŷ is the predicted probability vector for all classes. In our case, y ∈ ℝ⁵ and ŷ ∈ ℝ⁵ to represent our 5 sentiment classes: Really Negative, Negative, Neutral, Positive, and Really Positive. Furthermore,
h^(1) = max(W^(1) [h^(1)_Left ; h^(1)_Right] + b^(1), 0)
ŷ = softmax(U h^(1) + b^(s))
where h^(1)_Left is the output of the layer beneath it on the left (and could be a word vector), and the same for h^(1)_Right but coming from the right side. Assume L_i ∈ ℝ^d ∀i ∈ [1…|V|], W^(1) ∈ ℝ^{d×2d}, b^(1) ∈ ℝ^d, U ∈ ℝ^{5×d}, b^(s) ∈ ℝ^5.
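To make the definitions above concrete, here is a minimal NumPy sketch of a single node's forward pass: ReLU layer over the concatenated children, then softmax, then cross-entropy loss. The hidden size d=4 and the random initialization are illustrative assumptions; the assignment's code base uses its own parameter names and shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # illustrative hidden size; the assignment calls this wvecdim

# Illustrative parameter shapes matching the definitions above.
W1 = rng.standard_normal((d, 2 * d)) * 0.1   # W^(1) in R^{d x 2d}
b1 = np.zeros(d)                             # b^(1) in R^d
U = rng.standard_normal((5, d)) * 0.1        # U in R^{5 x d}
bs = np.zeros(5)                             # b^(s) in R^5

def node_forward(h_left, h_right, y):
    """Forward pass at one tree node: ReLU layer, softmax, and CE loss."""
    h = np.maximum(W1 @ np.concatenate([h_left, h_right]) + b1, 0)  # ReLU
    scores = U @ h + bs
    scores -= scores.max()                   # numerical stability
    y_hat = np.exp(scores) / np.exp(scores).sum()
    loss = -np.sum(y * np.log(y_hat))        # cross-entropy against one-hot y
    return h, y_hat, loss

y = np.eye(5)[3]  # one-hot "Positive" label
h, y_hat, loss = node_forward(rng.standard_normal(d), rng.standard_normal(d), y)
```

At an internal node, h_left and h_right would themselves be the h outputs of the children; at a leaf they are word vectors L_i.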
Figure 1: RNN (Recursive Neural Network) example
(a) Follow the example parse tree in Figure 1, in which we are given a parse tree and truth labels y for each node. Starting with Node 1, then Node 2, and finishing with Node 3, write the update rules for W^(1), b^(1), U, b^(s), and L after the evaluation of ŷ against our truth, y. This means at each node we evaluate:
δ = ŷ − y
as our first error vector, and we backpropagate that error through the network, aggregating the gradient at each node for:
∂J/∂U, ∂J/∂b^(s), ∂J/∂W^(1), ∂J/∂b^(1), ∂J/∂L_i
Points will be deducted if you do not express the derivative of activation functions (ReLU) in terms of their function values (as with Assignments 1 and 2) or do not express the gradients using an “error vector” (δ^(i)) propagated back to each layer. Tip on notation: δbelow and δabove should be used for error that is being sent down to the next node or that came from an above node. This will help you think about the problem in the right way. Note you should not be updating gradients for L_i at Node 1, but error should be leaving Node 1 for sure!
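As a sanity check on the shapes involved, a single-node backward step could be sketched as below. This is an illustrative outline, not the required derivation: function and variable names are made up, and the ReLU derivative is expressed via the function value h > 0 as the assignment requires.

```python
import numpy as np

def node_backward(delta, h, h_children, U, W1, delta_above=None):
    """Gradients at one node, given delta = y_hat - y from its softmax.

    h is this node's hidden vector h^(1); h_children is the concatenated
    [h_left; h_right]; delta_above is error arriving from the parent
    (None at the root).
    """
    grad_U = np.outer(delta, h)
    grad_bs = delta
    # Error at h^(1): from this node's softmax plus the error from above,
    # gated by the ReLU derivative expressed via its function value (h > 0).
    delta_h = U.T @ delta
    if delta_above is not None:
        delta_h = delta_h + delta_above
    delta_h = delta_h * (h > 0)
    grad_W1 = np.outer(delta_h, h_children)
    grad_b1 = delta_h
    # Error sent below, split between the left and right children.
    delta_below = W1.T @ delta_h
    d = len(h)
    return grad_U, grad_bs, grad_W1, grad_b1, delta_below[:d], delta_below[d:]
```

In the full tree traversal these gradients would be accumulated across nodes, and the returned left/right deltas become delta_above for the children.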
(b) Implementation time! Now that you have a feel for how to train this network:
• (a) Download, unzip, and have a look at the code base.
• (b) From the command line, run ./setup.sh. This should download the labeled parse-tree dataset and set up the environment for you (model folders, etc.).
• (c) Now let's peruse the code base. You should start with runNNet.py to get a grasp of how the network is run and tested. You should also take a peek at tree.py to understand what the Node class is, how we traverse a parse tree, and what fields we update during the forward and backward passes (you need to know what hActs is!). Next, take a look at run.sh. This shell script contains all the parameters needed to train the model. It is important to get this right. You should update these environment parameters here and only here when you are ready.
• (d) Finally, open rnn.py. There are two functions left for you to implement, forwardProp and backProp. Implement them!
• (e) When you are ready to test your implementation, run python rnn.py to perform a gradient check. You can make use of pdb and pdb.set_trace as you code to give you insight into what's happening under the hood! You are expected to take the time to fully understand how this code base functions! This way you can perhaps use it as a starting point for your projects (or any other codebase one might give you). Also, if you are unfamiliar with pdb.set_trace, definitely take 5 minutes and learn about it!
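A gradient check compares your analytic gradients against a centered finite-difference estimate of the loss. A generic sketch of the idea (this is not the code base's own gradient-check routine, just the underlying technique):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Centered finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + eps
        f_plus = f(x)
        x[i] = old - eps
        f_minus = f(x)
        x[i] = old  # restore before moving on
        grad[i] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# Example: f(x) = sum(x^2) has analytic gradient 2x.
x = np.array([1.0, -2.0, 3.0])
g = numerical_grad(lambda v: np.sum(v ** 2), x)
```

If your backProp is correct, the analytic gradient and this numerical estimate should agree to several decimal places for every parameter.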
(c) Test time! From the command line, run ./run.sh. This will train the model with the parameters you specified in run.sh and produce a pickled neural network that we can test later via ./test.sh. Note the training could take an hour or more depending on how many epochs you allow. Once the training is done, you should open test.sh, change the test data type to dev, and run ./test.sh from the command line. This should output your dev set accuracy.
Your task here is to produce four plots.
• (a) First, provide a plot showing the training error and dev error over epochs for wvecdim=30. You should see a point where the dev error actually begins to increase; if you do not, you probably did not run for enough epochs. Note the number of epochs that is best on your dev set.
• (b) Provide a one-sentence intuition why this happens.
• (c) Next, produce 2 confusion matrices with the above environment parameters (and your optimal number of epochs), one for the training set and one for the dev set. The function makeconf might be handy. (Despite how much Ian hates Jet colorbars!)
A confusion matrix is a great way to see what you are getting wrong. In Figure 2, we see that the
model is very accurate for matching a 1 label with a 1. However, it confuses a 4 for a 0 very often.
You should take a second to make sense of this plot.
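If you want to see how such a matrix is assembled before plotting it, a generic sketch (the label and prediction arrays here are illustrative, not real model output):

```python
import numpy as np

def confusion_matrix(truth, guesses, num_classes=5):
    """Count matrix with truth down the rows and model guesses across the
    columns, matching the layout of Figure 2."""
    conf = np.zeros((num_classes, num_classes), dtype=int)
    for t, g in zip(truth, guesses):
        conf[t, g] += 1
    return conf

# Tiny illustrative example: two 0-labeled nodes, one 1, two 4s.
truth = [0, 0, 1, 4, 4]
guesses = [0, 4, 1, 0, 4]
conf = confusion_matrix(truth, guesses)
```

A perfect model would put all counts on the diagonal; off-diagonal mass, like the 4-vs-0 confusion described above, shows you exactly which classes the model mixes up.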
Figure 2: Confusion matrix with truth down the y axis and our model's guess across the x axis
• (d) Finally, provide a plot of dev accuracy vs. wvecdim with the same number of epochs as above. Reasonable values for wvecdim would be 5, 15, 25, 35, 45. Note that to generate all of the data for these plots, you must train a number of models, which will take some time, so it would NOT be wise to run one at a time; rather, let many of them train overnight in different shells. These are the real pains data scientists must deal with! If you are running this on myth, corn, or your own server somewhere, look up the Linux command screen. It will help tremendously.
2 2-Layer Deep RNNs
Let's go deeper: Your advisor is impressed with your results. He mentions in your conversation that perhaps the model is just not expressive enough. This gets you thinking. What if we add a layer between the first layer and the softmax layer to help increase accuracy?
Make the same assumptions as for the first RNN, but now we have one more layer such that W^(2) ∈ ℝ^{dmiddle×d}, b^(2) ∈ ℝ^{dmiddle}, and U ∈ ℝ^{5×dmiddle}. The equations below should be sufficient to explain the model.
h^(1) = max(W^(1) [h^(1)_Left ; h^(1)_Right] + b^(1), 0)
h^(2) = max(W^(2) h^(1) + b^(2), 0)
ŷ = softmax(U h^(2) + b^(s))
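Relative to part 1, the only change to the forward pass is the extra ReLU layer between h^(1) and the softmax. Schematically (the shapes d=4 and dmiddle=3 and the random initialization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, dmiddle = 4, 3  # illustrative wvecdim and middleDim

W2 = rng.standard_normal((dmiddle, d)) * 0.1  # W^(2) in R^{dmiddle x d}
b2 = np.zeros(dmiddle)                        # b^(2) in R^{dmiddle}
U = rng.standard_normal((5, dmiddle)) * 0.1   # U in R^{5 x dmiddle}
bs = np.zeros(5)                              # b^(s) in R^5

def top_layers(h1):
    """The layers above h^(1): the second ReLU layer, then softmax."""
    h2 = np.maximum(W2 @ h1 + b2, 0)
    scores = U @ h2 + bs
    scores -= scores.max()                    # numerical stability
    y_hat = np.exp(scores) / np.exp(scores).sum()
    return h2, y_hat

h2, y_hat = top_layers(rng.standard_normal(d))
```

During backprop the extra layer simply adds one more ReLU-gated error vector between the softmax delta and δbelow.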
(a) Perform the same analysis for the example in Figure 3. Write the updates starting at Node 1, then Node 2, and finally Node 3 for:
∂J/∂U, ∂J/∂b^(s), ∂J/∂W^(1), ∂J/∂b^(1), ∂J/∂W^(2), ∂J/∂b^(2), ∂J/∂L_i
Points will be deducted if you do not express the derivative of activation functions (ReLU) in terms of their function values (as with Assignments 1 and 2) or do not express the gradients using an “error vector” (δ^(i)) propagated back to each layer. Tip on notation: δbelow and δabove should be used for error that is being sent down to the next node or that came from an above node. This will help you think about the problem in the right way.
Figure 3: 2-Layer RNN example
(b) Implementation time again!
• (a) First, open rnn2deep.py. There are two functions left for you to implement, forwardProp and backProp.
• (b) When you are ready, run python rnn2deep.py to run the gradient check. You can make use of pdb and pdb.set_trace as you code to give you insight into what's happening under the hood!
• (c) Now, let's take a look at run.sh and the middleDim parameter. This is what you will use to adjust the size of the middle layer in your 2-layer network. Also, remember to update model to be RNN2 instead of RNN.
(c) Test time! From the command line, run ./run.sh. This will train the model with the parameters you specified in run.sh and produce a pickled neural network that we can test later. Note the training could take an hour or more. Once the training is done, you should open test.sh, change the test data type to dev and the model to RNN2, and run ./test.sh. This should give you your dev set accuracy.
Your task here is to produce four more plots.
• (a) First, a plot showing the training error and dev error over epochs for wvecdim=30 and middleDim=30.
• (b) Second, a plot showing a confusion matrix on the train set and another on the dev set, using the number of epochs from above. You might find makeconf in runNNet.py useful.
• (c) Provide a two-sentence intuition for why the model is doing better or worse than the first RNN.
• (d) Next, provide a plot of dev accuracy vs. middleDim. Reasonable values for middleDim would be 5, 15, 25, 35, 45, with wvecdim=30 and a constant number of epochs (found in part (c)(a) above). Note that to generate all of the data for these plots, you must train a number of models, which will take some time, so it would NOT be wise to run one at a time; rather, let many of them train overnight in different shells.
(d) Suggest a change! Take a moment and observe the errors that your model is making. Does your model do better on Negative scores? Positive scores? Suggest a change that would correct for that error. One example change would be to add a “depth level index” to the model input, making the input to the neural network ∈ ℝ^{2d+1}, so the model knows where in the tree it is; i.e., at the base level this index would be 0, at the next level it would be 1, etc. Another example change could be to add ANOTHER layer. What might that do for the model? A final suggestion would be to make the error from above flow into h^(2) rather than h^(1). This question is asking you to dig deep into the data, think outside the box, and come up with a scheme that will boost performance. A complete answer will be original, based on data observation, and should have a reasonable expectation of boosting performance.
(e) Extra Credit: Implement that change! Take the change you suggested, open the rnn_changed.py file, and implement a new model called RNN3. Run the model and optimize it, as done in the previous problems of this assignment. Report your findings with a train and dev accuracy plot, and comment on them.
(f) Extra Credit: Implement Dropout on the softmax layer. Dropout is a regularization technique where we randomly “drop” nodes in the neural network during training. When we apply our input, we randomly select certain nodes not to fire, setting their output to 0 even though it would otherwise have been nonzero. We just need to remember at backprop time NOT to update those weights. Hinton describes this method as training 2^N separate neural networks (if N is the number of nodes in the network) and taking the average of the results. This method has been shown to improve a neural network's performance by ensuring that each node is useful on its own. Apply this method to the RNN2 model and report your results as you did in the previous portions of the assignment.
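The dropping-and-masking step described above can be sketched as follows. This uses the "inverted" dropout variant, which scales surviving activations at train time so test-time code is unchanged; the assignment does not prescribe a variant, so treat the scaling and the function name as assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout_forward(h, p_drop=0.5, train=True):
    """Randomly zero units of h during training.

    Surviving units are scaled by 1/(1 - p_drop) so the expected activation
    is unchanged ("inverted" dropout). The returned mask should be reused at
    backprop time so that dropped units contribute no gradient, i.e. their
    weights are not updated for this example.
    """
    if not train:
        return h, np.ones_like(h)
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask, mask

h = np.ones(1000)
h_dropped, mask = dropout_forward(h, p_drop=0.5)
```

At backprop time, multiplying the incoming error vector by the same mask zeroes the gradient of every dropped unit, which is exactly the "do not update those weights" behavior described above.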
3 Extra Credit: Recursive Neural Tensor Networks
Derive the gradients and updates as done in parts 1 and 2 of this assignment. Then, using the starter code in rntn.py, implement the model detailed in Richard's paper “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”:
Socher, Richard, et al. “Recursive deep models for semantic compositionality over a sentiment treebank.” Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2013. pp. 1631–1642.
https://www-nlp.stanford.edu/pubs/SocherEtAl_EMNLP2013.pdf
Report your findings in similar fashion to the previous two problems.