CSCE 636: Deep Learning Assignment #4 solution

$30.00

Category:

Description

Rate this product

1. (100 points)(Coding Task) In this assignment, you will implement a recurrent neural network
(RNN) for language modeling using Pytorch. The task is to predict word xt+1 given words
x1, . . . , xt
:
P(xt+1 = vj |xt
, . . . , x1)
where vj is the j-th word in the vocabulary. The file“utils.py” gives an example of how to
generate the vocabulary. You can read it if interested. With the vocabulary, we can transform
a word xi
into a one-hot vector.
Our RNN model is, for t = 1, . . . , n − 1:
e
(t) = x
(t)L,
h
(t) = sigmoid(h
(t−1)H + e
(t)
I + b1),

(t) = softmax(h
(t)U + b2),
P¯(xt+1 = vj |xt
, . . . , x1) = ˆy
(t)
j
.
where the first line actually corresponds to a word embedding lookup operation. h
(0) is the
initial hidden state, ˆy
(t) ∈ R
|V | and its j-th entry is ˆy
(t)
j
.
Training parameters θ in this model are:
1
· L – embedding matrix which transforms words in the vocabulary into lower dimensional
word embedding.
· H – hidden transformation matrix.
· I – input transformation matrix which takes word embedding as input.
· U – output transformation matrix which projects hidden state into prediction vector.
· b1 – bias for recurrent layer.
· b2 – bias for projection layer.
(a) (10 points) Let the dimension of word embedding as d, the size of vocabulary as |V |, the
number of hidden units as D, please provide the size of each training parameter above.
(b) (30 points) To train the model, we use cross-entropy loss. For time step t (note that for
language model, we have loss for every time step), we have:
E
(t)
(θ) = CE(y
(t)
, yˆ
(t)
) = −
X
|V |
j=1
y
(t)
j
log(ˆy
(t)
j
),
where y
(t)
is the one-hot vector corresponding to the target word, which here is equal to
x
(t+1). For a sequence, we sum and average the loss of every time step.
PyTorch does not require the implementation of back-propagation, but you should
know the details. Compute the following gradients at a single time step t:
∂E(t)
∂U ,
∂E(t)
∂b2
,
∂E(t)
∂I

(t)
,
∂E(t)
∂H

(t)
,
∂E(t)
∂b1

(t)
,
∂E(t)
∂h(t−1)
where |(t) denotes the gradient with respect to time step t only. Note that we have
weight sharing in recurrent layer, so in practice, the back-propagation would update the
parameters according to the gradients across all time steps. Hint:
∂E(t)
∂U = (h
(t)
)
T
(ˆy
(t) − y
(t)
)
Make sure you understand this hint and you can use it directly in your answer.
(c) (10 points) To evaluate a language model, we use perplexity, which is defined as the
inverse probability of the target word according to the model prediction P¯:
P P(t)
(y
(T)
, yˆ
(t)
) = 1
P¯(x
pred
t+1 = xt+1|xt
, . . . , x1)
=
1
P|V |
j=1 y
(t)
j

(t)
j
.
Show the relationship between cross-entropy and perplexity.
(d) (50 points) Read the starting code very carefully and implement the above model by
completing RNNLM.py. You only need to code some parts related to the RNN model,
such as the architecture, loss function. Follow the instructions in the starting code to
understand which parts need to be filled in. When you are done, run python RNNLM.py.
The starting code supports GPU computation and also provides guidelines on how to
support CPU. It is very straightforward but note that CPU computation is slow! You
should change the hyperparameters to explore the best configurations. Submit the code
2
with your best hyperparameters. Also report the best testing perplexity score. This
model is a generative model. At the end of the run, it uses the trained language model
to generate sentences with the start words you provide. Be creative and report any
results that you think are interesting. Please summarize your results and observations.
2. (30 points) As introduced in class, the attention mechanism can be written into:
Attention(Q, K, V ) = softmax(QKT
)V.
By adding linear transformations on Q, K, and V , it turns into:
Attention(QWQ, KWK, V WV
) = softmax
QWQ(KWK)
T

V WV
.
Here, we set Q ∈ R
n×d
, WQ ∈ R
d×d
, K ∈ R
n×d
, WK ∈ R
d×d
, V ∈ R
n×d
, WV ∈ R
d×d
.
We’ve also covered the multi-head attention:
MultiHead(Q, K, V ) = Concat(head1, . . . , headh),
where
headi = Attention(QWQ
i
, KWK
i
, V WV
i
), i = 1, . . . , h.
Here, Q, K, V are the same as defined above. We set W
Q
i ∈ R
d× d
h , WK
i ∈ R
d× d
h , WV
i ∈ R
d× d
h .
(a) (15 points) Compute and compare the number of parameters between the single-head
and multi-head attention.
(b) (15 points) Compute and compare the amount of computation between the single-head
and multi-head attention, including the softmax step. Use the big-O notation to show
your results.
(Hint1: For (b), what we talked about in class may not be precise.)
(Hint2: Quoted from the paper (https://arxiv.org/pdf/1706.03762.pdf), “Due to
the reduced dimension of each head, the total computational cost is similar to that of
single-head attention with full dimensionality.”)
3. (20 points) For deep learning on graph data, we have learned GCNs, which aggregate information from neighboring nodes. The layer-wise forward-propagation operation of GCNs can
be expressed as
Xl+1 = σ(AXlWl
),
where Xl and Xl+1 are the input and output matrices of layer l, respectively, and A is the
adjacency matrix.
(a) (10 points) One limitation of this simple model is that multiplication with A means each
center node sums up feature vectors of all neighboring nodes but not the node itself.
How to fix this limitation with a simple modification on A?
(b) (10 points) Another limitation is that A is not normalized. The multiplication of A will
change the scale of feature vectors. Suppose we want to normalize A such that all rows
sum to one. How to fix this limitation with a simple modification on A?