DS-GA-1011 HW2 – Machine Translation


1 Recurrent Neural Network
In this problem, you will analytically demonstrate the vanishing and exploding gradient problem for Recurrent Neural Networks (RNNs). To show this, we will first expand the gradient of the loss function with respect to the parameters using the chain rule. Then, we will bound the norm of each individual partial derivative with matrix norm inequalities. The last step will be to collect all of the partial derivative terms and show how repeated multiplication of a single weight matrix can lead to vanishing or exploding gradients.
1.1 RNN Derivatives
Let S = (s_1, · · · , s_T) be a sequence of input word tokens and T be the sequence length. For a particular token s_t ∈ V with 1 ≤ t ≤ T, we obtain its corresponding word embedding x_t ∈ R^d by applying equation (1), where ϕ_one-hot is the one-hot encoding function and W_e is the word embedding matrix.
The RNN forward pass computes the hidden state h_t ∈ R^{d′} using equation (2). Here W_hh ∈ R^{d′×d′} is the recurrent weight matrix, W_ih ∈ R^{d′×d} is the input-to-hidden weight matrix, b_h ∈ R^{d′} is the hidden-state bias vector, and σ : R^{d′} → [−1, 1]^{d′} is the tanh activation function. W_hh, W_ih, and b_h are shared across all sequence indices t.
The output of the RNN, o_t ∈ R^k, at each sequence index t is given by equation (3), where W_ho ∈ R^{k×d′} is the hidden-to-output weight matrix and b_o ∈ R^k is the output bias vector. For an input sequence S = (s_1, · · · , s_T), we have a corresponding sequence of RNN hidden states H = (h_1, · · · , h_T) and outputs O = (o_1, · · · , o_T).

x_t = W_e ϕ_one-hot(s_t)    (1)
h_t = σ(W_hh h_{t−1} + W_ih x_t + b_h)    (2)
o_t = W_ho h_t + b_o    (3)

Let's now use this RNN model for classification. In particular, we take the last output o_T to be the logits (scores for each class), which we then convert to the class probability vector p_T ∈ [0, 1]^k by p_T = g(W_ho h_T + b_o), where g(·) is the softmax function and ∥p_T∥_1 = 1.
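For concreteness, here is a minimal NumPy sketch of the forward pass in equations (1)–(3). The toy dimensions and vocabulary size are assumptions for illustration only and are not tied to the homework's code skeleton.

```python
import numpy as np

# Toy dimensions (assumptions for illustration only)
V_size, d, d_prime, k = 10, 4, 5, 3

rng = np.random.default_rng(0)
W_e  = rng.normal(size=(d, V_size))        # word embedding matrix
W_hh = rng.normal(size=(d_prime, d_prime)) # recurrent weight matrix
W_ih = rng.normal(size=(d_prime, d))       # input-to-hidden weight matrix
b_h  = np.zeros(d_prime)                   # hidden-state bias
W_ho = rng.normal(size=(k, d_prime))       # hidden-to-output weight matrix
b_o  = np.zeros(k)                         # output bias

def one_hot(token_id, vocab_size=V_size):
    v = np.zeros(vocab_size)
    v[token_id] = 1.0
    return v

def rnn_forward(token_ids):
    """Apply equations (1)-(3) over a token sequence; return hidden states and outputs."""
    h = np.zeros(d_prime)                              # h_0
    hidden_states, outputs = [], []
    for s_t in token_ids:
        x_t = W_e @ one_hot(s_t)                       # eq. (1)
        h   = np.tanh(W_hh @ h + W_ih @ x_t + b_h)     # eq. (2)
        o_t = W_ho @ h + b_o                           # eq. (3)
        hidden_states.append(h)
        outputs.append(o_t)
    return hidden_states, outputs

H, O = rnn_forward([1, 7, 3])
p_T = np.exp(O[-1]) / np.exp(O[-1]).sum()              # softmax of the last output gives p_T
```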
1. (1 point, written) Write down the per-example cross-entropy loss ℓ(y, p_T) for the classification task. Here y ∈ {0, 1}^k is a one-hot vector of the label and p_T is the class probability vector, where p_T[i] = p(y[i] = 1 | S) for i = 1, . . . , k. ([i] denotes the i-th entry of the corresponding vector.)

2. To perform backpropagation, we need to compute the derivative of the loss with respect to each parameter. Without loss of generality, let's consider the derivative with respect to a single parameter w = W_hh[i, j], where [i, j] denotes the (i, j)-th entry of the matrix. By the chain rule, we have

∂ℓ/∂w = (∂ℓ/∂o_t) (∂o_t/∂h_t) (∂h_t/∂w)    (4)

Note that the first two derivatives in (4) are easy to compute, so let's focus on the last term ∂h_t/∂w. During the lecture, we have shown that

∂h_t/∂w = Σ_{i=1}^{t} (∂h_t/∂h_i) (∂h_i^+/∂w)    (5)

Here ∂h_i^+/∂w denotes the “immediate” gradient, where h_{i−1} is taken as a constant.
(a) (1 point, written) Give an expression for ∂h_i^+/∂w.

(b) (2 points, written) Expand the gradient vector ∂h_t/∂h_i using the chain rule as a product of partial derivatives of one hidden state with respect to the previous hidden state. You do not need to explicitly do differentiations beyond that.

3. (2 points, written) Now let's further expand one of the partial derivatives from the previous question. Write down the Jacobian matrix ∂h_{i+1}/∂h_i using the rules of differentiation. You can directly use σ′ for the derivative of the activation function in the expression.

1.2 Bounding Gradient Norm
To determine if the gradient will vanish or explode, we need a notion of magnitude. For the Jacobian matrix, we can use the induced matrix norm (or operator norm). For this question, we use the spectral norm ∥A∥_2 = √(λ_max(A⊤A)) = s_max(A) for a matrix A ∈ R^{m×n}. Here λ_max(A⊤A) denotes the maximum eigenvalue of the matrix A⊤A and s_max(A) denotes the maximum singular value of the matrix A. You can learn more about matrix norms at this Wikipedia entry.
Now, to determine if the gradient ∂ℓ/∂w will vanish or explode, we can focus on ∥∂h_t/∂h_i∥. Note that if ∥∂h_t/∂h_i∥ vanishes or explodes, ∥∂ℓ/∂w∥ also vanishes or explodes based on (4) and (5).
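As a quick numerical illustration of why the spectral norm matters here (not part of the graded questions), the sketch below computes ∥A∥_2 via the SVD and shows how repeatedly multiplying a vector by a matrix whose largest singular value is below or above 1 shrinks or grows the vector's norm. A symmetric matrix is used so that the spectral norm equals the spectral radius, which keeps the s_max^t behavior exact; the sizes and the 0.9/1.1 scales are arbitrary choices.

```python
import numpy as np

def spectral_norm(A):
    # ||A||_2 equals the largest singular value of A
    return np.linalg.svd(A, compute_uv=False)[0]

rng = np.random.default_rng(0)
base = rng.normal(size=(5, 5))
base = (base + base.T) / 2          # symmetric, so spectral norm = spectral radius

# Rescale so the largest singular value is 0.9 (contracting) or 1.1 (expanding)
W_small = 0.9 * base / spectral_norm(base)
W_large = 1.1 * base / spectral_norm(base)

v = rng.normal(size=5)
v_small, v_large = v.copy(), v.copy()
for _ in range(50):
    v_small = W_small @ v_small     # norm shrinks roughly like 0.9^t
    v_large = W_large @ v_large     # norm grows roughly like 1.1^t

print(np.linalg.norm(v_small))      # tiny  (vanishing regime)
print(np.linalg.norm(v_large))      # large (exploding regime)
```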
1. (2 points, written) Given the mathematical form of the Jacobian matrix ∂h_{i+1}/∂h_i we derived earlier, we can now bound the norm of the Jacobian with the following matrix norm inequality

∥AB∥_2 ≤ ∥A∥_2 · ∥B∥_2    (6)

for matrices A, B with matched shapes. Write down a bound for ∥∂h_i/∂h_{i−1}∥_2.

2. (4 points, written) Now we have all the pieces we need. Derive a bound on the gradient norm ∥∂h_t/∂h_i∥. Explain how the magnitude of the maximum singular value of W_hh can lead to either vanishing or exploding gradient problems. [HINT: You can use the fact that for the tanh activation function σ(·), the derivative σ′(·) is always less than or equal to 1.]

3. (1 point, written) Propose one way to get around the vanishing and exploding gradient problem.

2 Machine Translation
The goal of this homework is to build a machine translation system using sequence-to-sequence transformer models (https://arxiv.org/abs/1706.03762). More specifically, you will build a system that translates German to English using the Multi30k dataset (https://arxiv.org/abs/1605.00459). You are provided with a code skeleton, which clearly marks where you need to fill in code for each sub-question.
First, go through the file README.md to set up the environment required for the class.
2.1 Attention
Transformers use scaled dot-product attention: given a set of queries Q (each of dimension d_k), a set of keys K (also each of dimension d_k), and a set of values V (each of dimension d_v), the output is a weighted sum of the values. More specifically,

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (7)

Note that each of Q, K, V is a matrix of vectors packed together.
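As a point of reference, here is a minimal PyTorch sketch of equation (7), including optional masking and dropout of the attention weights. It is an illustrative implementation under assumed tensor shapes and a dropout-module argument, not the attention() function from layers.py.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None, dropout=None):
    """query/key: (..., seq_len, d_k), value: (..., seq_len, d_v)."""
    d_k = query.size(-1)
    # Raw attention scores, scaled by 1/sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a very negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    if dropout is not None:
        weights = dropout(weights)          # dropout applied to the attention weights
    return torch.matmul(weights, value), weights
```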
1. (2 points, written) The above function is called 'scaled' attention due to the scaling factor 1/√d_k. The original transformer paper mentions that this is needed because dot products between keys and queries get large with larger d_k.
For a query q and a key k, both of dimension d_k and with each component being an independent random variable with mean 0 and variance 1, compute the mean and variance (with steps) of the dot product q · k to demonstrate the point. (A small simulation illustrating this effect is sketched after this list.)

2. (2 points, coding) Implement the above scaled dot-product attention in the attention() function
present in layers.py. You can test the implementation after the next part.

3. (2 points, coding) In this part, you will modify the attention() function by making use of the parameters mask and dropout, which are inputs to the function. The mask indicates positions where the attention values should be zero (e.g., when we have padded a sentence of length 5 to length 10, we do not want to attend to the extra tokens). dropout should be applied to the attention weights for regularization.
To test the implementation against some unit tests, run python3 test.py --attention.

4. (3 points, coding) Instead of a single attention function, transformers use a multi-headed attention function. For the original keys, queries, and values (each of dimension, say, d_model), we use h different projection matrices to obtain queries, keys, and values of dimensions d_k, d_k, and d_v respectively. Implement the function MultiHeadedAttention() in layers.py. To test the implementation against some unit tests, run python3 test.py --multiheaded_attention.
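The following short NumPy simulation is an illustrative check for question 1 above (not a substitute for the written derivation): it samples random queries and keys with zero-mean, unit-variance components and shows that the empirical variance of q · k grows with d_k, roughly matching d_k, while the mean stays near 0. The sample count and the tested dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20_000

for d_k in (4, 64, 256):
    q = rng.normal(0, 1, size=(n_samples, d_k))
    k = rng.normal(0, 1, size=(n_samples, d_k))
    dots = (q * k).sum(axis=1)          # dot product q . k for each sample
    # The printed mean stays near 0; the printed variance is close to d_k
    print(d_k, dots.mean().round(3), dots.var().round(1))
```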

2.2 Positional Encoding
Since the underlying blocks in a transformer (namely the attention and feed-forward layers) do not encode any information about the order of the input tokens, transformers use 'positional encodings', which are added to the input embeddings. If d_model is the dimension of the input embeddings, pos is the position, and i is the dimension index, then the encoding is defined as:

PE(pos, 2i)   = sin(pos / 10000^{2i/d_model})    (8)
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})    (9)
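As an illustration of equations (8)-(9), here is a small NumPy sketch that builds the positional encoding table for an assumed maximum length and d_model. It mirrors the formulas directly rather than the vectorized log-space form often used in practice, and it is not the PositionalEncoding() class from utils.py.

```python
import numpy as np

def positional_encoding_table(max_len, d_model):
    """Return an array of shape (max_len, d_model) following equations (8)-(9)."""
    pe = np.zeros((max_len, d_model))
    for pos in range(max_len):
        for dim in range(0, d_model, 2):            # dim plays the role of 2i
            angle = pos / (10000 ** (dim / d_model))
            pe[pos, dim] = np.sin(angle)            # even dimensions: sine
            if dim + 1 < d_model:
                pe[pos, dim + 1] = np.cos(angle)    # odd dimensions: cosine
    return pe

pe = positional_encoding_table(max_len=50, d_model=16)
# Each row pe[pos] would be added to the embedding of the token at position pos.
```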
1. (2 points, written) Since the objective of the positional encoding is to add information about the position, can we simply use PE_pos = sin(pos) as the positional encoding for position pos? Why or why not?

2. (2 points, coding) Implement the above positional encoding in the function PositionalEncoding() in the file utils.py. To test the implementation against some unit tests, run python3 test.py --positional_encoding.

2.3 Training
1. (2 points, written) The above questions complete the missing parts of the training code, and we can now train a machine translation system!
Use the command python3 main.py to train your model. For the purpose of this homework, you are not required to tune any hyperparameters. You should submit the generated out_greedy.txt file containing the outputs. You must obtain a BLEU score of at least 35 for full points (by default we are using BLEU-4 for this and all subsequent questions).
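For reference, corpus-level BLEU-4 between a file of model outputs and a reference file can be sanity-checked with a standard library such as sacrebleu. This is only an illustrative sketch: the reference file name is hypothetical, and the homework's own evaluation script may compute BLEU differently.

```python
import sacrebleu

# Hypothetical file names; substitute the generated output file and the reference file.
with open("out_greedy.txt") as f:
    hypotheses = [line.strip() for line in f]
with open("references.txt") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the system outputs and a list of reference streams (BLEU-4 by default).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)   # a score of 35 or above would meet the full-points threshold
```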

2.4 Decoding & Evaluation
In the previous question, the code uses the default greedy_decode() to decode the output. In practice, people use algorithms such as beam search decoding, which has been shown to give better-quality outputs; a generic sketch of beam search is shown below.
(Note: in the following questions, use a model trained with the default, i.e. the given, hyperparameters.)
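The sketch below is a minimal, generic beam search over a hypothetical step_fn that returns next-token log-probabilities for a partial hypothesis. It is meant only to illustrate the idea (length normalization and batching are omitted) and does not match the skeleton or signature of beam_search() in utils.py.

```python
import torch

def beam_search_sketch(step_fn, bos_id, eos_id, beam_size=2, max_len=50):
    # step_fn(tokens) -> 1-D tensor of log-probabilities over the vocabulary for the
    # next token, given the partial hypothesis `tokens` (a hypothetical interface).
    beams = [([bos_id], 0.0)]              # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
                continue
            log_probs = step_fn(tokens)                    # shape: (vocab_size,)
            topk = torch.topk(log_probs, beam_size)
            for lp, idx in zip(topk.values, topk.indices):
                candidates.append((tokens + [idx.item()], score + lp.item()))
        if not candidates:
            break
        # Keep only the beam_size highest-scoring partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    # Return the best hypothesis found (length normalization omitted for brevity)
    return max(finished, key=lambda c: c[1])[0]
```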
1. (2 points, written) In the file utils.py you will notice a function subsequent_mask(). What does that function do, and why is it required in our model?

2. (5 points, coding) Implement the beam_search() function in utils.py. We have provided the main skeleton for this function, and you are only required to fill in the important parts (more details in the code). You can run the code using the arguments --beam_search and --beam_size. You should submit the generated file out_beam.txt when beam_size = 2.
To test the implementation against some unit tests, run python3 test.py --beam_search.

3. (3 points, written) For the model trained in question 1.3, plot the BLEU score as a function of beam
size. You should plot the output from beam size 1 to 5. Is the trend as expected? Explain your answer.

4. (2 points, written) You might notice that some of the sentences contain the ‘⟨unk⟩’ token which denotes
a word not in the vocabulary. For systems such as Google Translate, you might not want such tokens
in the outputs seen by the user. Describe a potential way to avoid (or reduce) the occurrence of these
tokens in the output.

5. (2 points, written) In this homework, you have been using the BLEU score as your evaluation metric. Consider an example where the reference translation is “I just went to the mall to buy a table.”, and consider two possible model generations: “I just went to the mall to buy a knife.” and “I just went to the mall to buy a desk.”. Which one will BLEU score higher? Suggest a potential fix if it does not score the intended one higher.