Computer Science 401 Homework Assignment #2 Neural Machine Translation solution

$29.99

Original Work ?
Category: You will Instantly receive a download link for .ZIP solution file upon Payment

Description

5/5 - (5 votes)

Overview 1.1 Canadian Hansards The main corpus for this assignment comes from the official records (Hansards) of the 36th Canadian Parliament, including debates from both the House of Representatives and the Senate. This corpus is available at /u/cs401/A2/data/Hansard/ and has been split into Training/ and Testing/ directories. This data set consists of pairs of corresponding files (*.e is the English equivalent of the French *.f) in which every line is a sentence. Here, sentence alignment has already been performed for you. That is, the n th sentence in one file corresponds to the n th sentence in its corresponding file (e.g., line n in fubar.e is aligned with line n in fubar.f). Note that this data only consists of sentence pairs; many-to-one, many-to-many, and one-to-many alignments are not included. 1.2 Seq2seq We will be implementing a simple seq2seq model, without attention, with single-headed attention, and with multi-headed attention based largely on the course material. You will train the models with teacherforcing and decode using beam search. We will write it in PyTorch version 1.9.1 (https://pytorch.org/ docs/1.9.1/), and Python version 3.9.7, which are the versions installed on the teach.cs servers. For those unfamiliar with PyTorch, we suggest you first read the PyTorch tutorial (https://pytorch.org/ tutorials/beginner/deep_learning_60min_blitz.html). 1.3 Tensors and batches PyTorch, like many deep learning frameworks, operate with tensors, which are multi-dimensional arrays. When you work in PyTorch, you will rarely if ever work with just one bitext pair at a time. You’ll instead be working with multiple sequences in one tensor, organized along one dimension of the batch. This means that a pair of source and target tensors F and E actually correspond to multiple sequences F = (F (m) 1:S(m) )m∈[1,M] , E = (E (m) 1:T(m) )m∈[1,M] . We work with batches instead of individual sequences because: a) backpropagating the average gradient over a batch tends to converge faster than single samples, and b) sample computations can be performed in parallel. For example, if we want to multiply source sequences F (m) and F (m+1) with an embedding matrix W, we can tell one CPU core to compute the result for F (m) and another for F (m+1), halving the overall time it would take to multiply them independently. Learning to work with tensors can be difficult at first, but is integral to efficient computation. We suggest you read more about it in the NumPy docs (https://numpy.org/doc/stable/user/basics.broadcasting.html), which PyTorch borrows for tensors. Copyright © 2022 University of Toronto. All rights reserved. 1 1.4 Differences from the lectures There are three changes to the seq2seq architectures that we make for this assignment. First, instead of scaled dot-product attention scores score(a, b) = |a| −1/2 ⟨a, b⟩, we’ll use the cosine similarity between vectors a and b: score(a, b) = ⟨a, b⟩ max (∥a∥2 ∥b∥2 , ε) Where 0 < ε ≪ 1 ensures score(a, b) = 0 when a = 0 or b = 0. The second relates to how we calculate the first hidden state for the decoder when we don’t use attention. Recall that a bidirectional recurrent architecture processes its input in both directions separately: the forward direction processes (x1, x2, . . . , xS) whereas the backward direction processes (xS, xS−1, . . . , x1). The bidirectional hidden state concatenates the forward and backward hidden states for the same time ht = [h f orward t , hbackward t ]. This implies hS has processed all the input in the forward direction, but only one input in the backward direction (and vice versa for h1). To ensure the decoder gets access to all input from both directions, you should initialize the first decoder state h˜ 1 = [h f orward S , hbackward 1 ] When you use attention, set h˜ 1 = 0. Our final “change” isn’t so much a change as a clarification. In multi-headed attention, recall we have N heads such that h˜ (n) t = W˜ (n)h˜ t and h (n) s = W(n)hs. W(n) and W˜ (n) need not be square matrices; the size of h˜ (n) t need not be the size of h˜ t nor h (n) s the size of hs. For this assignment, we will be setting the size of |h˜ (n) t | = |h˜ t |/N and |h (n) s | = |hs|/N (you may assume N evenly divides the hidden state size). 2 Your tasks 2.1 Setup You are expected to run your solutions on teach.cs. Download the starter code from MarkUs. You can download the starter files by clicking the “Download” button in the “Starter Files” section of the Assignment page. The Hansards parallel text data are located in the /h/u1/cs401/A2/data/ directory on teach.cs. You don’t need to copy the text data to your own teach.cs directory. Download the starter code MarkUs into your working directory to get started. You should have 8 files: a2 abcs.py, a2 bleu score.py, a2 encoder decoder.py, a2 dataloader.py, a2 run.py, a2 training and testing.py, test a2 bleu score.py, and test a2 encoder decoder.py. 2.2 Calculating BLEU scores Modify a2 bleu score.py to be able to calculate BLEU scores on single reference and candidate strings. We will be using the definition of BLEU scores from the lecture slides: BLEU = BPC × (p1p2 . . . pn) (1/n) To do this, you will need to implement the functions grouper(…), n gram precision(…), brevity penalty(…), and BLEU score(…). Make sure to carefully follow the doc strings of each function. Do not re-implement functionality that is clearly performed by some other function. Your functions will operate on sequences (e.g., lists) of tokens. These tokens could be the words themselves (strings) or an integer ID corresponding to the words. Your code should be agnostic to the type of token used, though you can assume that both the reference and candidate sequences will use tokens of the same type. 2 2.3 Building the encoder/decoder You are expected to fill out a number of methods in a2 encoder decoder.py. These methods belong to sub-classes of the abstract base classes in a2 abcs.py. The latter defines the abstract classes EncoderBase, DecoderBase, and EncoderDecoderBase, which implement much of the boilerplate code necessary to get a seq2seq model up and running. Though you are welcome to read and understand this code, it is not necessary to do so for this assignment. You will, however, need to read the doc strings in a2 abcs.py to understand what you’re supposed to fill out in a2 encoder decoder.py. Do not modify any of the code in a2 abcs.py. A high-level description of the contents of the requirements for a2 encoder decoder.py follows here. More details can be found in the doc strings in a2 {abcs,encoder decoder}.py. 2.3.1 Encoder a2 encoder decoder.Encoder will be the concrete implementation of all encoders you will use. The encoder is always a multi-layer neural network with a bidirectional recurrent architecture. The encoder gets a batch of source sequences as input and outputs the corresponding sequence of hidden states from the last recurrent layer. Encoder.forward pass defines the structure of the encoder. For every model in PyTorch, the forward function defines how the model will run, and the forward function of every encoder or decoder will first clean up your input data and call forward pass to actually define the model structure. Now you need to implement the forward pass function that defines how your encoder will run. Encoder.init submodules(…) should be filled out to initialize a word embedding layer and a recurrent network architecture. Encoder.get all rnn inputs(…) accepts a batch of source sequences F (m) 1:S(m) and lengths S (m) and outputs word embeddings for the sequences x (m) 1:S(m) . Encoder.get all hidden states(…) converts the word embeddings x (m) 1:S(m) into hidden states for the last layer of the RNN h (m) 1:S(m) (note we’re using (m) here for the batch index, not the layer index). 2.3.2 Decoder without attention a2 encoder decoder.DecoderWithoutAttention will be the concrete implementation of the decoders that do not use attention (so-called “transducer” models). Method implementations should thus be tailored to not use attention. In order to feed the previous output into the decoder as input, the decoder can only process one step of input at a time and produce one output. Thus DecoderWithoutAttention is designed to process one slice of input at a time (though it will still be a batch of input for that given slice of time). The goal, then, is to take some target slice from the previous time step E (m) t−1 and produce an un-normalized log-probability distribution over target words at time step t, called logits(m) t . Logits can be converted to a categorical distribution using a softmax: P(y (m) t = i| . . .) = exp(logits(m) t,i ) P j exp(logits(m) t,j ) DecoderWithoutAttention.forward pass defines the structure of network. Similar to what you did for your encoder, you need to assemble the model here. DecoderWithoutAttention.init submodules(…) should be filled out to initialize a word embedding layer, a recurrent cell, and a feed-forward layer to convert the hidden state to logits. DecoderWithoutAttention.get first hidden state(…) produces h˜ (m) 1 given the encoder hidden states h (m) 1:S(m) . 3 DecoderWithoutAttention.get current rnn input(…) takes the previous target E (m) t−1 (or the previous output y (m) t−1 in testing) and outputs word embedding ˜x (m) t for the current step. DecoderWithoutAttention.get current hidden state(…) takes ˜x (m) t and h˜ (m) t−1 and produces the current decoder hidden state h˜ (m) t−1 . DecoderWithoutAttention.get current logits(…) takes h˜ (m) t and produces logits(m) t . 2.3.3 Decoder with (single-headed) attention a2 encoder decoder.DecoderWithAttention will be the concrete implementation of the decoders that use single-headed attention. It inherits from DecoderWithoutAttention to avoid re-implementing get current hidden states(…) and get current logits(…). The remaining methods must be reimplemented, but slightly modified for the attention context. Two new methods must be implemented for DecoderWithAttention. DecoderWithAttention.get attention scores(…) takes in a decoder state h˜ (m) t and all encoder hidden states h (m) 1:S(m) and produces attention scores for that decoder state but all encoder hidden states: e (m) t,1:S(m) . DecoderWithAttention.attend(…) takes in a decoder state h˜ (m) t and all encoder hidden states h (m) 1:S(m) and produces the attention context vector c (m) t . Between get attention scores and attend, use get attention weights(…) to convert e (m) t,1:S(m) to α (m) t,1:S(m) , which has been implemented for you. 2.3.4 Decoder with multi-head attention a2 encoder decoder.DecoderWithMultiHeadAttention implements a multi-headed variant of attention. It inherits from DecoderWithAttention. Two methods must be re-implemented for this variant. DecoderWithMultiHeadAttention.init submodules(…) should initialize new submodules for the matrices W, W˜ , and Q. DecoderWithMultiHeadAttention.attend(…) should “split” hidden states h˜ (m) t into h˜ (m,n) t and h (m) s into h (m,n) s , where m still indexes the batch number and n indexes the head. Then it should call super().attend(…) to do the attention, and combine c (m,n) t of the N heads. We want you to do this without ever actually “splitting” any tensors! The key is to reshape the a full hidden output into N chunks. If it’s a little bit too tricky, try starting from writing the case where N = 1, i.e. when there’s no need for splitting. 2.3.5 Putting it together: the Encoder/Decoder a2 encoder decoder.EncoderDecoder coordinates the encoder and decoder. Its behaviour depends on whether it’s being used for training or testing. In training, it receives both F (m) 1:S(m) and E (m) 1:T(m) and outputs logits(m) 1:T(m) un-normalized log-probabilities over y (m) 1:T(m) . In testing, it receives only F (m) 1:S(m) and outputs K paths from beam search per batch element n: y (n,k) 1:T(n,k) . EncoderDecoder.init submodules(…) initializes the encoder and decoder. EncoderDecoder.get logits for teacher forcing(…) provides you the encoder output h (m) 1:S(m) and the targets E (m) 1:T(m) and asks you to derive logits(m) 1:T(m) according to the MLE (teacher-forcing) objective. EncoderDecoder.update beam(…) asks you to handle one iteration of a simplified version of the beam search from the slides. While a proper beam search requires you to handle the set of finished paths 4 f, update beam doesn’t need to. Letting (n, k) indicate the n th batch elements’ k th path: ∀n, k, v.b(n,k→v) t,0 ← h˜ (n,k) t+1 b (n,k→v) t,1 ← [b (n,k) t,1 , v] log P(b (n,k→v) t ) ← log P(b (n,k) t ) + log P(yt+1 = v|h˜ (n,k) t+1 ) ∀n, k.b(n,k) t+1 ← argmaxk b (n,k′→v) t log P(b (n,k′→v) t ) In short, extend the existing paths, then prune back to the beam width. A greedy update function update greedy is provided for you in a2 abcs.py. You can use the option –greedy to switch to greedy update. This option might be handy when you want to test the correctness of the rest of your assignment. 2.3.6 Padding An important detail when dealing with sequences of input and output is how to deal with sequence lengths. Individual sequences within a batch F (m) and E(m) can have unequal lengths S (m) ̸= S (n+1) , T (m) ̸= T (n+1) , but we pad the shorter sequences to the right to match the longest sequence. This allows us to parallelize across multiple sequences, but it’s important that whatever the network learns (i.e., the error signal) is not impacted by padding. We’ve mostly handled this for you in the functions we’ve implemented, with three exceptions: first, no word embedding should be learned for padding (which you’ll have to guarantee); second, you’ll have to ensure the bidirectional encoder doesn’t process the padding; and third, the first hidden state of the decoder (without attention) should not be based on padded hidden states. You are given plenty of warning in the starter code when these three cases ought to be considered. The decoder uses the end-of-sequence symbol as padding, which is entirely handled in a2 training and testing.py. 2.4 The training and testing loops After following the PyTorch tutorial, you should be familiar with how models are trained and tested in PyTorch. You are expected to implement training and testing loops in a2 training and testing.py. In a2 training and testing.compute batch total bleu(…), you are given reference (from the dataset) and candidate batches (from the model) in the target language and asked to compute the total BLEU score over the batch. You will have to convert the PyTorch tensors in order to use a2 bleu score.BLEU score(…). In a2 training and testing.compute average bleu over dataset(…), you are to follow instructions in the doc string and use compute batch total bleu(…) to determine the average BLEU score over a data set. In a2 training and testing.train for epoch(…), once again follow the doc strings to iterate through a training data set and update model parameters using gradient descent. 2.5 Running the models Once you have completed the coding portion of the assignment, it is time you run your models. In order to do so in a reasonable amount of time, you’ll have to train your models using a machine with a GPU. There are a few ways you can do this: 1. You can ssh to teach.cs and use srun to run your code on a GPU on the department’s cluster. See more details below about how to use srun. 5 2. A number of teaching labs in the Bahen building have GPUs (listed in https://www.teach.cs. toronto.edu/faq.html#ABOUT4), but you must log in at the physical machines to use them (as opposed to remote access). 3. If you have access to your own GPU, you may run this code locally and report the results. However, any modifications you make to run the code locally must be reverted to work on teach before you submit! Even on a GPU, the code can take upwards of 2 hours to complete in full. Be sure to plan accordingly! You are going to interface with your models using the script a2 run.py. This script glues together the components you implemented previously. The only meaningful remaining code is in a2 dataloader.py, which converts the Hansard sentences into sequences of IDs. Suffice to say that you not need to know how either a2 run.py nor a2 dataloader.py works, only use them (unless you are interested). Run the following code block line-by-line from your working directory. In order, it: 1. Builds maps between words and unique numerical identifiers for each language. 2. Splits the training data into a portion to train on and a hold-out portion. 3. Trains the encoder/decoder without attention and stores the model parameters. 4. Trains the encoder/decoder with single-headed attention and stores the model parameters. 5. Trains the encoder/decoder with multi-headed-attention and stores the model parameters. 6. Returns the average BLEU score of the encoder/decoder without attention on the test set. 7. Returns the average BLEU score of the encoder/decoder with single-headed attention on the test set. 8. Returns the average BLEU score of the encoder/decoder with multi-headed attention on the test set. export TRAIN=/h/u1/cs401/A2/data/Hansard/Training/ export TEST=/h/u1/cs401/A2/data/Hansard/Testing/ # 1. Generate vocabularies python3.9 a2_run.py vocab $TRAIN e vocab.e.gz python3.9 a2_run.py vocab $TRAIN f vocab.f.gz # 2. Split train and dev sets python3.9 a2_run.py split $TRAIN train.txt.gz dev.txt.gz # 3. Train a model without attention srun -p csc401 –gres gpu \ python3.9 a2_run.py train $TRAIN \ vocab.e.gz vocab.f.gz \ train.txt.gz dev.txt.gz \ model_wo_att.pt.gz \ –device cuda # 4. Train a model with attention srun -p csc401 –gres gpu \ python3.9 a2_run.py train $TRAIN \ vocab.e.gz vocab.f.gz \ 6 train.txt.gz dev.txt.gz \ model_w_att.pt.gz \ –with-attention \ –device cuda # 5. Train a model with multi-head attention srun -p csc401 –gres gpu \ python3.9 a2_run.py train $TRAIN \ vocab.e.gz vocab.f.gz \ train.txt.gz dev.txt.gz \ model_w_mhatt.pt.gz \ –with-multihead-attention \ –device cuda # 6. Test the model without attention srun -p csc401 –gres gpu \ python3.9 a2_run.py test $TEST \ vocab.e.gz vocab.f.gz model_wo_att.pt.gz \ –device cuda # 7. Test the model with attention srun -p csc401 –gres gpu \ python3.9 a2_run.py test $TEST \ vocab.e.gz vocab.f.gz model_w_att.pt.gz \ –with-attention –device cuda # 8. Test the model with multi-head attention srun -p csc401 –gres gpu \ python3.9 a2_run.py test $TEST \ vocab.e.gz vocab.f.gz model_w_mhatt.pt.gz \ –with-multihead-attention –device cuda Steps 1 and 2 should not fail and need only be run once. Step 3 onward depend on the correctness of your code. The srun -p csc401 –gres gpu is necessary to run on a GPU on teach. You do not need a GPU for the first two steps. If you are running the training/testing locally, you will not need srun -p csc401 when running steps 3-6. The srun prefix is only needed when running on the (remote) teach server that uses SLURM1 to schedule processes on the department’s cluster. Because all students in this class or any class requiring GPUs will be running their jobs on the cluster please only run steps 3-6 after you have debugged your code. We discuss how you can train a smaller network below that will only take a fraction of the time anyway. In a file called analysis.txt, provide the following: • The printout after every epoch of the training loop of both the model for the model trained without, with single-headed, and with multi-headed attention. Clearly indicate which is which. • The average BLEU score reported on the test set for each model. Again, clearly indicate which is which. 1 https://en.wikipedia.org/wiki/Slurm_Workload_Manager 7 • A brief discussion on your findings. Was there a discrepancy in between training and testing results? Why do you think that is? If one model did better than the others, why do you think that is? 2.6 Bonus [up to 15 marks] We will give bonus marks for innovative work going substantially beyond the minimal requirements. However, your overall mark for this assignment cannot exceed 100%. Submit your write-up in bonus.pdf. You may decide to pursue any number of tasks of your own design related to this assignment, although you should consult with the instructor or the TA before embarking on such exploration. Certainly, the rest of the assignment takes higher priority. Some ideas: • Perform substantial data analysis of the error trends observed in each method you implement. This must go well beyond the basic discussion already included in the assignment. • There are many possible ways to assemble attentions for the encoder / decoder. For example, dotproduct attention similar to (Vaswani et al., 2017)2 , additive attention (Bahdanau et al., 2014)3 and structured self attention (Lin et al., 2017)4 . Analyze several attention mechanisms, compare their performances, and include discussions towards their reasons of the performance differences. • Explore the effects of using different ‘attention score function’ for computing attention score as discussed above and include attention visualization of the different attention functions. 2 https://arxiv.org/abs/1706.03762 3 https://arxiv.org/abs/1409.0473 4 https://arxiv.org/abs/1703.03130 8 3 Submission requirements This assignment is submitted electronically. Submit your assignment on MarkUs. Do not tar or compress your files, and do not place your files in subdirectories. You should submit: 1. The files a2 bleu score.py, a2 encoder decoder.py, and a2 training and testing.py that you filled out according to assignment specifications. We will not accept a2 abc.py, a2 dataloader.py, nor a2 run.py – your assignment must be compatible with the versions we provided you. 2. Your write-up on the experiment in analysis.txt. 3. If you are submitting a bonus, tell us what you’ve done by submitting a write-up in bonus.pdf. Please distribute bonus code amongst the above *.py files, being careful not to break functions, methods, and classes related to the assignment requirements. You should not submit any additional files that you generated to train and test your models. For example, do not submit your model parameter files *.pt.gz or vocab files *.{e,f}.gz. Only submit the above files. Additional source files containing helper functions are not permitted. 9 A Suggestions A.1 Check Piazza regularly Updates to this assignment as well as additional assistance outside tutorials will be primarily distributed via Piazza (https://piazza.com/class/kx9cacio1zf5rh). It is your responsibility to check Piazza regularly for updates. A.2 Run cluster code early and at irregular times Because GPU resources are shared with your peers, your srun job may end up on a weaker machine (or even postponed until the resources are available) if too many students are training at once. To help balance resource usage over time, we recommend you finish this assignment as early as possible. You might find that your peers are more likely to run code at certain times in the day. To check how many jobs are currently queued or running on our partition, please run squeue -p csc401. If you decide to run your models right before the assignment deadline, please be aware that we will be unable to request more resources or run your code sooner. We will not grant extensions for this reason. By the way, if you end up scheduling your job on one of the slower machines, you might end up with slightly different results than on the fastest ones. This is an unfortunate but normal by-product of how the GPUs calculate results in slightly different ways. Your BLEU score should not differ by more than 3% absolute. A.3 Connection persistence: keep training after disconnecting Training a model can take hours. If your internet connection is weak, you run the risk of losing your progress. You can use the Linux screen (https://linux.die.net/man/1/screen) command to create a persistent shell that is only destroyed when you exit from it. This will allow training to continue even if disconnected. The most basic usage of screen is as follows. To start a new shell, run the command screen. You should see the same shell interface as before (i.e. wolf:$) but you are now in a new shell instance. You can type Ctrl+D or the command exit to kill the screen, or type Ctrl+A followed by Ctrl+D to detach from the screen. A detached screen persists between SSH sessions. If you want to reconnect to your shell, try the command screen -r. Tmux users can find the ‘Tmux cheatsheet’ at https://tmuxcheatsheet.com/ helpful. A.4 Using your own computer If you want to do some or all of this assignment on your laptop or other computer, you will have to do the extra work of downloading and installing the requisite software and data. You take on the risk that your computer might not be adequate for the task. You are strongly advised to upload regular backups 10 of your work to teach.cs, so that if your machine fails or proves to be inadequate, you can immediately continue working on the assignment at teach.cs. When you have completed the assignment, you should try your programs out on teach.cs to make sure that they run correctly there. A submission that does not work on teach.cs will get zero marks. That said, due to concerns of limited resources, we will allow you to report the results of training/testing your model on your local machine. The code must still conform to the teach.cs environment, but the contents of analysis.txt can be based on your local environment. A.5 Unit testing We strongly recommend you test the methods and functions you’ve implemented prior to running the training loop. You can test actual output against expected input for complex methods like update beam(…). The Python 3.7 teach environment has installed pytest (https://docs.pytest.org/en/5.1.2/). We have included some preliminary tests for BLEU score(…) and update beam(…). You can run your test suite on teach by calling python3.9 -m pytest While the test suite will execute initially, the provided tests will fail until you implement the necessary methods and functions. While passing these initial tests is a necessity for full marks, they are not sufficient on their own. Please be sure to add your own tests. Unit testing is not a requirement, nor will you receive bonus marks for it. A.6 Debugging task Instead of re-running the entire task on GPU when debugging your code, we recommend that you run a much smaller version of the task until you are confident that you are error free. Ideally, you should only need to run the full task once. The following commands may be used to set up such a task. # Variables TRAIN and TEST as before. export OMP_NUM_THREADS=4 # avoids a libgomp error on teach # create an input and output vocabulary of only 100 words python3.9 a2_run.py vocab $TRAIN e vocab_tiny.e.gz –max-vocab 100 python3.9 a2_run.py vocab $TRAIN f vocab_tiny.f.gz –max-vocab 100 # only use the proceedings of 4 meetings, 3 for training and 1 for dev python3.9 a2_run.py split $TRAIN train_tiny.txt.gz dev_tiny.txt.gz –limit 4 # use far fewer parameters in your model python3.9 a2_run.py train $TRAIN \ vocab_tiny.e.gz vocab_tiny.f.gz \ train_tiny.txt.gz dev_tiny.txt.gz \ model.pt.gz \ 11 –epochs 2 \ –word-embedding-size 51 \ –encoder-hidden-size 101 \ –batch-size 5 \ –cell-type gru \ –beam-width 2 # use with the flags –with-attention and –with-multihead-attention to test single- # and multi-headed attention, respectively. We request that you first see if running directly on the teach.cs server (wolf) runs quickly enough before attempting to use the cluster. Tested at low occupancy, each epoch took about 30 seconds with welloptimized code directly on wolf. At high occupancy, each epoch took 4 minutes. Note that your BLEU will likely be high in the reduced vocabulary condition – even using very few parameters – since your model will end up learning to output the out-of-vocabulary symbol. Do not report your findings on the toy task in analysis.txt. A.7 Beam search not finished warning You might come across a warning like this during training: a2_abcs.py:882: UserWarning: Beam search not finished by t=100. Halted This just means your model failed to output an end-of-sequence token after t = 100. This may not mean you’ve made a mistake. On certain machines in the cluster and certain network configurations, even our solutions give this warning. However, it should not occur over all epochs. Your BLEU score shouldn’t change much whether or not you get this warning because it should only occur on a few sentences. If the warning keeps popping up or your BLEU scores are close to zero, you probably have an error. A.8 Recurrent cell type You’ll notice that the code asks you to set up a different recurrent cell type depending on the setting of the attribute self.cell type. This could be an LSTM, GRU, or RNN (the last refers to the simple linear weighting ht = σ(W[x, ht−1] + b) that you saw in class). The three cell types act very similarly – GRUs and RNNs are largely interchangeable from a programming perspective – except the LSTM cell often requires you to carry around both a cell state and a hidden state. Pay careful attention to the documentation for when h t, htilde t, etc. might actually be a pair of the hidden state and cell state as opposed to just a hidden state. Sometimes no change is necessary to handle the LSTM cell. Other times you might have to repeat an operation on the elements individually. Take advantage of the following pattern: 12 if self.cell_type == ’lstm’: # do something else: # do something else Be sure to rerun training with different cell types (i.e. use the flag –cell-type) to ensure your code can handle the difference. 13 B Variable names and slides We try to match the variable names in a2 abc.py to those in the lecture slides on machine translation. The table below serves as reference for converting between the two. In general, we denote a specific indexed value of a variable from the slides, such as Et , with an underscored variable name, e.g. E t. When an index is omitted, it means the variable contains all the indexed values at once, e.g. F corresponds to F1:S. Note that the slides are 1-indexed, whereas code is 0-indexed. Also, all variables in the PyTorch code are batched, but the slides only look at one sequence at a time. 14 Variable Slides Notes F F1:S Source sequence. F lens S In the code, the maximum-length source sequence in the batch is said to have length S. The actual length of each sequence in the batch (before padding) is stored in F lens. x x Encoder RNN inputs. h h1:S Encoder hidden states. Always refers to the last encoder layer’s hidden states, with both directions concatenated. htilde 0 h˜ 1 The first decoder hidden state. E tm1 Et−1 The target token at t − 1 (previous). xtilde t x˜t Decoder RNN input at time t (current). htilde t h˜ t Decoder hidden state at time t (current). For the LSTM architecture, this can be a pair with the cell state. Note in the beam search update htilde t also includes paths, i.e. h˜ (1:K) t . logits t log P(yt | . . .) + C Un-normalized log-probabilities over the target vocabulary at time t (current). Pre-softmax. E E1:T Target sequence. logits log P(yt | . . .) 1:T + C Un-normalized log-probabilities over the the target vocabulary across each time step. Pre-softmax. b tm1 1 b (1:K) t−1,1 All prefixes in the beam search at time t − 1 (previous). logpb tm1 log P(b (1:K) t−1 ) The log-probabilities of the beam search prefixes up to time t − 1 (previous). logpy t log P(yt | . . .) Valid (normalized) log-probabilities over the target vocabulary at time t (current). Post-softmax. b t 0 b (1:K) t,0 Decoder hidden states at time t (current). The difference between b (1:K) t,0 and h˜ (1:K) t is contextual: the latter points to the paths in the beam before the update, the former after the update. b t 1 b (1:K) t,1 All prefixes in the beam seach at time t (current). logpb t log P(b (1:K) t ) The log-probabilities of the beam search prefixes up to time t (current). c t ct Context vector (for attention) at time t (current). alpha t αt,1:S Attention weights over all source times at target time t (current). e t et,1:S Attention scores over all source times at target time t (current). 15