
Solved CS310 Assignment 5: Dependency Parsing 

$30.00 $18.00


Download Details:

  • Name: A5-s3dt4z.zip
  • Type: zip
  • Size: 20.09 MB

You will instantly receive a download link upon payment. Click the Original Work button for custom work.

Description


Tasks

• Train a feed-forward neural-network-based dependency parser and evaluate its performance on the provided treebank dataset.
• (Bonus 1) Implement the arc-eager approach for the parser.
• (Bonus 2) Re-implement the feature extractor using a Bi-LSTM as the backbone encoder.

Submit

• See README.md for submission requirements.
• A report document is not mandatory for Tasks 1-5. It is required for bonus Tasks 6 and 7.

Requirements

1) Implement the Feature Extractor (15 points)

Following Chen and Manning (2014), the parser will use features from the top three words on the stack, s_1, s_2, s_3, and the top three words on the buffer, b_1, b_2, b_3. Using the implementation from the lab practice, they correspond to stack[-1], stack[-2], stack[-3] and buffer[-1], buffer[-2], buffer[-3], respectively. We also include their POS tags as features, i.e., t_1, ..., t_n.

Specifically, words and tags are first associated with embedding vectors e(w_i) and e(t_i) for i = 1, ..., n, where n is the length of the input sentence. The feature for the current configuration c is then:

φ(c) = e(s_3) ⊕ e(s_2) ⊕ e(s_1) ⊕ e(b_1) ⊕ e(b_2) ⊕ e(b_3) ⊕ e(t_{s_3}) ⊕ e(t_{s_2}) ⊕ e(t_{s_1}) ⊕ e(t_{b_1}) ⊕ e(t_{b_2}) ⊕ e(t_{b_3})

Here, t_{s_i} is the POS tag of the i-th word on the stack, and t_{b_i} is the tag of the i-th word on the buffer. Note that these are NOT the tags of the i-th word in the sentence.

In some configurations, the stack or buffer may have fewer than 3 words. In those cases, use the pseudo token "<NULL>" to fill the missing slots; it is also associated with an embedding e("<NULL>"). So is the special token "<ROOT>", with embedding e("<ROOT>"). Their POS tags (which do not exist) can be pseudo values as well. For example, if the buffer contains ["apple", "trees", "grow"] and the stack contains ["<ROOT>", "the"], then the concatenated word vectors are:

e("<NULL>") ⊕ e("<ROOT>") ⊕ e("the") ⊕ e("apple") ⊕ e("trees") ⊕ e("grow")

You can infer the concatenated tag vectors similarly. Chen and Manning (2014) use embedding size d = 50. You can use larger values.

2) Implement the Scoring Oracle (10 points)

The extracted feature is passed to the scoring oracle, which determines the transition action t given the feature φ(c) produced by the feature function. The scoring function can be a multi-layer perceptron (MLP):

Score_θ(φ(c), t) = MLP_θ(φ(c))[t], where MLP_θ(x) = W^[2] · tanh(W^[1] · x + b^[1]) + b^[2]

(Chen and Manning (2014) use hidden layer size h = 200 and tanh activation. You can use ReLU.)

The oracle model should be implemented as a subclass of torch.nn.Module. Two versions of the oracle are required: BaseModel (using words only) and WordPOSModel (using words + POS).

3) Training (5 points)

The forward function takes as input a sequence of training data instances. A training instance has two parts, X and y. X represents the current configuration state, that is, the integer IDs of the 6 words on the stack and buffer, together with their POS tag IDs. You should implement how these IDs are converted to embedding vectors as described in Tasks 1 and 2, and then compute the loss (negative log-likelihood) between the predicted transition action ŷ and the ground truth y.
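To make Tasks 1-3 concrete, here is a minimal sketch of a WordPOSModel oracle and its training loss in PyTorch. This is not the required interface: the constructor arguments, vocabulary sizes, and the (batch, 12) input layout are illustrative assumptions; only the overall shape (embedding lookups for the 6 word IDs and 6 POS IDs, one hidden layer, log-probabilities over transitions) follows the handout.

```python
import torch
import torch.nn as nn

class WordPOSModel(nn.Module):
    """Sketch of the scoring oracle: 6 word IDs + 6 POS IDs -> scores over transitions.

    Embedding size 50 and hidden size 200 follow the Chen and Manning (2014)
    defaults mentioned above; vocabulary sizes are placeholders.
    """
    def __init__(self, word_vocab_size, pos_vocab_size, num_transitions,
                 word_dim=50, pos_dim=50, hidden_dim=200):
        super().__init__()
        # <NULL> and <ROOT> are ordinary rows in these tables, so missing
        # stack/buffer slots are handled simply by passing their IDs.
        self.word_emb = nn.Embedding(word_vocab_size, word_dim)
        self.pos_emb = nn.Embedding(pos_vocab_size, pos_dim)
        input_dim = 6 * word_dim + 6 * pos_dim   # phi(c): 12 concatenated embeddings
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, num_transitions)

    def forward(self, x):
        # x: (batch, 12) LongTensor -- first 6 columns are word IDs, last 6 are POS IDs.
        word_ids, pos_ids = x[:, :6], x[:, 6:]
        feats = torch.cat([self.word_emb(word_ids).flatten(1),
                           self.pos_emb(pos_ids).flatten(1)], dim=1)
        scores = self.output(torch.tanh(self.hidden(feats)))
        # Return log-probabilities so the Task 3 loss is a plain NLL loss.
        return torch.log_softmax(scores, dim=1)

# Task 3 training step, roughly:
#   log_probs = model(X_batch)                          # X_batch: (batch, 12) LongTensor
#   loss = nn.functional.nll_loss(log_probs, y_batch)   # y_batch: gold transition IDs
```

BaseModel would look the same with the POS embedding table and the last 6 input columns dropped.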
4) Implement the Parser for Inference (15 points)

We want the parser to be able to parse an unannotated input sentence, so you should also implement a parse_sentence function (as a member of a Parser class) that takes a sequence of words as input and returns the parsed tree. This function is very similar to forward in terms of computation, except that it does not know the gold-standard transition actions. Its job is to carry out transition-based greedy parsing.

[Figure: the transition-based greedy parsing algorithm, from Kiperwasser and Goldberg (2016)]

There are several notable points about the above algorithm:

• You can implement the exit condition TERMINAL(c) with an if statement that checks whether "<ROOT>" is the only element on the stack and the buffer is empty.
• The argmax operation means that you select the highest-scoring transition, but the highest-scoring transition may not be possible in the current configuration. Therefore, instead of selecting the highest-scoring action overall, you should select the highest-scoring permitted transition, as indicated by the t ∈ LEGAL(c) subscript.

5) Evaluation (5 points)

Run the evaluation script and check the LAS and UAS scores on the dev and test sets.

** Grading rubrics **

• If your model is implemented correctly and the training code runs without problems, you get full credit for Tasks 1, 2, and 3.
• If your UAS score x >= 70% on the test set, you get full points for Tasks 4 and 5.
• If 50% <= x < 60%, you get 10 points for Task 4 and 2.5 points for Task 5.
• If x < 50%, you receive 0 points for Tasks 4 and 5.

**** Bonus Tasks

6) Implement the arc-eager approach (5 points)

You only need to modify the get_training_instances function and the State class. You do not need to actually train the parser. Note that you should add the new "Reduce" action to the code. Prove that your implementation is different from the arc-standard approach by showing an example. (The example should appear in the report.)

7) Implement the Bi-LSTM-based encoder (5 points)

Follow Kiperwasser and Goldberg (2016), which uses words and POS tags as features. Given an n-word input sentence w_1, ..., w_n together with the corresponding POS tags t_1, ..., t_n, we associate each word and POS tag with embedding vectors e(w_i) and e(t_i), and create a sequence of input vectors x_{1:n} in which each x_i is the concatenation of the word and POS embeddings:

x_i = e(w_i) ⊕ e(t_i)

The input x_{1:n} is then passed to a Bi-LSTM encoder, which produces a hidden representation for each input element:

v_i = BiLSTM(x_{1:n}, i)

The feature function is the concatenation of the Bi-LSTM vectors of the top 3 items on the stack and the first item on the buffer. That is, for a configuration c = {stack: [... | s_3 | s_2 | s_1], buffer: [b_1 | ...]}, the feature for this configuration is:

φ(c) = v_{s_3} ⊕ v_{s_2} ⊕ v_{s_1} ⊕ v_{b_1}, where v_i = BiLSTM(x_{1:n}, i)

In our implementation of the configuration state in the lab, the stack top s_1 corresponds to state.stack[-1], the buffer front b_1 corresponds to state.buffer[-1], and so forth for the other elements. If the stack contains fewer than 3 words or the buffer is empty, use the special token "<NULL>" to fill the blanks. For example, if the buffer = [..., "apple"] and the stack = ["<ROOT>", "the"], then we should use

e("<NULL>") ⊕ e("<ROOT>") ⊕ v_"the" ⊕ v_"apple"

as the feature. For another example, if the buffer is empty and the stack = ["<ROOT>", "the"], then we should use

e("<NULL>") ⊕ e("<ROOT>") ⊕ v_"the" ⊕ e("<NULL>")

as the feature. Note that the vectors for "<NULL>" and "<ROOT>" are not outputs of the Bi-LSTM, because they are not actual words; they are simply returned by the embedding table. (A sketch of this feature extraction appears after the grading rubrics below.)

(If your Bi-LSTM implementation is complex, explain your code and demonstrate how performance changes in the report.)

** Grading rubrics **

• For Task 6, you only need to change the State class and the get_training_instances function by adding "Reduce" to the transition action set. You do not need to train a new parser.
• For Task 7, you need to achieve better LAS and UAS scores than in Task 5.
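The following is a minimal sketch of the Task 7 feature extraction described above, following the lab convention that state.stack[-1] is s_1 and state.buffer[-1] is b_1. The class name BiLSTMEncoder, the helper extract_features, the embedding sizes, and the assumption that sentence position 0 is the <ROOT> pseudo-token are all illustrative choices, not requirements.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Sketch of the Task 7 encoder: x_i = e(w_i) (+) e(t_i), v_1..v_n = BiLSTM(x_1..x_n)."""
    def __init__(self, word_vocab_size, pos_vocab_size,
                 word_dim=100, pos_dim=25, hidden_dim=125):   # illustrative sizes
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab_size, word_dim)
        self.pos_emb = nn.Embedding(pos_vocab_size, pos_dim)
        # Bidirectional LSTM over the whole sentence; each v_i has size 2 * hidden_dim.
        self.bilstm = nn.LSTM(word_dim + pos_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # <NULL> and <ROOT> are looked up directly from this table (not run through
        # the Bi-LSTM), sized to match v_i as the handout requires.
        self.special_emb = nn.Embedding(2, 2 * hidden_dim)    # row 0 = <NULL>, row 1 = <ROOT>

    def forward(self, word_ids, pos_ids):
        # word_ids, pos_ids: (1, n) LongTensors for one sentence (real words only).
        x = torch.cat([self.word_emb(word_ids), self.pos_emb(pos_ids)], dim=-1)
        v, _ = self.bilstm(x)          # (1, n, 2 * hidden_dim)
        return v.squeeze(0)            # (n, 2 * hidden_dim)

def extract_features(encoder, v, state):
    """phi(c) = v_{s3} (+) v_{s2} (+) v_{s1} (+) v_{b1}, padding missing slots."""
    NULL, ROOT = 0, 1

    def slot(items, k):
        # items holds sentence positions with the top/front at index -1 (lab convention).
        if len(items) < k:
            return encoder.special_emb(torch.tensor(NULL))
        pos = items[-k]
        if pos == 0:                   # assumed convention: position 0 is <ROOT>
            return encoder.special_emb(torch.tensor(ROOT))
        return v[pos - 1]              # v is indexed over the real words only

    return torch.cat([slot(state.stack, 3), slot(state.stack, 2),
                      slot(state.stack, 1), slot(state.buffer, 1)])
```

The resulting φ(c) can then be fed into the same MLP oracle as in Task 2, with its input dimension changed to 4 × (2 × hidden_dim).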
References

Chen, D. and C. D. Manning (2014). A fast and accurate dependency parser using neural networks. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Kiperwasser, E. and Y. Goldberg (2016). Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4: 313-327.