Description
1. Suppose we are tagging a sentence “They run programs” with two types of tags: N
(noun) and V (verb). The initial probabilities, transition probabilities and emission
probabilities computed by an HMM model are shown below.
N V
π 0.8 0.2
Table 1: Initial probabilities
N V
N 0.4 0.6
V 0.8 0.2
Table 2: Transition probabilities
they run programs
N 0.6 0.2 0.2
V 0 0.6 0.4
Table 3: Emission probabilities
(a) Given the HMM model above, compute the probability of the given sentence
“They run programs” by using the Forward AND Backward Algorithm. Please
show all the steps of calculations. [6 pts]
(b) Tag the sentence “They run programs” by using the Viterbi Algorithm. Please
show all the steps of calculations. [4 pts]
Solution. Solution goes here.
2. Suppose we are parsing a sentence “submit the homework through Canvas” with a
Probability Context Free Grammar (PCFG) in Chomsky Normal Form. We give the
PCFG as follows.
(a) Parse the sentence “submit the homework through Canvas” by using the CKY
Algorithm. You need to show the probabilities of all possible parses and choose
the parse with the highest probability. Please show all the steps of calculations.
[8 pts]
(b) Compute the probability of the given sentence “submit the homework through
Canvas” by using the CKY Algorithm. Please show all the steps of calculations.
[2 pts]
Solution. Solution goes here.
Rule Probability
S −→ NP VP 0.8
S −→ submit 0.01
S −→ Verb NP 0.05
S −→ VP PP 0.03
NP −→ Canvas 0.16
NP −→ Gradescope 0.04
NP −→ Det Nominal 0.6
Nominal −→ Nominal Noun 0.2
Nominal −→ Nominal PP 0.5
Nominal −→ homework 0.15
VP −→ submit 0.1
VP −→ contain 0.04
VP −→ Verb NP 0.5
VP −→ VP PP 0.3
PP −→ Prep NP 1.0
Det −→ the 0.6
Prep −→ through 0.2
Verb −→ submit 0.5
Table 4: Initial probabilities
3. For the programming task, you will implement several models for part-of-speech tagging.
We will use the LSTM as the basic architecture for the model and try several modifications
to improve performance. To do this, you will complete the Python code started in
POS tagging.ipynb, available on Canvas along with the train and test data (train.txt
and test.txt) in POS tagging.zip. For all the written questions to Q3 (accuracy
reporting and error analysis), please provide your answers in the space provided inside
the notebook.
NOTE: for each model that you test, you will produce a file test labels.txt that
you can upload to Gradescope to evaluate your model’s performance.
(a) To begin, implement an LSTM tagger by completing the skeleton code provided
in BasicPOSTagger in the notebook. The model will use word embeddings to
represent the sequence of words in the sentence.
For this problem, implement the forward pass and the training procedure for the
tagger. After training for 30 epochs, the model should achieve at least 80% on
the held-out data.
In addition, compute the top-10 most frequent types of errors that the model made
in labeling the data in test.txt (e.g. if the model labeled NN as VB 10 times) and
report these errors in the written answer. Report the results in a table containing
the top-10 mistakes, with the following columns: model tag type, ground truth
tag type, error frequency, up to 5 example words that were tagged incorrectly.
What kinds of errors did the model make and why do you think it made them?[5 pts]
(b) Word-level information is useful for part-of-speech tagging, but what about character
level information? For instance, the present tense of English verbs is marked with
the suffix -ing, which can be captured by a sequential character model.
For this problem, implement the character-level model by completing the skeleton
code provided in CharPOSTagger in the notebook. Unlike part (a), this model will
use character embeddings. After training for 30 epochs, the model should achieve
at least 80% on the held-out data.
In addition, compare the top-10 types of errors made by the character-level model
with the errors made by the word-level model from part (a) and report these errors
in the written answer. Report the error results in the same format as before. What
kinds of errors does the character-level model make as compared to the original
model, and why do you think it made them? [10 pts]
(c) The previous models considered only a single direction, i.e. feeding the information
from left to right (starting at word i = 1 and ending at word i = N). For the final
model, you will implement a bidirectional LSTM that uses information from both
left-to-right and right-to-left to represent dependencies between adjacent words
in both directions. In order to decide if the word “call” is VB or NN, the model can
use information from the following words to disambiguate (e.g. “call waiting” vs.
“call him”).
For this problem, implement the bidirectional model by completing the skeleton
code provided in BiLSTMPOSTagger in the notebook. Like the model in part (a),
this will use word embeddings. After training for 30 epochs, the model should
achieve at least 90% on the held-out data.
In addition, implement one of the three modifications listed in the notebook
to boost your score: (a) different word embeddings to initialize the model (one
different type), (b) different hyperparameters to initialize the model (five different
combinations), (c) different model architecture on top of the BiLSTM (one different
architecture). Explain in the written answer which modifications you have made,
why you chose a certain modification over another (e.g. Glove over word2vec)
and the resulting accuracy of the modified model.
Lastly, compare the top-10 errors made by this modified model with the errors
made by the model from part (a). Report the error results in the same format as
before. If you tested multiple different hyperparameters, compute the errors for
the model with the highest accuracy on the validation data. What errors does the
original model make as compared to the modified model, and why do you think
it made them? [10 pts]