## Description

1. Conditional Random Fields

As we explained in class, one challenge of learning a conditional random field is computing the denominator of the following equation:

$$p(y \mid x) = \frac{\exp(\theta^\top f(x, y))}{\sum_{y' \in \mathcal{Y}^T} \exp(\theta^\top f(x, y'))} \qquad (1)$$

where $x = (x_1, \ldots, x_T)$, in the case of POS tagging, is a sequence of words with length $T$, $y = (y_1, \ldots, y_T)$ is the corresponding sequence of POS tags, and $\mathcal{Y}^T$ is the set of all possible POS sequences with length $T$. Assume the size of the pre-defined POS tag set is $K$. A simple brute-force computation of $\sum_{y' \in \mathcal{Y}^T} \exp(\theta^\top f(x, y'))$ has complexity $O(K^T)$, which is almost impossible in practice. On the other hand, with the decomposition of $f(x, y)$ as

$$f(x, y) = \sum_{i=1}^{T} f_i(x_i, y_i, y_{i-1}) + f_{T+1}(y_{T+1}, y_T), \qquad (2)$$

with $y_0 = \langle\text{start}\rangle$ and $y_{T+1} = \langle\text{stop}\rangle$, we are able to design an algorithm with time complexity $O(T \cdot K^2)$.

Note that equation (2) on the lecture 08 slides ignores $f_{T+1}(y_{T+1}, y_T)$ for simplicity.

• (3 points) Please follow a similar idea to Viterbi decoding to design a dynamic programming algorithm that can compute

$$\sum_{y' \in \mathcal{Y}^T} \exp(\theta^\top f(x, y'))$$

with time complexity $O(T \cdot K^2)$. [Hint: the framework of your algorithm should look like Algorithm 11 on page 150 in [Eisenstein, 2018], without the back-tracing part.]

• (1 point) Please justify that the time complexity of this algorithm is $O(T \cdot K^2)$.
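The key difference from Viterbi is that the dynamic program sums over previous tags instead of maximizing. As a sanity-check sketch only (not a substitute for your own derivation), the forward recursion below computes the log of this sum in $O(T \cdot K^2)$, working in log space for numerical stability. The score-array layout (`log_phi`, `log_phi_start`, `log_phi_stop`) is an assumed encoding of the scores $\theta^\top f_i$, not notation from the assignment.

```python
import numpy as np

def log_partition(log_phi, log_phi_start, log_phi_stop):
    """Compute log sum_{y'} exp(theta^T f(x, y')) with a forward recursion.

    log_phi[i, k, j] : score of tag j at position i+2 following tag k (i = 0..T-2)
    log_phi_start[j] : score of tag j at position 1, following the start tag
    log_phi_stop[k]  : score of the stop tag following tag k (the f_{T+1} term)
    """
    alpha = log_phi_start.copy()  # alpha[j] = log-sum of all prefixes ending in tag j
    for i in range(log_phi.shape[0]):
        # alpha_new[j] = logsumexp_k(alpha[k] + log_phi[i, k, j]) -- O(K^2) per position
        scores = alpha[:, None] + log_phi[i]          # shape (K, K)
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    scores = alpha + log_phi_stop                     # fold in the stop transition
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())
```

For small $T$ and $K$ the result can be checked against a brute-force enumeration of all $K^T$ sequences, which is a useful unit test before justifying the complexity bound.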

2. POS Tagging

In the attachment, you can find three files for this part:

• trn.pos
• dev.pos
• tst.pos

In both trn.pos and dev.pos, each line is one sentence, and each token consists of a word and its POS tag, written as

word/POS

As you may notice, the POS tag set is much smaller than the conventional POS tag set in the Penn Treebank. There are only 10 tags in total:

$$\mathcal{Y} = \{\text{A, C, D, M, N, O, P, R, V, W}\}$$


Remember, there are also two special tags for POS tagging: $\langle\text{start}\rangle$ and $\langle\text{stop}\rangle$. We are going to estimate both transition and emission probabilities first, then use them to build a POS tagger for the data in the dev set.

Please follow these steps to build your POS tagger:

(a) Preprocessing. Scan the training examples in trn.pos and map the words with frequency less

than K = 2 into a special unknown token unk.

(b) From the training examples in trn.pos, estimate the transition probability as

$$p(y \mid y') = \frac{\#(y, y') + \alpha}{\#(y') + \alpha \cdot (|\mathcal{Y}| + 1)} \qquad (3)$$

where $y$ and $y'$ can be any two possible tags in $\mathcal{Y}$, $|\mathcal{Y}|$ is the size of $\mathcal{Y}$, and $\alpha = 1$. Note that there are also two special cases, $p(y \mid \langle\text{start}\rangle)$ and $p(\langle\text{stop}\rangle \mid y')$, that you need to consider. Make sure the $p(y \mid y')$ are valid probabilities, i.e.,

$$\sum_{y \in \mathcal{Y} \cup \{\langle\text{stop}\rangle\}} p(y \mid y') = 1 \qquad (4)$$

where $y' \in \mathcal{Y} \cup \{\langle\text{start}\rangle\}$.

(c) From the training examples in trn.pos, estimate the emission probability as

$$p(x \mid y) = \frac{\#(x, y) + \beta}{\#(y) + \beta \cdot |\mathcal{V}|} \qquad (5)$$

where $y \in \mathcal{Y}$, $x \in \mathcal{V}$ is any word in the vocabulary $\mathcal{V}$, $|\mathcal{V}|$ is the size of $\mathcal{V}$, and $\beta = 1$. Again, make sure the $p(x \mid y)$ are valid probabilities, i.e.,

$$\sum_{x \in \mathcal{V}} p(x \mid y) = 1 \qquad (6)$$

(d) (2 points) Convert the estimated probabilities from the previous steps into log space and implement the Viterbi decoding algorithm as in Algorithm 11 in [Eisenstein, 2018]. Report the accuracy of your decoder on the dev set dev.pos.

(e) (2 points) Tune α and β to find the best accuracy on the dev set. Report the best accuracy

number and the values of α and β.

(f) (2 points) Run the model selected from the previous step on the test set tst.pos, and report

the accuracy.
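To make the estimation and decoding steps concrete, here is a minimal sketch of steps (b)–(d) in NumPy. All names (`estimate`, `viterbi`, the `"<s>"`/`"</s>"` strings standing in for the start and stop tags) are illustrative assumptions, the rare-word preprocessing of step (a) is omitted, and this is not a complete submission, only one possible shape for the pipeline.

```python
import numpy as np
from collections import Counter

START, STOP = "<s>", "</s>"  # illustrative stand-ins for the two special tags

def estimate(tagged_sents, tags, vocab, alpha=1.0, beta=1.0):
    """Smoothed transition (eq. 3) and emission (eq. 5) log-probabilities."""
    trans, emit, tag_count = Counter(), Counter(), Counter()
    for sent in tagged_sents:            # sent: list of (word, tag) pairs
        prev = START
        for w, t in sent:
            trans[(prev, t)] += 1
            emit[(t, w)] += 1
            tag_count[prev] += 1
            prev = t
        trans[(prev, STOP)] += 1          # close the sentence with the stop tag
        tag_count[prev] += 1
    Y1, V = len(tags) + 1, len(vocab)     # |Y| + 1 and |V| in the denominators
    def log_t(prev, t):
        return np.log((trans[(prev, t)] + alpha) / (tag_count[prev] + alpha * Y1))
    def log_e(t, w):
        return np.log((emit[(t, w)] + beta) / (tag_count[t] + beta * V))
    return log_t, log_e

def viterbi(words, tags, log_t, log_e):
    """Most likely tag sequence, O(T * K^2) with back-tracing."""
    K, T = len(tags), len(words)
    score = np.array([log_t(START, t) + log_e(t, words[0]) for t in tags])
    back = np.zeros((T, K), dtype=int)
    for i in range(1, T):
        new = np.empty(K)
        for j, t in enumerate(tags):
            cand = score + np.array([log_t(p, t) for p in tags])
            back[i, j] = cand.argmax()
            new[j] = cand[back[i, j]] + log_e(t, words[i])
        score = new
    score = score + np.array([log_t(t, STOP) for t in tags])
    best = [int(score.argmax())]
    for i in range(T - 1, 0, -1):
        best.append(back[i, best[-1]])
    return [tags[j] for j in reversed(best)]
```

A quick check worth doing before tuning $\alpha$ and $\beta$: verify equations (4) and (6) hold by summing the estimated probabilities over $\mathcal{Y} \cup \{\langle\text{stop}\rangle\}$ and over $\mathcal{V}$, respectively.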

Note that

• Please implement your code from scratch. You are not allowed to use any existing machine learning package to solve this problem (it is OK to use NumPy and SciPy if you want).

• To avoid receiving zero points on this problem, please submit your code (including the best values of $\alpha$ and $\beta$) with the file name [ComputingID]-pos.py or [ComputingID]-pos.ipynb (if you use an IPython notebook).

References

[Eisenstein, 2018] Eisenstein, J. (2018). Natural Language Processing. MIT Press.
