Description
Q1 (45 points): Write a script, viterbi.sh, that implements the Viterbi algorithm. You can reuse some functions from your check hmm.sh in Hw6.
• The format is: viterbi.sh input hmm test file output file • Your code should work for any HMM, not just the HMM for ngram POS taggers: – Do not do anything special for BOS and EOS. Do not insert BOS marker or EOS marker to the input. – Use input hmm as it is. Do NOT smooth it.
For instance, if there is no transition probability line from state si to sj , that means that it is impossible to go from si to sj . If there is no emission line for state sj and output symbol wk, that means that sj cannot generate wk. – Your code should be able to handle unknown “word” in the observation: let the observation be “o1 o2 … on”. For each oi , if oi does not appear in the input hmm at all, oi is an unknown word and should be treated as a special output symbol .
The special symbol can be generated by any state sj whose probability P(< unk >| sj ) is larger than 0. If for some sj , P(< unk >| sj ) is zero or does not appear in the input hmm, that means sj cannot generate this special symbol. • input hmm is an input file: – input hmm is a state-emission hmm and the output symbols are produced by the to-states.
It has the same format as the HMM format in Hw6. – You can assume that the input hmm does not contain any emission probability for empty string (i.e., a state cannot generate an empty string). – For hw7, you don’t need to check whether the three probability distributions (initial, transition, and emission ones) in the input hmm satisfy the constraints that are checked by check hmm.sh in Hw6.
If a line contains a probability that is not in the [0, 1] range, your code just prints out a warning message to stderr (“warning: the prob is not in [0,1] range: $line”, where $line is the line), ignore the line, and continue. • test file is an input file; – Each line is an observation (i.e., a sequence of output symbols). For instance, if you use HMM for POS tagging, an observation will be a sentence (cf. test.word): – Once again, do not insert anything (e.g., BOS or EOS marker) to the observation.
• The format of the output file (cf. sys) is “observ => state seq lgprob”: – state seq is the best state sequence for the observation. The length of state seq should be equal to the length of observ plus one. 1 – lgprob is lg P(observ, state seq); lg(x) is base-10 log. – If there is a tie (i.e., more than one state sequence with the highest probability), you can pick any of those sequences.
Q2 (30 points): Use viterbi.sh for trigram POS tagging: • The input hmm for viterbi.sh is the one for a trigram POS tagger: – The state name has the format “tag1 tag2”, and the output symbol is produced by the to-state. – hw7/examples/hmm[1-5] are some examples of the input hmm. For the transition and emission probability lines, please ignore anything after ##.
• Write your own script, conv format.sh, to convert the format of the output file of viterbi.sh. – The format of the command line is “cat file1 | conv format.sh > file2”. – file1 is the file created by viterbi.sh, and has the format “observ => state seq lgprob”. – file2 has the format “w1/t1 w2/t2 … wn/tn”. where ti is the second tag of the state that generates wi .
– For instance, if file1 has a line “w1 w2 … wn => x t0 t0 t1 t1 t2 … tn−1 tn lgprob”, conv format.sh should print “w1/t1 … wn/tn” to stdout, which can then be redirected to file2. Note that x, t0 (the two tags in the 1st state in the state sequence), and lgprob should NOT be included in the output string. • Run calc tagging accuracy.pl (which is given to you) to calculate the tagging accuracy. – The format is: calc tagging accuracy.pl gold standard sys res > sys res.acc – gold standard and sys res have the format “w1/t1 w2/t2 … wn/tn” (e.g., test.word pos).
– The gold standard for the file test.word is test.word pos, and the sys res is the file created by conv format.sh. • Fill out Tablel 1 with each of the HMM files under hw7/examples/. For instance, to get the accuracy for the first row in Table 1, you should run the following commands: – viterbi.sh hmm1 test.word sys1 – cat sys1 | conv format.sh > sys1 res – calc tagging accuracy.pl test.word pos sys1 res > sys1 res.acc 2>&1
• Submit the files as specified in submit-file-list. The submission should include: • readme.[txt|pdf] that includes Table 1. • hw.tar.gz that includes the files specified in submit-file-list. 2 Table 1: Tagging accuracy HMM model tagging accuracy hmm1 hmm2 hmm3 hmm4 hmm5 3