Description
All the example files are under ~/dropbox/19-20/570/hw8/examples/. Also see the slides for hw8 which are explained in class and posted at the schedule page on Canvas.
For hw8, we will use state-emission HMMs where the output symbols are produced by the to-states. Format of HMM files in hw8:: An HMM file (e.g., hmm ex1 and hmm ex2) has two parts: (1) A header that shows the numbers of states, output symbols, and lines for the three probability distributions, and (2) the three distributions (the lg prob field is optional).
The two parts might not be consistent; for instance, the header says that there are 10 states, but the distributions show that there are more than 10 states. In Q3 below, you will write a script that checks whether two parts are consistent, etc. state_num=nn ## the number of states sym_num=nn ## the size of output symbol alphabet init_line_num=nn ## the number of lines for the initial probability trans_line_num=nn ## the number of lines for the transition probability emiss_line_num=nn ## the number of lines for the emission probability \init state prob lg_prob ## prob=\pi(state), lg_prob=lg(prob) … \transition from_state to_state prob lg_prob ## prob=P(to_state | from_state) … \emission state symbol prob lg_prob ## prob=P(symbol | state) …
Questions: 1. Q1 (15 points): Write a script, create 2gram hmm.sh, that takes the annotated training data as input and creates an HMM for a bigram POS tagger with NO smoothing. • The format is: cat training data | create 2gram hmm.sh output hmm 1
• The training data is of the format w1/t1…wn/tn (cf. wsj sec0.word pos)
• The output hmm has the format specified above: – For prob and lg prob, keep 10 dights after the decimal point (same as hw6). – For each probability distribution (initial, transition, and emission probability), the probability lines should be sorted alphabetically on the 1st field (state or from state) first, and then for lines with the same 1st field, sort on the second field.
For instance, the emission probability lines are sorted by state first. For the lines with the same state, sort the lines by symbol. – The example files on patas are not sorted and rounded, as they were created before, so those files are not meant to be gold standard. 2.
Q2 (25 points): Write a script, create 3gram hmm.sh, that takes the annotated training data as input and creates an HMM for a trigram POS tagger WITH smoothing. • The format is: cat training data | create 3gram hmm.sh output hmm l1 l2 l3 unk prob file • The training data is of the format w1/t1…wn/tn (cf. wsj sec0.word pos) • The output hmm has the same format as in Q1. • unk prob file is an input file (not an output file). That is, the file is given to you and you do not need to estimate it from the training data. The file’s format is “tag prob” (see unk prob sec22): prob is P( | tag).
They are used to smooth P(word | tag); that is, for a known word w, Psmooth(w | tag) = P(w | tag) * (1−P( | tag)), where P(w | tag) = cnt(w,tag) cnt(tag) . • l1, l2 and l3 are λ1, λ2, λ3 used in interpolation: Pint(t3 | t1, t2) = λ3P3(t3 | t1, t2) + λ2P2(t3|t2) + λ1P1(t3). • When estimating P3(t3 | t1, t2), if the bigram t1t2 never appears in the training data, both count(t1, t2, t3) and count(t1, t2) will be zeros. The value of dividing zero by zero is undefined. For hw8, for the sake of simplicity, when t1t2 is unseen in the training data, let’s set P3(t3 | t1, t2) to be 1/(|T|+1) when t3 is a POS tag or EOS, and to zero when t3 is BOS. Here, |T| is the size of the POS tagset (which excludes BOS and EOS).
3. Q3 (25 points): Write a script, check hmm.sh, that reads in a state-emission HMM file, check its format, and output a warning file. The main purpose of this exercise is to read in an HMM file and store it in an efficient data structure, as you will use this data structure for HW9. Think about what data structure you want to use to store hmm. • The format is: check hmm.sh input hmm > warning file • Your code should check – whether the two parts of the HMM file are consistent (e.g., the number of states in the header matches that in the distributions), and • If the two parts are not consistent and/or the constraints are not satisfied, print out the warning messages to the warning file (cf. hmm ex1.warning). • In the note file, explain what data structure you use to store the HMM. 2
4. Q4 (10 points): Run the following commands and turn in the files generated by the commands: cat wsj_sec0.word_pos | create_2gram_hmm.sh q4/2g_hmm cat wsj_sec0.word_pos | create_3gram_hmm.sh \ q4/3g_hmm_0.1_0.1_0.8 0.1 0.1 0.8 unk_prob_sec22 cat wsj_sec0.word_pos | create_3gram_hmm.sh \ q4/3g_hmm_0.2_0.3_0.5 0.2 0.3 0.5 unk_prob_sec22 check_hmm.sh q4/2g_hmm > q4/2g_hmm.warning check_hmm.sh q4/3g_hmm_0.1_0.1_0.8 > q4/3g_hmm_0.1_0.1_0.8.warning check_hmm.sh q4/3g_hmm_0.2_0.3_0.5 > q4/3g_hmm_0.2_0.3_0.5.warning The submission should include: • The readme.[txt|pdf] file that includes your answer to Q3. • hw.tar.gz that includes the files specified in submit-file-list). 3