Description
CS539 Natural Language Processing HW 1
CN Y RD NGLSH WTHT VWLS? In this assignment, you will build a finite-state machine to automatically restore
vowels to vowelless text. (In English, this may not be a very useful thing to do, but in languages like Arabic and
Hebrew, where vowels are regularly omitted (cf. abjad¹), it is extremely useful.)
Before you begin
• Finish EX1 individually.
• (Optional) Read Sections 1 and 2 of "A Primer on Finite-State Software for Natural Language Processing"
(https://www.isi.edu/licensed-sw/carmel/carmel-tutorial2.pdf). A brief overview of some of Carmel's
command-line options (run carmel to see the complete list):
-si expect a string on stdin
-l compose stdin onto the left of the named FSA/FST(s)
-r compose stdin onto the right of the named FSA/FST(s)
-O print only output labels, suppress input labels
-I print only input labels, suppress output labels
-k n list k sequences rather than the whole FSA/FST
-b batch mode: process multiple input lines
-WE suppress weights and empty labels in k-best lists
• Download https://classes.engr.oregonstate.edu/eecs/fall2019/cs539-001/hw1/hw1-data.tgz:
vocab list of English words
vocab.small shorter list (for testing purposes only)
strings English sentences
strings.bad English sentences with bad spelling
eval.py script for measuring accuracy
1 Finite-state acceptors
1. I've created an example FSA try.fsa for sequences consisting of the three English words an, at, and age,
separated by an underscore between consecutive words, and nothing else. It was built using a simple prefix-tree
idea. You can test it like this:
echo "A T _ A N _ A G E" | carmel -slibOWE try.fsa
Based on this example, try to create (by hand) the FSA small.fsa for words in vocab.small. Include a
drawing/illustration of this FSA in your PDF report.
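For reference, a prefix-tree acceptor over just these three words might look roughly like the sketch below. This is not the provided try.fsa; it assumes Carmel's text format as described in the primer (the first line names the final state, the source state of the first arc is taken as the start state, and each arc is written (source (destination label))), and the state names are arbitrary, so check try.fsa itself before imitating it.

W
(S (A1 A))
(A1 (W N))
(A1 (W T))
(A1 (G G))
(G (W E))
(W (S _))

Here S begins a word, A1 is the state reached after reading the shared first letter A, W is the shared word-end (and final) state, and the arc (W (S _)) lets an underscore start the next word.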
2. Write a python program make.py to create an FSA english.fsa that accepts all strings consisting of English
words (as defined by vocab) separated by one underscore between consecutive words, and nothing else. The
command line of your Python program should look like this:
cat vocab | python make.py > english.fsa
(FYI my make.py has only 25 lines.) The FSA should be letter-based, not word-based: that is, transitions should
be labeled with letters, not whole words. For example, it should accept:
T H I S _ I S _ A _ S T R I N G

¹See https://en.wikipedia.org/wiki/Abjad.
(a) How many states and how many transitions does your FSA have? (Use carmel -c english.fsa.)
(b) Verify that your FSA accepts every line in strings and no lines in strings.bad. Show the output of
Carmel on the first five lines of each file. You can use commands like:
cat strings | carmel -slibOWE english.fsa
Hint: you should have either ~250k or ~180k states (and ~360k transitions or fewer). No need to minimize it.
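A rough sketch of what make.py could look like is below. It builds a letter trie over the vocabulary, shares a single word-end state, and prints the machine in Carmel format. The state-naming scheme and the assumed Carmel conventions (final state named on the first line, start state taken from the source of the first arc printed) are my own choices rather than anything required by the assignment, so double-check them against try.fsa and the primer.

import sys

START, END = "START", "END"                 # shared word-begin and word-end states

def state(prefix):                          # one state per proper prefix of a word
    return "q_" + prefix

arcs = set()                                # (source, label, destination)
for line in sys.stdin:
    word = line.strip().upper()
    if not word:
        continue
    for i, letter in enumerate(word):
        src = START if i == 0 else state(word[:i])
        if i == len(word) - 1:
            arcs.add((src, letter, END))    # the last letter finishes the word
        else:
            arcs.add((src, letter, state(word[:i + 1])))

arcs.add((END, "_", START))                 # one underscore between consecutive words

print(END)                                  # first line: the final state
for src, label, dst in sorted(arcs, key=lambda a: a[0] != START):
    print("(%s (%s %s))" % (src, dst, label))

Sharing the END state (rather than giving every word its own final state) keeps the number of states close to the number of distinct prefixes in the vocabulary.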
2 Finite-state transducers
3. Create an FST in Carmel format called remove-vowels.fst that deletes all English vowels, preserving word-boundary
information. For the purposes of this assignment, vowels are defined to be members of {A, E, I, O, U}.
For example, it should perform the following mappings:
Y O U _ A R E _ H E R E -> Y _ R _ H R
R E A D _ A _ B O O K -> R D _ _ B K
(a) Draw your FST in your PDF report in enough detail for someone else to replicate it.
(b) Test your FST in the forward direction on strings with the following command:
cat strings | carmel -slibOEWk 1 remove-vowels.fst
Show the output on the first five lines, and save the whole output to a file strings.novowels.
(c) Test your FST in the backward direction with the following command:
echo "B L D N G" | carmel -sriIEWk 10 remove-vowels.fst
The -k 10 option asks for up to 10 output strings. Just list them.
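One plausible shape for remove-vowels.fst is a single state that is both start and final, with a looping arc for every symbol: vowels output the empty string, and everything else (consonants and the underscore) outputs itself. The little generator below is only a sketch of that idea; it assumes Carmel writes transducer arcs as (source (destination input output)) and uses *e* for the empty symbol, so verify both against the primer before relying on it.

import string

vowels = set("AEIOU")
print(0)                                    # a single state, both start and final
for ch in list(string.ascii_uppercase) + ["_"]:
    out = "*e*" if ch in vowels else ch     # *e*: Carmel's empty symbol (assumption)
    print("(0 (0 %s %s))" % (ch, out))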
4. Now you can use backwards application of your FST to do vowel restoration.
(a) Run your vowel restorer on strings.novowels, using the following command:
cat strings.novowels | carmel -sribIEWk 1 remove-vowels.fst
Show the output on the first five lines, and save the whole output to a file strings.restored.
(b) Compute your vowel-restoration accuracy using the command:
python eval.py strings strings.restored
What was your accuracy?
(c) Why is the score so low?
Hint: should be around 1.3%.
3 Combining FSAs and FSTs
5. The vowel restorer you built in the previous part had a problem. Can you fix it by combining your FSA and
FST?
(a) Describe how to combine your FSA and FST to improve vowel restoration.
(b) Implement your idea and test it on strings.novowels. What is your new accuracy?
(c) Are the results satisfactory? Why doesn't the machine do what a human would do?
Hint: should be around 30%.
6. How might your vowel restorer be further improved? Come up with at least one idea, and for each idea:
(a) Describe it in enough detail so that someone else could replicate it.
(b) Implement it and try it on strings.novowels.
(c) Report your new accuracy on strings.novowels. You will be graded on your final accuracy.
Hint: very easy to get above 90%.
Submit your code (*.py), your FSAs/FSTs (*.fsa, *.fst), your outputs (strings.*), and report.pdf.
CS539 Natural Language Processing HW 2
In Japanese text, foreign words borrowed in the modern era (such as Western names and technical terms)
are transliterated into special symbols called katakana. Katakana is a syllabary,¹ in which most symbols
stand for syllable sounds (such as ko or ra). Because spoken Japanese consists largely of consonant-vowel
syllables, most katakana symbols stand for two Japanese phonemes. In this assignment, among other things,
weβll decode katakana words of English origin back into English. Refer to the slides for this section and the
tutorial on writing systems and transliteration for more information (available from the course website).
Download https://classes.engr.oregonstate.edu/eecs/fall2019/cs539-001/hw2/hw2-data.tgz which
includes:
eword.wfsa a unigram WFSA of English word sequences
epron.wfsa a trigram WFSA of English phoneme sequences
eword-epron.data an online dictionary of English words and their phoneme sequences
eword-epron.wfst a WFST from English words to English phoneme sequences
epron-eword.wfst inverse transducer; the result of carmel -v eword-epron.wfst
epron-espell.wfst a WFST from English phoneme sequences to English letter sequences
epron-jpron.data a database of aligned English/Japanese phoneme sequence pairs
jprons.txt a short list of Japanese Katakana sounds to decode
epron.probs a human-readable version of epron.wfsa.
1 Shannon Game and Entropy of English (5 pts)
1. Each team member plays the game once.² Report the entropy each time (take a screenshot).
2. Explain how the entropy was calculated in this game (with equations).
2 Part-of-Speech Tagging as WFST Composition (10 pts)
Recall that in the lectures we did POS tagging for "I hope that this works" and "They can fish" by composing
a chain (word sequence), a flower (word/tag lexicon), and a tag bigram. Now you need to
1. build a Carmel WFSA bigram.wfsa for the tag bigram model (similar to the one from the slides, but
add PREP and AUX for prepositions and aux. verbs); you can assign probabilities in any reasonable way.
2. build a Carmel WFST lexicon.wfst for the word/tag lexicon (big enough to cover the examples
below); again, you can assign probabilities in any reasonable way, but make sure your pipeline is
mathematically sound!
3. compose them to solve POS tagging for the above two sentences plus the following examples (show the
resulting tag sequences from Carmel output):
(a) A panda eats shoots and leaves
(b) They can can a can
(c) Time flies like an arrow

¹See https://en.wikipedia.org/wiki/Katakana and https://en.wikipedia.org/wiki/Syllabary.
²Try https://classes.engr.oregonstate.edu/eecs/fall2019/cs539-001/shannon.tgz and run appletviewer index.html.
4. Why is your composition pipeline mathematically sound? Write the equations.
5. What funny interpretations did you observe? What are possible ways to fix them? (no need to
implement them)
3 Pronouncing and Spelling English (25 pts)
1. Pick some English words and pronounce them with eword-epron.wfst. This transducer is essentially
a giant lookup table built from eword-epron.data. Turn in five (5) examples. Here is one:
echo HELLO | carmel -sliOEQk 5 eword-epron.wfst
2. Now let's try to pronounce character sequences without whole-word lookup (since we may face unknown
words, for example). You can send a character string backwards through epron-espell.wfst. Try
several words and turn in your examples; here is a suggested command:
echo 'H E L L O' | carmel -sriIEQk 5 epron-espell.wfst
Did you get some nonsense output? Why? Explain it in terms of probabilistic modeling.
3. Let's fix the nonsense by adding a language model of likely pronunciations, epron.wfsa. Try several
and turn in your examples.
echo 'H E L L O' | carmel -sriIEQk 5 epron.wfsa epron-espell.wfst
4. Now spell out some phoneme sequences by using epron-espell.wfst in the forward direction. Try
several and turn in your examples.
echo 'HH EH L OW' | carmel -sliOEQk 50 epron-espell.wfst
5. How to improve the above results? Well, you can try to take advantage of the language model
eword.wfsa (as a filter). But you need a "bridge" from espell to eword. That's trivial, isn't it? Write
a simple Python program gen_espell_eword.py to generate espell-eword.wfst from the word list
in eword-epron.data. And now you can:
echo 'HH EH L OW' | carmel -sliOEQk 50 epron-espell.wfst espell-eword.wfst eword.wfsa
Try several examples and turn them in. Include gen_espell_eword.py and espell-eword.wfst in
the submission as well. (This only works for words that appear in both espell-eword.wfst and eword.wfsa.)
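The "bridge" transducer only has to glue a word's spelling (a letter sequence) to the word token itself. Below is a minimal sketch of gen_espell_eword.py; it assumes that the English word is the first whitespace-separated field on each entry line of eword-epron.data, that Carmel accepts arcs of the form (source (destination input output)), and that *e* is the empty symbol. None of those assumptions are guaranteed by the handout, so verify them first.

import sys

seen, arcs = set(), []
for line in sys.stdin:                      # assumption: the word is the first field of each line
    fields = line.split()
    if not fields or fields[0] in seen:
        continue
    word = fields[0]
    seen.add(word)
    prev = "START"
    for i, ch in enumerate(word):
        last = (i == len(word) - 1)
        dest = "END" if last else "%s_%d" % (word, i)
        out = word if i == 0 else "*e*"     # emit the whole word on its first letter, then epsilons
        arcs.append("(%s (%s %s %s))" % (prev, dest, ch, out))
        prev = dest

print("END")                                # final state on the first line
for a in arcs:                              # the first arc starts at START, making it the start state
    print(a)

A possible invocation (also an assumption, since the handout does not fix one): python gen_espell_eword.py < eword-epron.data > espell-eword.wfst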
6. Well, alternatively, you can go directly from epron to eword:
echo 'HH EH L OW' | carmel -sliOEQk 50 epron-eword.wfst eword.wfsa
Is this better than 5)? Try the same set of examples and compare the results with those from 5).
7. Try to pronounce some words that are not in the data file eword-epron.data. Try several examples
and turn them in.
echo WHALEBONES | carmel -sliOEQk 5 eword-epron.wfst
8. Now try pronouncing letter sequences of those same words. Try several examples and turn them in.
echo 'W H A L E B O N E S' | carmel -sriIEQk 5 epron.wfsa epron-espell.wfst
9. How about phoneme sequences of new words? What happens when you use eword-epron.wfst versus
epron-espell.wfst? Try several examples and turn them in.
echo 'W EY L B OW N Z' | carmel -sriIEQk 5 eword-epron.wfst
echo 'W EY L B OW N Z' | carmel -sliOEQk 5 epron-espell.wfst
10. You can hook up these transducers in unexpected ways. What is the following command trying to
accomplish?
echo 'BEAR' | carmel -sliOEQk 10 eword-epron.wfst epron-eword.wfst eword.wfsa
11. Write up a half-page or so of observations from your experiments. What did you learn about these
automata and what they can (or can't) do? Please take time to express your thoughts clearly using
complete sentences.
4 Decoding English Words from Japanese Katakana (80 pts)
1. Look at epron-jpron.data. Each pair in this file consists of an English phoneme sequence paired with
a Japanese phoneme sequence. For each pair, Japanese phonemes are assigned integers telling which
English phonemes map to them. For example:
AE K T ER ;; English phoneme sequence for 'actor'
A K U T A A ;; Same word, loaned into Japanese
1 2 2 3 4 4 ;; e.g., Japanese T maps to the 3rd English sound
From the data, we can see that each English phoneme maps to one or more Japanese phonemes.
Estimate the channel probabilities p(j | s), where j represents one or more Japanese phonemes and
s an English phoneme, using the simplest maximum likelihood estimation (MLE). Write a Python
program estimate.py that generates the file epron-jpron.probs in the format below. Each line
represents one p(j | s); you may omit entries with probability < 0.01, as well as entries where one
English phoneme maps to more than three (3) Japanese phonemes, since the latter are mostly noise:
AE : A # 0.95
AE : Y A # 0.05
T : T # 0.4
T : T O # 0.4
T : TT O # 0.1
T : TT # 0.1
R : A A # 0.8
R : E R # 0.1
R : E R U # 0.1
…
Transitions for each English phoneme should be consecutive (see above). Include estimate.py and
epron-jpron.probs.
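MLE here is just count-and-divide over the aligned triples: count how often English phoneme s is aligned to Japanese phoneme group j, then normalize per s. The sketch below assumes epron-jpron.data stores each pair as three non-blank lines (English phonemes, Japanese phonemes, 1-based alignment indices), possibly with ";;" comments as in the excerpt above; check the real file format before trusting that.

import sys
from collections import defaultdict

def read_triples(stream):
    lines = []
    for raw in stream:
        line = raw.split(";;")[0].strip()   # drop any ';;' comments
        if line:
            lines.append(line.split())
        if len(lines) == 3:
            yield lines[0], lines[1], [int(x) for x in lines[2]]
            lines = []

counts = defaultdict(lambda: defaultdict(float))
for eprons, jprons, align in read_triples(sys.stdin):
    groups = defaultdict(list)              # English position -> its Japanese phonemes
    for jphon, idx in zip(jprons, align):
        groups[idx].append(jphon)
    for idx, jgroup in groups.items():
        if len(jgroup) <= 3:                # drop the noisy 4+ phoneme groups
            counts[eprons[idx - 1]][" ".join(jgroup)] += 1

for e, row in counts.items():               # count-and-divide; one English phoneme per block
    total = sum(row.values())
    for j, c in sorted(row.items(), key=lambda kv: -kv[1]):
        p = c / total
        if p >= 0.01:
            print("%s : %s # %.4g" % (e, j, p))

Reading from stdin (cat epron-jpron.data | python estimate.py > epron-jpron.probs) is an assumption; the handout only fixes the output file name.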
2. Now, based on this, you can create a WFST called epron-jpron.wfst. It should probabilistically map
English phoneme sequences onto Japanese ones, behaving something like this in the forward direction:
echo 'L AE M P' | carmel -sliOEQk 5 epron-jpron.wfst
and you'll get a list like
R A M P
R U A M P
R A M U P
R A M P U
R A M PP U
…
Most transducers will consist of 300 transitions or so.
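One way to turn epron-jpron.probs into such a transducer is a single start/final state with, for every entry "s : j1 j2 ... # p", a path that reads the English phoneme s with weight p and writes the Japanese phonemes one per arc through fresh intermediate states. The sketch below assumes Carmel arcs of the form (source (destination input output weight)), *e* as the empty symbol, and the probs format shown above; all three are assumptions to verify against the primer.

import sys

arcs, fresh = [], 0
for line in sys.stdin:                      # e.g. "T : T O # 0.4"
    if ":" not in line or "#" not in line:
        continue
    lhs, prob = line.rsplit("#", 1)
    e, js = lhs.split(":")
    e, js, prob = e.strip(), js.split(), prob.strip()
    prev = "0"
    for i, j in enumerate(js):
        last = (i == len(js) - 1)
        dest = "0" if last else "x%d" % fresh
        inp = e if i == 0 else "*e*"        # read the English phoneme once
        w = prob if i == 0 else "1"         # carry the probability on the first arc
        arcs.append("(%s (%s %s %s %s))" % (prev, dest, inp, j, w))
        if not last:
            fresh += 1
            prev = dest

print("0")                                  # state 0 is both start and final
for a in arcs:
    print(a)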
3. Using your knowledge about Japanese syllable structure (see slides), do these results (including their
rankings) make sense? If not, why (briefly explain)?
4. How could you improve the results to sound more "Japanese-like"? (No need to implement it.)
5. Now consider the following Japanese katakana sequences, from a Japanese newspaper (actually these
are the phoneme sequences that are easily read off from the katakana characters):
H I R A R I K U R I N T O N
D O N A R U D O T O R A N P U
B I D E O T E E P U
H O M A A SH I N P U S O N
R A PP U T O PP U
SH E E B I N G U K U R I I M U
CH A I R U D O SH I I T O
SH I I T O B E R U T O
SH I N G U R U R U U M U
G A A R U H U R E N D O
T O R A B E R A A Z U TCH E KK U
B E B I I SH I TT A A
S U K O TT O R A N D O
B A I A R I N K O N TCH E R U T O
A PP U R U M A KK U B U KK U P U R O
K O N P I U U T A S A I E N S U
H I J I K A R U T O R E E N I N G U
H I J I K A R U E K U S A S A I S U
A I S U K U R I I M U
H O TT O M I R U K U
T O R I P U R U R U U M U
K U R A U N P U R A Z A H O T E R U
H E E S U B U KK U R I S A A TCH I S A I E N T I S U T O
W O R U H U G A N G U M O TS U A R U T O
(For your convenience I have included a jprons.txt for the above sequences.)
How would you decode them into English phoneme sequences using carmel (with the WFST you just
built and with help of epron.wfsa)? Show the command-line and results (with probs, with -k 5
option). Do they make sense (as English words)?
6. Now redo the above exercise, but this time into English words by assembling eword.wfsa, eword-epron.wfst,
and epron-jpron.wfst with carmel. Show the command-line and results (with probs, with -k 5). Do
you think the results make more sense this time? Briefly explain.
7. Some desired outputs were not ranked top. Why? How would you improve the results? (No need to
implement any of these)
8. Search for at least five (5) other Japanese katakana examples not covered in this HW or in the slides,
and try to decode them back into English using the above approach. Try to be creative. They should
preferably be two-word phrases or compound words, like "shopping center" or "piano sonata". You
can use Google Translate (from English to Katakana) to verify your results.
9. Are there other ways to decode into English words given all these existing WFSAs and WFSTs (with
some tweaking)? What are the differences between these methods (speed, accuracy, etc.)?
10. If you can make some new WFSAs/WFSTs such as espell.wfsa and espell-epron.wfst, what else
can you do to decode jprons.txt (not necessarily to English words)? (No need to implement these)
Submit a single file: hw2.zip, which includes *.{wfst,wfsa,probs,py} and separate report.pdf.
CS539 Natural Language Processing HW 3
In HW2, we did POS tagging, English pronunciation and spelling, and Katakana-to-English back-transliteration, all using Carmel. In this HW, you will implement the famous Viterbi algorithm used by Carmel for all these decodings; in doing so, you will implicitly implement "composition" as well. In other words, you can view this HW as the "under-the-hood" version of HW2. You will again need all the files we provided for HW2:

eword.wfsa a unigram WFSA of English word sequences
epron.wfsa a trigram WFSA of English phoneme sequences
eword-epron.data an online dictionary of English words and their phoneme sequences
eword-epron.wfst a WFST from English words to English phoneme sequences
epron-eword.wfst inverse transducer; the result of carmel -v eword-epron.wfst
epron-espell.wfst a WFST from English phoneme sequences to English letter sequences
epron-jpron.data a database of aligned English/Japanese phoneme sequence pairs
jprons.txt a short list of Japanese Katakana sounds to decode
epron.probs a human-readable version of epron.wfsa

In addition, you will also need epron-jpron.probs and epron-jpron.wfst from your HW2 submission. In case you didn't get everything correct, we have provided our reference solutions in https://classes.engr.oregonstate.edu/eecs/fall2019/cs539-001/hw3/hw3-data.tgz. (Note that all expected outputs are in the 5-bests, and only in a few cases are the 1-bests confused with similar-sounding words, e.g., SHEET vs. SEAT, SAVING vs. SHAVING, MAKE BOOK vs. MAC BOOK, etc.)

1 Part-of-Speech Tagging (10 pts)
Redo the POS tagging examples by implementing a bigram Viterbi algorithm (name it tagging.py) that outputs the most probable POS tag sequence (and its probability) for an input sentence. Verify your results against the Carmel outputs from HW2. After finishing bigram tagging, please work on trigram tagging, which paves the way for Part 2. This question is graded by completeness, not by correctness.

2 Decoding Katakana to English Phonemes (35 pts)
Now implement your own Viterbi decoding in Python for HW2 Question 3.5 (jprons to eprons). Your command line should be:
echo -e 'P I A N O\nN A I T O' | ./decode.py epron.probs epron-jpron.probs
with a result like the following (note that the input represents the Japanese katakana PI A NO, not the English word!):
P IY AA N OW # 1.489806e-08
N AY T # 8.824983e-06
For your convenience, we have converted epron.wfsa into a human-readable epron.probs, in the same format as epron-jpron.probs above, e.g.:
T S : </s> # 0.774355
T S : AH # 0.0333625
T S : Z # 2.92651e-07
which corresponds to our intuition that the sounds T S (words ending in -ts) are very likely to end a sentence or a word, that T S AH is much less likely (but still noticeable in words like WATSON and BOTSON), and that T S Z is (almost) completely unheard of (the tiny prob here is due to smoothing).
You only need to print the 1-best solution and its probability. Please
1. Describe your algorithm in English first.
2. Define the subproblem and recurrence relations.
3. Analyze its complexity.
4. Implement it in Python. Your grade will depend on both correctness and efficiency.
Note that each English phoneme corresponds to one to three Japanese phonemes (the PIANO example is 1-1 and is too simple; the NIGHT/KNIGHT example is more interesting), so the Viterbi algorithm from the slides needs to be extended a little bit. Compare your 1-best results with Carmel's. Try your best to match them. Include a side-by-side comparison (hint: diff -y).
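For concreteness, here is one way the subproblem might be set up (a sketch only, assuming epron.probs is a trigram model that conditions each English phoneme on the previous two, and writing j_1 ... j_m for the input Japanese phonemes). Let best[t][e', e] be the probability of the best way of generating j_1 ... j_t such that the last two English phonemes are e' and e. Since each English phoneme covers one to three Japanese phonemes,

    best[t][e', e] = max over e'' and 1 <= L <= 3 of
                     best[t - L][e'', e'] * p(e | e'' e') * p(j_{t-L+1} ... j_t | e)

with best[0][<s>, <s>] = 1, and the final answer is the maximum over (e', e) of best[m][e', e] * p(</s> | e' e). Backpointers recover the 1-best sequence as usual.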
Hint: if you want to make sure your program is 100% correct with respect to Carmel, you can compare their results by:
cat jprons.txt | ./decode.py epron.probs epron-jpron.probs > results.my
cat jprons.txt | carmel -bsriIEQk 1 epron.wfsa epron-jpron.wfst 2>/dev/null \
  | awk '{for (i=1;i<NF;i++) printf("%s ", $i); printf("# %e\n", $NF)}' \
  > results.carmel
diff -b results.my results.carmel
You should see no output if you did everything correctly. To generate a side-by-side comparison, replace diff -b by diff -by.
FYI, my not-very-optimized implementation is 60 lines of Python and uses about 2-4x the time of Carmel. Your implementation will be graded in terms of both correctness and efficiency.

3 K-Best Output (15 pts)
(kbest.py) Now extend your Viterbi algorithm to output k-best sequences (and their corresponding probabilities). Implement it for both POS tagging and Katakana-to-English. Verify the results against Carmel using the diff method above (kbest.py should agree with the Carmel output for k = 1..10). Efficiency is part of the grade.
Hint: see Huang and Chiang (2005). Algorithm 2 is enough to receive full credit for this part.

4 Extra Credit: Decode Katakana to English WORDS (30 pts)
(Do NOT attempt this problem until you finish everything else. This is the hardest problem in this course.)
Now decode with the HW2 approach, using eword.wfsa, eword-epron.wfst, and epron-jpron.wfst.
1. Describe your algorithm in English first.
2. Define the subproblem and recurrence relations.
3. Analyze its complexity.
4. Implement decode_word.py:
echo -e 'P I A N O N A I T O' | ./decode_word.py eword.probs eword-epron.data epron-jpron.probs
PIANO NIGHT # 1.856048e-11
Here eword.probs is just a human-readable format of eword.wfsa. Verify your results with Carmel on jprons.txt. Just the 1-best is more than enough!
UPDATE: My code is orders of magnitude faster than Carmel, thanks to the trie. (One diff: BABYSITTER is in eword-epron.data but not in eword-epron.wfst.)
CS539 Natural Language Processing, Homework 4
This homework is about unsupervised learning (using the EM algorithm) of the alignment between English and Japanese phonemes in English-Katakana pairs. Recall from HW2/3:
AE K T ER ;; English phoneme sequence for 'actor'
A K U T A A ;; Same word, loaned into Japanese
1 2 2 3 4 4 ;; e.g., Japanese T maps to the 3rd English sound
But in this homework, you need to learn the alignments yourself. In fact, the alignments we gave you in HWs 2-3 were also learned by EM, and thus are not 100% correct. You will reuse the following files:

epron-jpron.data a database of aligned English/Japanese phoneme sequence pairs (ignore the alignments; your job is to learn them!)
eword.wfsa a unigram WFSA of English word sequences
eword-epron.wfst a WFST from English words to English phoneme sequences
jprons.txt a short list of Japanese Katakana sounds to decode

1 Complete Data, Incomplete Model (10 pts)
If we're given the following manually aligned pairs (English "boat" and "test"):
B OW T        T EH S T
B O O T O     T E S U T O
1 2 2 3 3     1 2 3 3 4 4
What is the maximum likelihood estimate for p(T | T) and p(T O | T)?

2 Complete Model, Incomplete Data (10 pts)
Given the following conditional probabilities,
p(B | B) = 1
p(O O | OW) = 0.8    p(O | OW) = 0.2
p(T | T) = 0.3    p(T O | T) = 0.5    p(O | T) = 0.1    p(O T O | T) = 0.1
Enumerate (by hand) all legal alignments for the "boat" example above, and compute the probability and normalized probability of each alignment. Which one is the Viterbi alignment?

3 EM, Implementation I: Enumerating All Alignments (70 pts)
Now implement the EM algorithm:
1. initialize the conditional probabilities to uniform
2. repeat:
3.   E-step:
4.     for each English-Katakana pair:
5.       enumerate all legal alignments for that E-K pair, along with their respective probabilities
6.       renormalize the probabilities to get a distribution over these alignments
7.       collect the fractional counts for this E-K pair
8.   M-step: count-and-divide based on the collected fractional counts from all pairs to get the new prob. table
9. until reached maximum iteration or converged

You should proceed with the following steps:
1. (15 pts) To start with, you should work on the simplest scenario (the "wine" example): W AY N / W A I N. Print the following information in each iteration: 1) the corpus probability, 2) the new prob. table (>= 0.01), and 3) the number of nonzero entries in the prob. table (>= 0.01).
Note: the corpus probability should increase in each iteration (we have the proof), and the number of nonzero entries should decrease (as the model gets peakier).
For example, you should run
echo -e "W AY N\nW A I N\n" | ./em.py 5
and get the following (note that the data should be read in from standard input, and 5 is the number of iterations):
iteration 0 -----
corpus prob= 0.00411522633745
AY|-> A: 0.33 A I: 0.33 I: 0.33
W|-> W: 0.67 W A: 0.33
N|-> N: 0.67 I N: 0.33
nonzeros = 7
iteration 1 -----
corpus prob= 0.296296296296
AY|-> A I: 0.50 A: 0.25 I: 0.25
W|-> W: 0.75 W A: 0.25
N|-> N: 0.75 I N: 0.25
nonzeros = 7
...
iteration 4 -----
corpus prob= 0.912659654395
AY|-> A I: 1.00
W|-> W: 1.00
N|-> N: 1.00
nonzeros = 3
For your debugging convenience you can also print all possible alignments and their unnormalized and normalized probabilities in each iteration. Include the result of 5 iterations in your report.
2. (5 pts) If you add N AY N / N A I N to it you should see it converge faster. Show the results.
3. (5 pts) Now that you have passed "wine", try the "test" example above (T EH S T / T E S U T O) and you'll see that EM converges to something wrong. Show me the result of 5 iterations. Now come up with a minimal second example so that, when combined with the "test" example, EM will learn the correct alignments. This second pair has to be a real English word, and should be chosen from epron-jpron.data.
4. (8 pts) Now come up with an example dataset where EM does not learn anything and is still ambivalent at the end (i.e., the final probability table is (almost) the same as the initial one). In the lecture we gave B IY / B I I as one such example, but can you come up with something bigger, i.e., longer sequences and multiple, interdependent pairs?
5. (15 pts) If you have passed all previous questions, it's now time to work on the whole dataset with the following command line:
cat epron-jpron.data | ./em.py 15 >epron-jpron.probs 2>epron-jpron.logs
(Ignore the alignment lines in the input; your job is to learn them yourself!) The logs file contains the logging info for all iterations (see above), and the final prob. table is printed to the .probs file. This file should follow the same format as in HW3 (ignore < 0.01 entries).
Note: do not reestimate probs from the final Viterbi alignments; you'll get the HW3 probs that way.
Hint: the corpus probs are likely to underflow (beyond the range of Python float). Try to fix it.
6. (7 pts) Decode jprons.txt from HW3 using your newly learned epron-jpron.probs. Compare with your results from HW3: did you see any differences?
7. (5 pts) Generate Viterbi alignments from epron-jpron.probs:
cat epron-jpron.data | ./viterbi_align.py epron-jpron.probs >epron-jpron.viterbi
The result file should follow the same format as epron-jpron.data so that we can compare them.
8. (5 pts) Calculate the number of Viterbi alignments that differ between your epron-jpron.viterbi and epron-jpron.data. It should be less than 10 (out of 2684 examples).

4 EM, Implementation II: Forward-Backward (DP) (30 pts)
Enumerating all alignments is not the best idea, since in the worst case there are exponentially many such alignments per example.
1. (5 pts) BTW, how many possible alignments are there for a pair of n English phonemes and m Japanese phonemes (n ≤ m)? What pair(s) in epron-jpron.data give the highest number of alignments?
2. (20 pts) Implement the forward-backward algorithm (replacing lines 5-7 in the pseudocode above) so that you can do dynamic programming to collect fractional counts without enumerating all alignments. Make sure the two approaches match in their results. Include a detailed pseudocode of your implementation. What is the complexity of this forward-backward algorithm for an (n, m) pair? Try to optimize your program (em2.py). How does it compare to the enumeration approach in terms of speed? Part of the grade will depend on efficiency. (5 pts pseudocode/complexity, 10 pts correctness of implementation, 5 pts efficiency.)
3. (5 pts) Try to construct an example where em2.py is substantially faster than em.py, and explain why.

5 EM for Decipherment (30 pts)
Can you decode this? Hint: letter substitution cipher (i.e., a permutation of 27 chars).
gjkgcbkycnjpkovryrbgkcrvsczbkyhkahqvykgjvjkcpekrbkjxdjayrpmkyhkmhkyhkyvrcukrpkbjfjvcukihpygb
oqykcykujcbykhpjkejihavcyrakvjmrijkjxrbybkrpkjfjvzkiclhvkarfrurtcyrhpkhvkaquyqvjkrpkygjkshvue
auqobkrpkvjajpykzjcvbkgcfjkwhqpekohvcbkbhkerwwraquykyhkpjmhyrcyjksrygkygcykicpzkbgqpkgrbkaurjpyb
gjkbcrekygjkoqvjcqkiqbykgcfjkygjkerbdqyjkvjbhufjekozkwjovqcvzkrpkhvejvkyhkdjvwhvikygjkajpbqb
oqykygjkdqouraryzkbqvvhqperpmkygjkejocaujkajvycrpuzkbjvfjekyhkwhaqbkckbdhyurmgykhpkyghbjkjwwhvyb

(Wait: how come there is no space at all, given that space also participated in the permutation??)
Using a bigram model trained on EX2's train.txt with 50 iterations should be good enough.
Hint: as shown in the slides, start with a uniform model p(c | e) = 1/27 for all c, e (the LM does not change!):

p(c1 ... cn) = Σ_{e1...en} p(c1 ... cn, e1 ... en)
             = Σ_{e1...en} p(e1 ... en) · p(c1 ... cn | e1 ... en)                                    (1)
             = Σ_{e1...en} p(e1 ... en) · p(c1 | e1) ··· p(cn | en)                                   (2)
             = Σ_{e1...en} p(e1 | <s>) ··· p(ei | ei-1) ··· p(</s> | en) · p(c1 | e1) ··· p(cn | en)  (3)

After each iteration, print some debugging information like this (probs ≤ 0.01 are considered zeros):
$ cat cipher.txt | ./decipher2.py train.txt 50
epoch 1 logp(corpus)= -2270.05 entropy= 4.79 nonzeros= 567
...
epoch 50 logp(corpus)= -1606.14 entropy= 3.39 nonzeros= 76
Notice that the initial entropy is similar to the 0-gram's, and the final entropy is similar to the 2-gram's (see the LM slides). Why?
Optional: can you do trigram? (It should be better: the stronger your LM is, the better you can decipher.)
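To make the E-step concrete: under the bigram LM and channel model in equation (3), the expected counts that the M-step needs can be collected with forward-backward quantities over the ciphertext positions. This is a sketch of the standard derivation (same notation as above), not the only way to organize the computation:

α_0(<s>) = 1 (and 0 for every other symbol)
α_i(e) = Σ_{e'} α_{i-1}(e') · p(e | e') · p(ci | e)           for i = 1..n
β_n(e) = p(</s> | e)
β_i(e) = Σ_{e'} p(e' | e) · p(c_{i+1} | e') · β_{i+1}(e')     for i = n-1..1
p(c1 ... cn) = Σ_e α_n(e) · β_n(e)
count(e, c) = Σ_{i : ci = c} α_i(e) · β_i(e) / p(c1 ... cn)

The M-step then sets p(c | e) = count(e, c) / Σ_{c'} count(e, c'). Working in log space (or rescaling α and β at each position) avoids the underflow mentioned earlier.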
CS539 HW5: Syntax and Parsing
In this assignment you will:
1. train a probabilistic context-free grammar (PCFG) from a treebank (this requires binarization);
2. build a simple CKY parser (which handles unary rules), and report its parsing accuracy on test data.
Download https://classes.engr.oregonstate.edu/eecs/fall2019/cs539-001/hw5/hw5-data.tgz:

corpora:
train.trees training data, one tree per line
test.txt test data, one sentence per line
test.trees gold-standard trees for the test data (for evaluation)
scripts:
tree.py Tree class and reading tool
evalb.py evaluation script

1 Learning a grammar (50 pts)
The file train.trees contains a sequence of trees, one per line, each in the following format:
(TOP (S (NP (DT the) (NN boy)) (VP (VB saw) (NP (DT a) (NN girl)))))
(TOP (S (NP (PRP I)) (VP (VB need) (S (VP (TO to) (VP (VB arrive) (ADVP (RB early)) (NP (NN today))))))))
(TOP (SBARQ (WHNP (WDT what) (NN time)) (SQ (VBZ is) (NP (PRP it)))))
N.B. tree.py provides tools for you to read in these trees from file (building a Tree object from a string).
[Figure: drawings of the three example trees above.]
Note that these examples are interesting because they demonstrate:
• unary rules on preterminals (POS tags) like NP -> PRP and NP -> NN;
• unary rules on nonterminals (constituents) like S -> VP and TOP -> S;
• ternary rules like VP -> VB ADVP NP (so that you need binarization);
• NP can have (at least) three roles: subject (the boy), object (a girl), and temporal-adverbial (today). The Penn Treebank does annotate these roles (e.g., NP-SBJ, NP-TMP), but for simplicity reasons we removed them for you;
• a sentence can be a declarative sentence (S) or a wh-question (SBARQ), among others.
Your job is to learn a probabilistic context-free grammar (PCFG) from these trees.

Q1.a Preprocessing: replace all one-count words (those occurring only once, case-sensitive) with <unk>:
cat train.trees | python replace_onecounts.py > train.trees.unk 2> train.dict
where train.trees.unk is the resulting trees with <unk> symbols, and train.dict is a file of words which appear more than once (one word per line). Explain why we do this.

Q1.b Binarization: we binarize a tree (recursively) in the following way: a node A with children B C D becomes
(A B (A' C D))
e.g., a VP with children VB NP PP ADVP becomes
(VP VB (VP' NP (VP' PP ADVP)))
cat train.trees.unk | python binarize.py > train.trees.unk.bin
What are the properties of this binarization scheme? Why would we use such a scheme? (What are the alternatives, and why don't we use them?) Is this binarized CFG in Chomsky Normal Form (CNF)?

Q1.c Learn a PCFG from the trees:
cat train.trees.unk.bin | python learn_pcfg.py > grammar.pcfg.bin
The output grammar.pcfg.bin should have the following format:
TOP
TOP -> S # 1.0000
S -> NP VP # 0.8123
S -> VP # 0.1404
VP -> VB NP # 0.3769
VB -> saw # 0.2517
...
where the first line (TOP) denotes the start symbol of the grammar, and the number after # in each rule is its probability, with four digits of accuracy. We group rules into three categories: binary rules (A -> B C), unary rules (A -> B), and lexical rules (A -> w). How many rules are there in each group?

2 CKY Parser (70 pts)
Now write a CKY parser that takes a PCFG and a test file as input and outputs, for each sentence in the test file, the best parse tree in the grammar.
For example, if the input file is
the boy saw a girl
I need to arrive early today
the output (according to some PCFG) could be:
(TOP (S (NP (DT the) (NN boy)) (VP (VB saw) (NP (DT a) (NN girl)))))
(TOP (S (NP (PRP I)) (VP (VB need) (S (VP (TO to) (VP (VB arrive) (ADVP (RB early)) (NP (NN today))))))))

Q2.a Design a toy grammar toy.pcfg and its binarized version toy.pcfg.bin such that the above two trees are indeed the best parses for the two input sentences, respectively. How many strings do these two grammars generate?

Q2.b Write a CKY parser (a sketch of the core loops appears at the end of this section). Your code should have the following input-output protocol:
cat toy.txt | python cky.py toy.pcfg.bin > toy.parsed
Verify that you get the desired output in toy.parsed. Note that your output trees should be debinarized (see the examples above).

Q2.c Now that you have passed the toy test case, apply your CKY parser to the real test set:
cat test.txt | python cky.py grammar.pcfg.bin > test.parsed
Your program should handle (any levels of) unary rules correctly, even if there are unary cycles. How many sentences failed to parse? Your CKY code should simply output a line NONE for these cases (so that the number of lines in test.parsed equals that of test.txt). What are the main reasons for parsing failures?

Q2.d Now modify your parser so that it first replaces words that appear once or less in the training data with <unk>:
cat test.txt | python cky.py grammar.pcfg.bin train.dict > test.parsed.new
Note that:
• train.dict should be treated as an optional argument: your CKY code should work in either case! (And please submit only one version of cky.py.)
• your output tree, however, should use the original words rather than the <unk> symbol. Like binarization, <unk> should be treated as an internal representation only.
Now how many sentences fail to parse?

Q2.e Evaluate your parsing accuracy by
python evalb.py test.parsed test.trees
python evalb.py test.parsed.new test.trees
Report the labeled precisions, recalls, and F-1 scores of the two parsing results. Do their differences make sense?
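To make the shape of the parser concrete, here is a compact sketch of the core CKY loops with a simple unary-closure pass. It multiplies probabilities directly and omits <unk> handling, debinarization, failure output, and tree reconstruction; the grammar is assumed to be pre-indexed into the three dictionaries described in the docstring. All of that is still your cky.py's job.

from collections import defaultdict

def cky(words, lexical, binary, unary):
    """lexical[word]  -> list of (A, prob) for rules A -> word
       binary[(B, C)] -> list of (A, prob) for rules A -> B C
       unary[B]       -> list of (A, prob) for rules A -> B
       Returns the chart of best probabilities and backpointers."""
    n = len(words)
    best = defaultdict(dict)                  # best[i, j][A] = probability
    back = defaultdict(dict)                  # back[i, j][A] = how we built it

    def unary_closure(i, j):
        # apply A -> B repeatedly until nothing improves (handles unary chains and cycles)
        changed = True
        while changed:
            changed = False
            for B, p in list(best[i, j].items()):
                for A, q in unary.get(B, []):
                    if p * q > best[i, j].get(A, 0.0):
                        best[i, j][A] = p * q
                        back[i, j][A] = ("unary", B)
                        changed = True

    for i, w in enumerate(words):             # width-1 spans
        for A, p in lexical.get(w, []):
            if p > best[i, i + 1].get(A, 0.0):
                best[i, i + 1][A] = p
                back[i, i + 1][A] = ("word", w)
        unary_closure(i, i + 1)

    for width in range(2, n + 1):             # wider spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):         # split point
                for B, pb in best[i, k].items():
                    for C, pc in best[k, j].items():
                        for A, q in binary.get((B, C), []):
                            p = pb * pc * q
                            if p > best[i, j].get(A, 0.0):
                                best[i, j][A] = p
                                back[i, j][A] = ("binary", k, B, C)
            unary_closure(i, j)

    return best, back                         # the best parse has the start symbol TOP in best[0, n]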
CS 539 HW 6: Recurrent Neural Language Models
1 Playing with the NLM (0 pts)
You can use the NLM to evaluate the probability of a given sequence:

>>> NLM.load('large')   # load neural model from file; choose 'base', 'large', or 'huge'
>>> h = NLM()           # initialize a state (and observing <s>)
>>> p = 1
>>> for c in 't h e _ e n d _ </s>'.split():
>>>     print(p, h)     # cumulative prob and current state (and the distribution of the next char)
>>>     p *= h.next_prob(c)   # include prob (c | ...)
>>>     h += c          # observe another character (changing the NLM state internally)

1.000 "<s>": [t: 0.19, i: 0.10, a: 0.09, m: 0.07, s: 0.06, h: 0.06, c: 0.05, b: 0.04, p: 0.04, f: 0.04, r: 0.03, o: 0.03, d: 0.03, w: 0.03, e: 0.03, n: 0.03, l: 0.02, k: 0.02, g: 0.02, j: 0.01]
0.189 "<s> t": [h: 0.82, o: 0.09, r: 0.03, e: 0.02, i: 0.01, w: 0.01]
0.156 "<s> t h": [e: 0.88, i: 0.07, a: 0.02, r: 0.01]
0.138 "<s> t h e": [_: 0.88, y: 0.04, r: 0.03, s: 0.01]
0.121 "<s> t h e _": [s: 0.13, c: 0.09, f: 0.08, t: 0.07, p: 0.06, a: 0.05, m: 0.05, r: 0.05, b: 0.05, n: 0.04, w: 0.04, g: 0.04, l: 0.04, e: 0.04, d: 0.03, i: 0.03, h: 0.03, o: 0.03, u: 0.01, v: 0.01]
0.004 "<s> t h e _ e": [n: 0.24, a: 0.13, x: 0.11, v: 0.08, p: 0.07, r: 0.05, l: 0.05, m: 0.04, s: 0.04, u: 0.03, f: 0.03, i: 0.02, c: 0.02, d: 0.02, g: 0.01, t: 0.01]
0.001 "<s> t h e _ e n": [d: 0.41, g: 0.23, t: 0.18, v: 0.03, c: 0.03, o: 0.02, e: 0.01, r: 0.01]
0.000 "<s> t h e _ e n d": [_: 0.84, </s>: 0.05, i: 0.05, e: 0.02, s: 0.02]

You can also use the NLM to greedily generate (i.e., always choose the most likely next character):

>>> NLM.load("large")
>>> h = NLM()
>>> for i in range(100):
>>>     c, p = max(h.next_prob().items(), key=lambda x: x[1])
>>>     print(c, "%.2f <- p(%s | ... %s)" % (p, c, " ".join(map(str, h.history[-4:]))))
>>>     h += c

You get something like:

t 0.19 <- p(t | ... <s>)
h 0.82 <- p(h | ... <s> t)
e 0.88 <- p(e | ... <s> t h)
_ 0.88 <- p(_ | ... <s> t h e)
s 0.13 <- p(s | ... t h e _)
e 0.18 <- p(e | ... h e _ s)
c 0.31 <- p(c | ... e _ s e)
o 0.88 <- p(o | ... _ s e c)
n 1.00 <- p(n | ... s e c o)
d 1.00 <- p(d | ... e c o n)
_ 0.97 <- p(_ | ... c o n d)
s 0.13 <- p(s | ... o n d _)
e 0.22 <- p(e | ... n d _ s)
a 0.30 <- p(a | ... d _ s e)
s 0.82 <- p(s | ... _ s e a)
o 0.99 <- p(o | ... s e a s)
n 1.00 <- p(n | ... e a s o)
_ 0.88 <- p(_ | ... a s o n)
o 0.21 <- p(o | ... s o n _)
f 0.92 <- p(f | ... o n _ o)
_ 0.98 <- p(_ | ... n _ o f)
t 0.28 <- p(t | ... _ o f _)
h 0.90 <- p(h | ... o f _ t)
e 0.94 <- p(e | ... f _ t h)
_ 0.96 <- p(_ | ... _ t h e)
s 0.12 <- p(s | ... t h e _)
e 0.18 <- p(e | ... h e _ s)
c 0.26 <- p(c | ... e _ s e)
o 0.86 <- p(o | ... _ s e c)
n 1.00 <- p(n | ... s e c o)
d 1.00 <- p(d | ... e c o n)
_ 0.94 <- p(_ | ... c o n d)
s 0.13 <- p(s | ... o n d _)
e 0.21 <- p(e | ... n d _ s)
a 0.29 <- p(a | ... d _ s e)
s 0.81 <- p(s | ... _ s e a)
o 0.99 <- p(o | ... s e a s)
n 1.00 <- p(n | ... e a s o)
_ 0.87 <- p(_ | ... a s o n)
o 0.18 <- p(o | ... s o n _)
f 0.89 <- p(f | ... o n _ o)
_ 0.98 <- p(_ | ... n _ o f)
t 0.28 <- p(t | ... _ o f _)
h 0.90 <- p(h | ... o f _ t)
e 0.94 <- p(e | ... f _ t h)
_ 0.96 <- p(_ | ... _ t h e)
s 0.12 <- p(s | ... t h e _)
e 0.18 <- p(e | ... h e _ s)
c 0.26 <- p(c | ... e _ s e)
o 0.86 <- p(o | ... _ s e c)
n 1.00 <- p(n | ... s e c o)
d 1.00 <- p(d | ... e c o n)
_ 0.94 <- p(_ | ... c o n d)
s 0.13 <- p(s | ... o n d _)
e 0.21 <- p(e | ... n d _ s)
a 0.29 <- p(a | ... d _ s e)
s 0.81 <- p(s | ... _ s e a)
o 0.99 <- p(o | ... s e a s)
n 1.00 <- p(n | ... e a s o)
_ 0.87 <- p(_ | ... a s o n)
o 0.18 <- p(o | ... s o n _)
f 0.88 <- p(f | ... o n _ o)
_ 0.98 <- p(_ | ... n _ o f)
t 0.28 <- p(t | ... _ o f _)

2 Evaluating Entropy (2 pts)
1. Like EX2, please first evaluate the trigram entropies (base and large) on EX2's test.txt:
cat test.txt | sed -e 's/ /_/g;s/\(.\)/\1 /g' | awk '{printf("<s> %s </s>\n", $0)}' \
  | carmel -sribI
Hint: both should be around 2.9.
Q: Is the large version intuitively better than the small version? If so, why? If not, why?
2. Now write a short program to evaluate the entropy of the NLMs on the same test.txt.
Hint: they should be around 2.6 and 2.0, respectively.
Q: Which version (base or large) is better? Why?
3. Is the NLM better than the trigram in terms of entropy? Does that make sense?

3 Random Generation (2 pts)
1. Random generation from n-gram models. Use carmel -GI 10 to stochastically generate character sequences. Show the results. Do these results make sense?
2. Write a short program to randomly generate 10 sentences (from <s> to </s>) from the NLMs. Hint: use random.choices() to draw the random sample from a distribution.
3. Compare the results between the NLMs and the trigrams.

4 Restoring Spaces (4 pts)
1. Recall from EX2 that you can use LMs to recover spaces:
therestcanbeatotalmessandyoucanstillreaditwithoutaproblem
thisisbecausethehumanminddoesnotreadeveryletterbyitselfbutthewordasawhole.
First, redo the trigram experiments using our provided trigrams, and using Carmel. What are the precisions, recalls, and F-scores? (Use eval_space.py.) Hint: the F-scores should be around 61% and 64%, respectively.
2. Then design an algorithm to recover spaces using NLMs. Note: you can't use dynamic programming any more, due to the infinite history that the NLM remembers. You can use beam search instead (a rough sketch appears at the end of this homework). Describe your algorithm in English and pseudocode, and analyze the complexity.
3. Implement it, and report the precisions, recalls, and F-scores. Hint: the F-scores should be around 81% and 94% using a beam size of 20 (and 100% using the 'huge' model).

5 Restoring Vowels (4 pts)
1. Redo trigram vowel recovery and report the accuracy (use eval_vowels.py). Hint: should be around 37-42% and 33-41%.
2. Now design an algorithm to recover vowels using NLMs. Describe your algorithm in English and pseudocode, and analyze the complexity.
3. Implement it, and report the precisions, recalls, and F-scores. Hint: should be around 50-54% and 77-80% using a beam size of 40. If you use the 'huge' NLM, it will be around 93% (it might be slow to run).

6 Cancelled: Extra Credit: Decipherment with Neural LM (5 pts)
Redo HW4 part 5 with NLMs.
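For Section 4.2, one possible organization of the beam search is sketched below, using only the NLM interface shown in Section 1 (NLM(), h.next_prob(c), h += c) and assuming the model has already been loaded with NLM.load(...). It also assumes NLM states can be deep-copied, which the interface does not promise; if they cannot, you can keep per-hypothesis histories and replay them instead. Beam size, tie-breaking, and end-of-sentence handling are my own choices, not the official solution.

import copy

def restore_spaces(chars, beam_size=20):
    """chars: the unspaced input characters, e.g. list('therestcanbe...')."""
    beam = [(1.0, NLM(), [])]                       # (probability, NLM state, output so far)
    for c in chars:
        candidates = []
        for p, h, out in beam:
            options = ([c], ['_', c]) if out else ([c],)   # no leading space
            for tokens in options:                  # continue the word, or insert a space first
                q, g = p, copy.deepcopy(h)          # assumption: NLM states are copyable
                for t in tokens:
                    q *= g.next_prob(t)             # p(t | history), as in Section 1
                    g += t
                candidates.append((q, g, out + tokens))
        candidates.sort(key=lambda x: x[0], reverse=True)
        beam = candidates[:beam_size]               # keep only the best hypotheses
    # score the end-of-sentence token before picking the winner
    p, h, out = max(beam, key=lambda x: x[0] * x[1].next_prob('</s>'))
    return ' '.join(out)

Each input character expands every hypothesis at most twice, so the cost is O(beam_size x input length) NLM steps, plus the state copies. Vowel restoration (Section 5) has the same structure, except that each step proposes the consonant itself optionally preceded by one of the vowels A, E, I, O, U.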



