CS 539-001 EX 2: Language Models and Entropy

1 Training Language Models
1. Train unigram, bigram, and trigram character language models. These models must assign non-zero
probability to any sequence. Store these models in Carmel’s WFSA format. Evaluate your models on
text.txt. Try to improve your models. Your grade will depend in large part on the entropies
of your models on the blind test data. (A minimal, illustrative training sketch appears after the notes below.)
Name your models: unigram.wfsa, bigram.wfsa, trigram.wfsa.
2. What are the corpus probabilities your three models assign to the test data? What are the respective
(cross) entropies? (A small conversion sketch appears after this list.)
(Hint: use
cat | sed -e 's/ /_/g;s/\(.\)/\1 /g' | awk '{printf("<s> %s </s>\n", $0)}' \
| carmel -sribI
for corpus probability. The sed command inserts a space between letters and replaces the original
space by _, i.e., "it is" becomes "i t _ i s".)
3. What are the final sizes of your WFSAs in states and transitions?
(Hint: use carmel -c.)
4. Include a description of your smoothing method. Your description should include the algorithms, how
you used the held-out data, and what (if any) experiments you did before settling on your solution.
5. Include a sketch drawing of your 3-gram model in sufficient detail that someone could replicate it.
Please consider this carefully – examine your drawing after you have drawn it and evaluate whether
someone (not you) could build the same WFSA you have built. We will consider it in the same light.
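For item 2, cross entropy can be reported as bits per character: H = -log2 P(test corpus) / N, where N is the number of predicted tokens. How you convert the probability Carmel reports, and whether N counts the </s> symbols, are conventions you should state; the helper below is only a sketch of the arithmetic under one possible convention (one </s> per sentence).

# entropy.py -- illustrative helper: corpus log-probability -> bits per character.
import math

def cross_entropy_bits(ln_prob, n_tokens):
    """ln_prob:  natural log of the corpus probability (convert whatever form
                 Carmel reports into a natural log first);
       n_tokens: number of predicted tokens, e.g. characters plus one </s> per
                 sentence (this counting convention is an assumption)."""
    return -ln_prob / (n_tokens * math.log(2))

# e.g. cross_entropy_bits(math.log(1e-300), 500) is about 2.0 bits per character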
Important:
• There is a special start symbol <s> and a special end symbol </s> around every sentence, so that we can
model, for example, the fact that the letter x normally does not start a sentence, and the letter z normally
does not end a sentence (we omit punctuation).
Here is an example unigram WFSA on two characters (ab.wfsa):
2
(0 (1 <s> 1))
(1 (1 a 0.8))
(1 (1 b 0.1))
(1 (2 </s> 0.1))
Note that <s> is not a token in the language, while </s> is. That is why the probability of <s> is
always 1, and the probability of </s> is normalized with the other characters. We’ve also provided an
example uni.wfsa.
• To ensure that your WFSA represents a legitimate probability distribution, you should normalize
it with carmel -Hjn wfsa > wfsa.norm (-j for joint prob. model, i.e., WFSA-style instead of the
conditional WFST style). We will do this (before testing your models) in any case. You can try this
on the above sample and it will give you the same WFSA since it’s already normalized.
• To ensure that your WFSA represents a legitimate probability distribution, you should have a single
final state with no exiting transitions. The above sample also demonstrates this.
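The training sketch referred to in item 1 follows. It is purely illustrative, not a reference solution: it assumes add-one smoothing (item 4 asks you to describe whatever method you actually settle on), a hypothetical train.txt with one sentence per line, spaces mapped to "_" as in the corpus-probability hint, and made-up state names. It writes a bigram character model in the same Carmel WFSA layout as ab.wfsa above.

# build_bigram_wfsa.py -- minimal illustrative sketch, NOT a reference solution.
from collections import defaultdict

BOS, EOS = "<s>", "</s>"

def read_sentences(path):
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n").replace(" ", "_")   # treat the space as "_"
            if line:
                yield list(line)                          # characters are the tokens

def train_counts(sentences):
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for chars in sentences:
        prev = BOS
        for c in chars + [EOS]:
            counts[prev][c] += 1
            if c != EOS:
                vocab.add(c)
            prev = c
    return counts, sorted(vocab) + [EOS]

def write_wfsa(counts, symbols, path):
    # One state per one-character history, plus initial state 0 and final state F.
    def hist(h):
        return "h" + ("BOS" if h == BOS else h)
    with open(path, "w") as out:
        out.write("F\n")                                      # first line: the final state
        out.write("(0 ({} {} 1))\n".format(hist(BOS), BOS))   # <s> always has probability 1
        for h in [BOS] + [s for s in symbols if s != EOS]:
            denom = sum(counts[h].values()) + len(symbols)    # add-one denominator
            for sym in symbols:
                p = (counts[h].get(sym, 0) + 1) / denom
                dest = "F" if sym == EOS else hist(sym)
                out.write("({} ({} {} {:.6g}))\n".format(hist(h), dest, sym, p))

if __name__ == "__main__":
    counts, symbols = train_counts(read_sentences("train.txt"))
    write_wfsa(counts, symbols, "bigram.wfsa")

Running carmel -Hjn on the resulting file should leave the weights essentially unchanged, since each state's outgoing probabilities already sum to one.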
2 Using Language Models
Language models are often used for generation, prediction, and decoding in a noisy-channel framework.
1. Random generation from n-gram models.
Use carmel -GI 20 to stochastically generate character sequences. Show the results.
Do these results make sense? (For example, you can do the same on the above sample WFSA, and you
will see the proportion of a:b is indeed about 8:1.)
2. Restoring vowels.
Just as in HW1, we can decode text without vowels. Try doing that experiment on test.txt with the
language models you trained. What’s your command line? What are your accuracy results? Include the
input file test.txt.novowels and the result files test.txt.vowel_restored.{uni,bi,tri}. (Hint:
you can use sed -e 's/[aeiou]//g' to remove vowels.) A sketch of the deletion channel appears after this list.
3. Restoring spaces.
Similarly, we can remove the spaces and try to restore them with the help of language models.
Try doing that experiment on test.txt with the language models you trained. What’s your command
line? What are your accuracy results? Include the input file test.txt.nospaces and the result files
test.txt.space_restored.{uni,bi,tri}. (Hint: you can use sed -e 's/ //g' to remove spaces; the same
deletion-channel sketch after this list applies.)
Finally, decode the following two sentences with your models:
therestcanbeatotalmessandyoucanstillreaditwithoutaproblem
thisisbecausethehumanminddoesnotreadeveryletterbyitselfbutthewordasawhole.
4. Which of the two decoding problems is easier? Write a paragraph of observations you made in these
experiments.
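For the two restoration problems (items 2 and 3), the channel that drops characters can itself be written as a small Carmel WFST and composed with your language model in the usual noisy-channel setup. The sketch below generates such a file; the alphabet, the output file names, and the choice to also delete <s> and </s> (assuming the observed text carries no boundary symbols) are all assumptions of this sketch rather than requirements of the assignment.

# make_deletion_fst.py -- illustrative sketch of the deletion channel.
# Writes a one-state Carmel WFST that copies every character except those in
# `delete`, which it maps to the empty symbol *e*.

ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + ["_"]   # "_" stands for the original space

def write_deletion_fst(path, alphabet, delete):
    with open(path, "w") as out:
        out.write("0\n")                          # single state: both initial and final
        for c in alphabet + ["<s>", "</s>"]:
            out_sym = "*e*" if (c in delete or c in ("<s>", "</s>")) else c
            out.write("(0 (0 {} {} 1))\n".format(c, out_sym))

if __name__ == "__main__":
    write_deletion_fst("delvowels.fst", ALPHABET, set("aeiou"))  # vowel-dropping channel
    write_deletion_fst("delspaces.fst", ALPHABET, {"_"})         # space-dropping channel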