## Description

1. (20 points) In Lecture 3, we looked at the outcomes of rolling two fair dice. For this problem, we will consider two weighted dice, one white and one red. For each die, 1 and 6 are each twice as likely to show as any of the other four values.

a. What is the probability that the total showing on the two dice will be 7?

b. What is the probability that the total showing on the two dice will be 9 or higher?

c. What is the probability that the red die will show a higher number than the white one?
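The setup in Problem 1 can be explored by enumerating all 36 outcomes with exact fractions. The following is a minimal sketch, not a required part of the assignment; it assumes the two dice are independent, and the names `WEIGHTS`, `DIE`, and `prob` are illustrative. The two sample events at the bottom are deliberately not the ones asked about in parts a through c.

```python
from fractions import Fraction
from itertools import product

# Relative weights from the problem statement: faces 1 and 6 are twice
# as likely as each of the faces 2 through 5.
WEIGHTS = {1: 2, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2}
TOTAL = sum(WEIGHTS.values())

# Probability of each face on a single weighted die.
DIE = {face: Fraction(weight, TOTAL) for face, weight in WEIGHTS.items()}


def prob(event):
    """Probability that event(white, red) is true.

    Assumes the white and red dice are independent, so the probability
    of an outcome pair is the product of the two face probabilities.
    """
    return sum(
        DIE[white] * DIE[red]
        for white, red in product(DIE, repeat=2)
        if event(white, red)
    )


# Sample events (not the ones asked for in the problem).
p_total_is_two = prob(lambda white, red: white + red == 2)
p_both_even = prob(lambda white, red: white % 2 == 0 and red % 2 == 0)
print(p_total_is_two, p_both_even)
```

Any of the events in parts a through c can be checked the same way by passing a different predicate to `prob`.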

2. (35 points) The following is the first paragraph of Ernest Hemingway’s The Old Man and the Sea. It has been POS-tagged using the online Brill tagger at the Center for Sprogteknologi at Københavns Universitet. A few minor changes have been applied.

This assignment does not require programming, but if you wish to work with an electronic version of this information, you can refer to the following file:

/opt/dropbox/16-17/473/assignment3/old-man.txt

a. How many bigrams does the sample contain?

b. In a bigram model, we assume that a POS tag depends only on the POS tag of the preceding word. Calculate 𝑃(. | NN), assuming that the counts in the above sample are perfectly representative.

c. We are interested in the probability of the bigram DT JJ in the sample text. What is the value of 𝑃(DT JJ)?

d. A trigram model conditions a POS tag on the POS tags of the preceding bigram (the two preceding words). Calculate 𝑃(NN | DT JJ) for the sample.

e. Assume this sample characterizes a larger corpus. Assume that measured probabilities are independent. Estimate 𝑃(DT JJ | NN) for the corpus. (Hint: this will use Bayes’ Theorem.)

Show your work.
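For anyone who chooses to work with the electronic version, the counting behind Problem 2 can be sketched as below. This is only an illustration under stated assumptions: the format of `old-man.txt` is not described here, so the sketch starts from a plain list of tags, and `toy_tags` is a made-up sequence, not the Hemingway sample. It also assumes bigrams are counted over one continuous tag sequence with no sentence-boundary markers, and that 𝑃(next | prev) is estimated as the count of the bigram divided by the number of bigrams that start with `prev`.

```python
from collections import Counter
from fractions import Fraction


def bigram_counts(tags):
    """Count adjacent tag pairs in one continuous tag sequence."""
    return Counter(zip(tags, tags[1:]))


def joint_prob(bigrams, first, second):
    """Relative frequency of the bigram (first, second) among all bigrams."""
    return Fraction(bigrams[(first, second)], sum(bigrams.values()))


def cond_prob(bigrams, prev, nxt):
    """Estimate P(nxt | prev) as count(prev nxt) / count(prev followed by anything)."""
    starts_with_prev = sum(
        count for (first, _), count in bigrams.items() if first == prev
    )
    if starts_with_prev == 0:
        return None  # prev never appears as the first tag of a bigram
    return Fraction(bigrams[(prev, nxt)], starts_with_prev)


# Hypothetical toy sequence, NOT the assignment's sample text.
toy_tags = ["DT", "JJ", "NN", "VBD", "DT", "NN", "."]
toy_bigrams = bigram_counts(toy_tags)

print(sum(toy_bigrams.values()))            # number of bigram tokens
print(joint_prob(toy_bigrams, "DT", "JJ"))  # P(DT JJ) in the toy data
print(cond_prob(toy_bigrams, "NN", "."))    # P(. | NN) in the toy data
```

A sequence of n tags yields n - 1 bigram tokens under this convention; if boundary markers are added, the counts change accordingly.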

3. (15 points) For phonetic elicitation with a group of American test subjects, we are using three word lists:

A = { gnat, beet }

B = { loon, fee }

C = { peel, pool, he, sand }

The test protocol is as follows: One of the lists is selected at random. Then, the subject is asked to pronounce a randomly selected word from that list. What is the probability that the word will have a high/close vowel (as opposed to low/open)? If you are not familiar with vowel phonetics, you can check the Lecture 5 recording, or listen to samples on http://en.wikipedia.org/wiki/Vowel.
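The two-stage protocol in Problem 3 is an application of the law of total probability: average the within-list proportion over the equally likely lists. The sketch below assumes "at random" means uniform at both stages. The function name `total_probability` is illustrative, and the demonstration uses a spelling property (the word contains "oo") as a stand-in; the phonetic classification of each vowel is left to the reader.

```python
from fractions import Fraction

# Word lists from the problem statement.
WORD_LISTS = {
    "A": ["gnat", "beet"],
    "B": ["loon", "fee"],
    "C": ["peel", "pool", "he", "sand"],
}


def total_probability(word_lists, has_property):
    """P(selected word has the property) under the two-stage protocol.

    Assumes each list is chosen with equal probability, and then each
    word within the chosen list is chosen with equal probability.
    """
    p_list = Fraction(1, len(word_lists))
    total = Fraction(0)
    for words in word_lists.values():
        matching = sum(1 for word in words if has_property(word))
        total += p_list * Fraction(matching, len(words))
    return total


# Stand-in property for illustration only (spelling, not phonetics).
p_contains_oo = total_probability(WORD_LISTS, lambda word: "oo" in word)
print(p_contains_oo)
```

Because the lists have different lengths, this is generally not the same as the proportion of matching words in the pooled set of eight.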

4. (30 points) A classifier has partitioned a set of eight biomedical documents into

𝐶 = { mentions the IL-2R α-promoter } (6 documents), and

𝐶̅ (the rest).

The gold standard indicates that only three documents actually mention the Interleukin-2 receptor alpha promoter, and we determine that exactly one of them is (incorrectly) in 𝐶̅. In testing a post-processing heuristic, we select a document at random from 𝐶 and move it into the class 𝐶̅. Next, we randomly select a document from 𝐶̅.

a. What is the probability that the document we selected from 𝐶̅ mentions the IL-2R α-promoter (according to the gold standard)?

b. Next, we note that the document we selected from 𝐶̅ does, in fact (according to the gold standard), mention the IL-2R α-promoter. Given this additional information, what is the probability that the document that we transferred from 𝐶 to 𝐶̅ mentioned (according to the gold standard) the IL-2R α-promoter (i.e., that we moved it to the wrong class)?
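Problem 4 has the shape of a transfer-then-draw experiment: part a uses the law of total probability over what was transferred, and part b inverts that with Bayes' Theorem. The following generic sketch shows that structure under the assumption that both selections are uniform. The function `transfer_then_draw` and its parameter names are illustrative, and the numbers in the example are hypothetical, not the document counts from the problem.

```python
from fractions import Fraction


def transfer_then_draw(src_pos, src_neg, dst_pos, dst_neg):
    """Move one random item from a source set to a destination set, then
    draw one random item from the destination.

    Returns a pair:
      P(drawn item is positive),
      P(transferred item was positive | drawn item is positive).
    Assumes both selections are uniform over the items available.
    """
    p_move_pos = Fraction(src_pos, src_pos + src_neg)
    p_move_neg = 1 - p_move_pos

    dst_size = dst_pos + dst_neg + 1  # destination grows by the moved item
    p_draw_pos_if_moved_pos = Fraction(dst_pos + 1, dst_size)
    p_draw_pos_if_moved_neg = Fraction(dst_pos, dst_size)

    # Law of total probability over the two transfer outcomes.
    p_draw_pos = (
        p_move_pos * p_draw_pos_if_moved_pos
        + p_move_neg * p_draw_pos_if_moved_neg
    )
    if p_draw_pos == 0:
        return p_draw_pos, None  # the conditioning event cannot occur

    # Bayes' Theorem: P(moved pos | drew pos).
    posterior = p_move_pos * p_draw_pos_if_moved_pos / p_draw_pos
    return p_draw_pos, posterior


# Hypothetical counts for illustration (not the assignment's numbers):
# source has 1 positive and 3 negatives; destination has 2 of each.
p_draw, p_moved_given_draw = transfer_then_draw(1, 3, 2, 2)
print(p_draw, p_moved_given_draw)
```

To apply this to the problem, the four counts would first have to be worked out from the description of 𝐶, 𝐶̅, and the gold standard.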