Description
B. Problems:
1. NFSA to Regular Expression (20 points)
a. (10 points) Write a regular expression for the language accepted by the FSA:
b. (10 points) Write a regular expression for the language accepted by the NFSA:
2. Bigram Probabilities (40 points):
Write a computer program to compute the bigram model (counts and probabilities) on
the given corpus (HW2_F17_NLP6320-NLPCorpusTreebank2Parts-CorpusA.txt
provided as Addendum to this homework on eLearning) under the following three (3)
scenarios:
i. No Smoothing
ii. Add-one Smoothing
iii. Good-Turing Discounting based Smoothing
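For reference, the standard textbook formulations of the three scenarios (where V is the vocabulary size and N_c is the number of bigram types occurring exactly c times) are:

```latex
P_{\mathrm{MLE}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})}
\qquad
P_{\mathrm{Add\text{-}1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i) + 1}{C(w_{i-1}) + V}
\qquad
c^{*} = (c+1)\,\frac{N_{c+1}}{N_c}
```

Good-Turing discounting replaces each raw bigram count c with the adjusted count c* before computing probabilities.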
Note:
1. Use the “ . ” string sequence in the corpus to break it into sentences.
2. Each sentence should be tokenized into words and the bigrams computed
ONLY within a sentence.
3. Please use whitespace (i.e. space, tab, and newline) to tokenize each sentence
into the words/tokens required for the bigram model.
4. Do NOT perform any type of word/token normalization (e.g., stemming,
lemmatizing, lowercasing).
5. Creation and matching of bigrams should be exact and case-sensitive.
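The counting step described in the notes above can be sketched as follows, assuming the corpus has been read into a single string (the hypothetical `bigram_counts` and `bigram_prob` helpers are illustrative names, not part of the assignment):

```python
# Sketch: split the corpus into sentences on " . ", tokenize on whitespace,
# and count case-sensitive unigrams/bigrams ONLY within a sentence.
from collections import Counter

def bigram_counts(text):
    unigrams, bigrams = Counter(), Counter()
    for sentence in text.split(" . "):          # sentence boundary per Note 1
        tokens = sentence.split()               # whitespace tokenization, no normalization
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:])) # bigrams never cross sentence boundaries
    return unigrams, bigrams

def bigram_prob(w1, w2, unigrams, bigrams):
    # Unsmoothed MLE: P(w2 | w1) = C(w1 w2) / C(w1)
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
```

Matching is exact and case-sensitive by construction, since `Counter` keys are the raw token strings.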
Input Sentence: The Fed chairman warned that the board 's decision is bad
Given the bigram model (for each of the three (3) scenarios) computed by your
computer program, hand compute the total probability for the above input sentence.
Please provide all the required computation details.
Note: Do NOT include the unigram probability P(“The”) in the total probability
computation for the above input sentence.
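The hand computation is a product of bigram probabilities over the sentence, excluding the leading unigram P("The") as noted above. A minimal sketch for the add-one case, assuming `unigrams` and `bigrams` are `Counter` objects from a prior counting pass:

```python
# Sketch: total probability of a tokenized sentence under add-one smoothing,
# as the product of P(w_i | w_{i-1}) over consecutive token pairs.
# The leading unigram probability P(w1) is deliberately excluded.
from collections import Counter
from functools import reduce

def sentence_prob_addone(tokens, unigrams, bigrams):
    V = len(unigrams)  # vocabulary size for the add-one denominator
    probs = [(bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
             for w1, w2 in zip(tokens, tokens[1:])]
    return reduce(lambda a, b: a * b, probs, 1.0)
```

The unsmoothed and Good-Turing variants follow the same product structure, with the raw MLE ratio or Good-Turing adjusted counts substituted for the add-one terms.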
3. Transformation Based POS Tagging (40 points)
For this question, you have been given a POS-tagged training file,
HW2_F17_NLP6320_POSTaggedTrainingSet.txt (provided as Addendum to this
homework on eLearning), that has been tagged with POS tags from the Penn
Treebank POS tagset (Figure 1).
Figure 1. Penn Treebank POS tagset
Use the POS tagged file to perform:
a. Transformation-based POS Tagging: Implement Brill’s transformation-based POS
tagging algorithm using ONLY the previous word’s tag to extract the best
transformation rule to:
i. Transform “NN” to “JJ”
ii. Transform “NN” to “VB”
Using the learnt rules, fill out the missing POS tags (for the words “standard” and
“work”) in the following sentence:
The_DT standard_?? Turbo_NN engine_NN is_VBZ hard_JJ to_TO work_??
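With only the previous-tag template, rule learning reduces to scoring each candidate trigger tag by corrections minus new errors. A sketch under the assumption that `tagged` is a list of (word, current_tag, gold_tag) triples, where current tags come from an initial annotator (e.g., most-frequent-tag); the helper name is illustrative:

```python
# Sketch of Brill-style rule scoring restricted to the previous-tag template:
# find the trigger tag T maximizing (errors fixed - errors introduced) for
# the rule "change from_tag to to_tag when the previous tag is T".
from collections import Counter

def best_prev_tag_rule(tagged, from_tag="NN", to_tag="JJ"):
    score = Counter()
    for i in range(1, len(tagged)):
        prev_tag = tagged[i - 1][1]
        _, cur, gold = tagged[i]
        if cur == from_tag and gold == to_tag:
            score[prev_tag] += 1      # applying the rule here fixes an error
        elif cur == from_tag and gold == from_tag:
            score[prev_tag] -= 1      # applying the rule here introduces an error
    trigger, gain = score.most_common(1)[0]
    return trigger, gain
```

Running this once with to_tag="JJ" and once with to_tag="VB" yields the two requested rules, which can then be applied to the partially tagged sentence.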
b. Naïve Bayesian Classification (Bigram) based POS Tagging:
Using the given corpus, write a computer program to compute the bigram models
(counts and probabilities) required by the above Naïve Bayesian Classification
formula.
Using the created bigram models, hand compute the missing POS tags (for the words
“standard” and “work”) in the following sentence:
The_DT standard_?? Turbo_NN engine_NN is_VBZ hard_JJ to_TO work_??
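The Naïve Bayesian formula referenced above does not appear in this text; assuming it is the standard bigram tagging rule t_i = argmax_t P(t | t_{i-1}) · P(w_i | t), the required bigram models and the prediction step can be sketched as follows (`train` and `predict_tag` are illustrative names), with `corpus` a list of (word, tag) pairs from the tagged training file:

```python
# Sketch: bigram Naive Bayes POS tagging. Counts the tag-transition and
# word-emission statistics, then scores each candidate tag for a word
# given the previous tag:  score(t) = P(t | prev_tag) * P(word | t).
from collections import Counter

def train(corpus):
    tags = [t for _, t in corpus]
    tag_counts = Counter(tags)
    trans = Counter(zip(tags, tags[1:]))  # (prev_tag, tag) transition counts
    emit = Counter(corpus)                # (word, tag) emission counts
    return tag_counts, trans, emit

def predict_tag(word, prev_tag, tag_counts, trans, emit, candidates):
    def score(t):
        p_trans = trans[(prev_tag, t)] / tag_counts[prev_tag]
        p_emit = emit[(word, t)] / tag_counts[t]
        return p_trans * p_emit
    return max(candidates, key=score)
```

For the given sentence, the previous tags are DT (for "standard") and TO (for "work"), and the candidate sets would be the tags observed for each word in the training file.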