COMP 550 ASSIGNMENTS 1 to 4 solutions


ASSIGNMENT 1 COMP 550

Question 1: Identify the Ambiguity (30 points)
Analyze the following passages by identifying the most salient instances of linguistic ambiguity that they
exhibit. Write a short paragraph for each that answers the following questions. What is the ambiguity,
and what are the different possible interpretations? What in the passage specifically causes this ambiguity? What domain does this ambiguity operate over (phonological, lexical, syntactic, orthographic,
etc.)? What sort of knowledge is needed for a natural language understanding system to disambiguate
the passage, whether the system is human or machine? Be more specific than simply saying “contextual
knowledge.”
1. Every student read a book.
2. The lion is a majestic animal.
3. Use of this sidewalk is prohibited by police officers.
4. My English teacher recently recovered from a bowel cancer operation… and he tried to show me a
semi colon. (Source: The 2016 UK Pun Championship)
5. She is my ex-mother-in-law-to-be.
Question 2: FST for German Verbal Conjugation (20 points)
Develop an FST to perform morphological analysis for the following German verbal conjugation table,
which shows verbs conjugated in the present tense:
Infinitive           1 Sg      2 Sg        3 Sg       2 Pl      3 Pl / 1 Pl
Regular verbs
spielen (to play)    spiele    spielst     spielt     spielt    spielen
warten (to wait)     warte     wartest     wartet     wartet    warten
gehen (to go)        gehe      gehst       geht       geht      gehen
arbeiten (to work)   arbeite   arbeitest   arbeitet   arbeitet  arbeiten
Verbs with a stem change
sprechen (to speak)  spreche   sprichst    spricht    sprecht   sprechen
backen (to bake)     backe     bäckst      bäckt      backt     backen
Irregular verbs
sein (to be)         bin       bist        ist        seid      sind
haben (to have)      habe      hast        hat        habt      haben
(The 1 Pl and 3 Pl forms are identical in every row, so they share the last column.)
The morphological analyzer should provide the infinitive form of the verb, which we will take to be its
lemma, along with its POS, person and number agreement. For example, feeding “habe#” as input to
the final FST should result in the output “haben +V +1 +Sg”.
Your response should include three components:
• A schematic transducer in the style of Figure 3.13 in J&M (page 61)
• A lexicon table as in the top half of Figure 3.14 in J&M (page 62)
• A “fleshed-out” FST in the format of the bottom half of Figure 3.14 for the lexical items presented
above
Question 3: Sentiment Analysis (50 points)
In this question, you will train a simple classifier that classifies a sentence into either a positive or
negative sentiment. These sentences come from a movie review dataset constructed by the authors of
this paper:
Bo Pang and Lillian Lee, Seeing stars: Exploiting class relationships for sentiment categorization with
respect to rating scales, Proceedings of ACL 2005.
The goal of this question is to give you experience in using existing tools for machine learning and natural
language processing to solve a classification task. Before you attempt this question, you will need to
install Python 2 on the machine you plan to work on, as well as the following Python packages and their
dependencies:
• NLTK: https://www.nltk.org/
• NumPy: https://www.numpy.org/
• scikit-learn: https://scikit-learn.org/stable/
Download the corpus of text available on the course website. This corpus is a collection of movie review
sentences that are separated into positive and negative polarity. Your task is to train a sentence classifier
to distinguish them.
Data storage and format
The raw text files are stored in rt-polarity.neg for the negative cases, and rt-polarity.pos for the positive
cases.
Preprocessing and feature extraction
Preprocess the input documents to extract feature vector representations of them. Your features should
be N-gram counts, for N ≤ 2. You may also use scikit-learn’s feature extraction module. You should
experiment with the complexity of the N-gram features (i.e., unigrams, or unigrams and bigrams), and
whether to remove stop words. NLTK contains a list of stop words in English. Also, remove infrequently
occurring words and bigrams as features. You may tune the threshold at which to remove infrequent
words and bigrams. You can also experiment with the amount of smoothing/regularization in training
the models to achieve better results. Read scikit-learn’s documentation for more information on how to
do this.
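For illustration, here is a minimal sketch of one way to extract such features with scikit-learn's CountVectorizer. The ngram_range and min_df values below are placeholders to tune, not prescribed settings, and you may need to handle the files' character encoding:

# Sketch: n-gram count features for the two polarity files.
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# One sentence per line, as described above.
with open('rt-polarity.pos') as f:
    pos_sents = f.read().splitlines()
with open('rt-polarity.neg') as f:
    neg_sents = f.read().splitlines()
sentences = pos_sents + neg_sents
labels = [1] * len(pos_sents) + [0] * len(neg_sents)

# Unigrams and bigrams; drop features seen in fewer than 2 sentences.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2,
                             stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(sentences)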
Setting up the experiments
Design and implement an experiment that correctly compares the model variants, so that you can draw
reasonable conclusions about which model is the best for generalizing to similar unseen data. Compare
the logistic regression, support vector machine (with a linear kernel), and Naive Bayes algorithms. Also,
compare against the expected performance of a random baseline, which just guesses positive or negative
with equal probability.
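A minimal sketch of such a comparison, reusing X and labels from the feature-extraction sketch above. The hyperparameters here are scikit-learn defaults; you would tune them on a validation split, never on the held-out test set:

# Sketch: compare the classifiers on a held-out test split.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0)

models = {
    'logistic regression': LogisticRegression(),
    'linear SVM': LinearSVC(),
    'naive Bayes': MultinomialNB(),
    'random baseline': DummyClassifier(strategy='uniform', random_state=0),
}
for name, model in models.items():
    acc = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
    print('{}: {:.3f}'.format(name, acc))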
Report
Write a short report on your method and results, carefully documenting i) the problem setup, ii) your
experimental procedure, iii) the range of parameter settings that you tried, and iv) the results and
conclusions. It should be no more than one page long. Report on the performance in terms of accuracy,
and speculate on the successes and failures of the models. Which machine learning classifier produced
the best performance? For the overall best performing model, include a confusion matrix as a form of
error analysis.
What To Submit
Submit your solutions to Questions 1 and 2, as well as the report part of Question 3, as a pdf. For the
programming part of Question 3, you should submit one zip file with your source code. All work should be
submitted to MyCourses under the Assignment 1 folder.

ASSIGNMENT 2 COMP 550

Question 1: POS Tagging (20 points)
Train a first-order (i.e., the probability of a tag depends only on the previous tag) HMM part-of-speech
tagger by hand on the following training corpus. Find the MAP estimate of the parameters of the model
using add-1 smoothing.
That/C that/N is/V , is/V .
That/C that/N is/V not/N , is/V not/N .
Is/V that/N it/N ?
Is/V it/N that/N good/J ?
Ignore capitalization differences and punctuation. They are only there for readability purposes. There
should thus be a tagset of size 4, and a lexicon of size 5. Your model should contain the following parameters: Π (initial state probabilities), A (state transition probabilities), and B (emission probabilities).
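As an illustration of the kind of computation required (a single parameter, not the full solution): with a tagset of size 4, the add-1 estimate of the transition probability P(N | C) is

A[C][N] = (Count(C, N) + 1) / (Count(C) + 4) = (2 + 1) / (2 + 4) = 1/2

since C is followed by N twice in the training corpus, and C occurs twice as a preceding tag.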
Next, run the Viterbi algorithm (by hand) to obtain a POS tagging for the following test sentence. Show
the trellis cell values and your calculations.
Not that good .
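For the trellis, recall the standard Viterbi recurrence: δ1(j) = Π(j) · B[j][w1] for the first word, and δt(j) = max_i δt−1(i) · A[i][j] · B[j][wt] thereafter, with a backpointer recording the maximizing i at each cell.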
Question 2: Grammar for French (40 points)
In this question, you will develop a context-free grammar for a fragment of French. Your grammar must
account for various aspects of the French language, as listed below.
Basic sentence word order in the present
The basic word order in French is Subject-Verb-Object, as in English:
(1) Je regarde la télévision.
I watch the television
(2) Le chat mange le poisson.
The cat eats the fish
Subject-verb agreement
Just as in English, the subject must agree with the verb in number and person:
(3) Tu regardes la télévision.
You(2Sg) watch the television
(4) Il regarde la télévision.
He watches the television
(5) Nous regardons la télévision.
We watch the television
(6) Vous regardez la télévision.
You(2Pl) watch the television
(7) Ils regardent la télévision.
They(Masc.) watch the television
Look up the list of subject pronouns in French, as well as the verb conjugation paradigm for several
common verbs using an online website. Include these in your grammar.
Reference: https://www.wordreference.com/conj/FrVerbs.aspx
Definite noun phrases and proper names
A definite noun phrase in French follows a similar order as in English (article + noun). However,
the article must agree with the noun in number and grammatical gender. Grammatical gender is a
more-or-less arbitrary categorization of nouns into either masculine or feminine.
Examples:
(8) Le chat
the(Masc.) cat
(9) La télévision
the(Fem.) television
(10) Les chats
the(Pl.) cats
(11) Les télévisions
the(Pl) televisions
As you can see, there is no distinction in the plural between masculine or feminine.
Some proper names in French do not take articles, just as in English:
(12) Jackie
Jackie
(13) Montréal
Montreal
Others do (e.g., le Canada), but you do not have to handle them.
Direct object pronouns
When a pronoun is the direct object of the verb, it precedes the verb:
(14) Il la regarde.
He it(Fem.) watches.
Look up the list of direct object pronouns in French, and enhance your grammar to account for the word
order with direct objects.
Attributive adjectives
Adjectives typically follow the noun that they modify in a noun phrase:
(15) Le chat noir
the(Masc.) cat black
(16) Le chat heureux
the(Masc.) cat happy
However, other adjectives precede the noun:
(17) Le beau chat
the(Masc.) beautiful cat
(18) Le joli chat
the(Masc.) pretty cat
Yet others may precede OR follow the noun, though the meaning usually changes slightly:
(19) La dernière semaine
the(Fem.) last week
the last week (e.g., of the year)
(20) La semaine dernière
the(Fem.) week last
last week (i.e., the one before this week)
In addition, adjectives must agree with the noun that they modify in number and gender:
(21) Les chats noirs
the(Pl.) cats black(Pl.)
the black cats
(22) La télévision noire
the(Fem.) television black(Fem.)
the black television
(23) Les télévisions noires
the(Pl.) televisions black(Fem. Pl.)
the black televisions
Note that adjectives do distinguish masculine from feminine in the plural.
Find several adjectives of each of the three classes above, and incorporate them into your grammar.
References
https://french.about.com/od/grammar/a/adjectives.htm
https://french.about.com/od/grammar/a/adjectives_4.htm
Examples and submission format
You already have many examples that your grammar should accept (though many of the above examples
were only noun phrases, not full sentences). Here are some sentences that your grammar should reject:
(24) *Je mangent le poisson.
(25) *Les noirs chats mangent le poisson.
(26) *La poisson mangent les chats.
(27) *Je mange les.
Use the following nonterminals to indicate grammatical categories:
S sentence/clause
NP noun phrase
VP verb phrase
N noun
PN proper noun
PR pronoun
V verb
DT determiner
A adjective
You may add further non-terminal categories or subdivide them (e.g., V-1Sing) as needed. Don’t forget
the lexical rules! Include enough lexical items such that each of the syntactic categories can be expressed
in at least three different ways.
Write your grammar in a text editor using a predictable, computer-readable format. For instance, here
is one possible rule:
S -> NP VP
Here is another example of a set of four rules (here, they are lexical rules):
V-1Sg -> mange | aime | regarde | cherche
These are just examples, and are not necessarily the rules you want in your grammar! Ignore punctuation
and capitalization in your grammar (just use all lower-case, except for proper names). French has
contractions in many cases where a word begins with a vowel (e.g., j’aime rather than *je aime). You
may ignore such issues.
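If you would like to sanity-check your grammar mechanically (optional; the question does not require it), one possibility is to load it with NLTK. Note that this assumes NLTK's CFG syntax, in which terminals are quoted (e.g., V-1Sg -> 'mange' | 'aime'), which differs slightly from the bare format shown above; accents are omitted in the example sentences for simplicity:

# Sketch: check which sentences a CFG accepts, using NLTK (optional).
import nltk

with open('french-grammar.txt') as f:
    grammar = nltk.CFG.fromstring(f.read())
parser = nltk.ChartParser(grammar)

for sentence in ['je regarde la television', 'je mange les']:
    tokens = sentence.split()
    try:
        trees = list(parser.parse(tokens))
    except ValueError:  # a token is missing from the grammar's lexicon
        trees = []
    print(sentence, '->', 'ACCEPT' if trees else 'REJECT')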
Submit your grammar as a plaintext .txt file. Show instances where your grammar correctly accepts and
rejects some sentence. In addition, answer the following questions in your response to the question:
1. What are some advantages of modelling French grammar with a CFG, compared to using an FSA?
2. What are some disadvantages of modelling French grammar with a CFG?
3. What are some aspects of French grammar that your CFG does not handle?
This question is rather open-ended; your grammar will be judged on the following points:
• Whether you followed the specifications above (e.g. names of non-terminals, minimum number of
lexical entries)
• Coverage of the required grammatical constructions
• Clarity of the grammar
• The responses to the questions above
You won’t get extra points for having many additional lexical items that exhibit the same type of
behaviour!
Question 3: Decipherment with HMMs (40 points)
We have intercepted coded communications from our enemies, the League Against Natural Language
Processing. We have obtained short samples containing both ciphertext (i.e., encrypted text) and its
associated plaintext (i.e., plain English). Being good experimentalists, we have separated the data into
a training set and a test set, so that we know that we will be able to decrypt future coded messages
from this despicable organization.
Actually, we have intercepted three different ciphers of the sample text:
1. The first cipher is quite archaic, first used by Julius Caesar. Each letter in the plain text is shifted
by a fixed amount to produce the cipher text.
2. The second cipher is a more complex cipher, in which there are two letter substitution ciphers. When
encrypting each letter, one of the two is randomly chosen, and that cipher is used to generate the
ciphertext letter.
3. The third cipher was invented by Gilbert Vernam! The cipher text is produced by adding the integer
values of a key and the plain text, modulo 29, so that each sum maps back into the alphabet. We also
know that the key is produced by shifting characters in the plain text by 3 places from right to left. For
example, if you need to encrypt the plain text ‘nlp is awesome.’, the key you will use is ‘ is awesome.nlp’.
To generate the cipher, you add the integer values of the two strings character by character. The integer
values for the characters a-z are 0-25, and for ‘ ’, ‘,’ and ‘.’ they are 26, 27 and 28 respectively. Thus the
cipher text will be ‘ktexilbshqwnzpo’.
Note: The three cipher texts are available on the course website
The plaintext and ciphertext alphabets are both the 26 letters of the alphabet (lowercase), plus the
space, comma, and period, for a total of 29 symbols.
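To make the third cipher concrete, here is a short sketch (our illustration, not provided code) that reproduces the worked example above:

# Sketch: the Vernam-style cipher described above, over the 29-symbol alphabet.
ALPHABET = 'abcdefghijklmnopqrstuvwxyz ,.'  # indices 0-25, then 26, 27, 28

def encrypt(plaintext):
    # The key is the plaintext shifted 3 places from right to left.
    key = plaintext[3:] + plaintext[:3]
    return ''.join(ALPHABET[(ALPHABET.index(p) + ALPHABET.index(k)) % 29]
                   for p, k in zip(plaintext, key))

print(encrypt('nlp is awesome.'))  # prints: ktexilbshqwnzpo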
In this question, we will explore several HMM models for solving these ciphers. Each HMM sample will
be a sentence, and each time step within a sequence will be one character. The hidden layer will consist
of the plaintext characters, while the observed layer will consist of ciphertext characters.
Standard HMM
Implement a system which trains a standard HMM on the training set using MLE, and tests on the
testing data. Use NLTK’s nltk.tag.hmm. Report per-token accuracy for each cipher. Print out the
results of decoding on the test set. What do you observe about the performance of the model?
Your code should run in the following way:
python decipher.py
It should print out the deciphered test set, and report the accuracy score. Since training should not take
that much time, you do not need to save the model output.
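A sketch of the core training and decoding calls, assuming train_seqs (our placeholder name) already holds the training data in NLTK's format, one list of (ciphertext character, plaintext character) pairs per sentence:

# Sketch: supervised HMM training and decoding with NLTK.
from nltk.tag import hmm

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_seqs)

# Decode one test sentence, given as a list of ciphertext characters.
plaintext = tagger.best_path(list('ktexilbshqwnzpo'))
print(''.join(plaintext))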
Laplace smoothing
Let’s see if smoothing will help improve performance. Modify your code and add an option that implements Laplace smoothing during training. The simplest method for doing so will likely be to modify the
HiddenMarkovModelTagger object that you obtained from training in the previous step. Consult NLTK’s
API in order to figure out how to do this. You may find the page on nltk.probability useful as well.
It should be possible to turn Laplace smoothing on at the command line in the following way:
python decipher.py -laplace
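One way to implement this (an alternative to patching the trained tagger object directly) is to pass a Laplace estimator into training via the trainer's estimator hook, using nltk.probability.LaplaceProbDist:

# Sketch: Laplace smoothing through NLTK's estimator argument.
from nltk.tag import hmm
from nltk.probability import LaplaceProbDist

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(
    train_seqs, estimator=lambda fd, bins: LaplaceProbDist(fd, bins))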
Improved plaintext modelling
The training set that we have in this question is very small. Maybe we can further improve performance
by having a better model of character bigram transitions in English. Change your training procedure to
incorporate this information by preprocessing and getting character transition counts from the samples
of English you have from Assignment 1. These counts should supplement, not replace, the counts that
you get from the original training set. You will have to deal with the following issues:
1. Sentence segmentation
2. Lower-casing the text
3. Removing any other character which is not one of the 29 symbols
4. Removing extra space from the beginning and end of each sentence
How does this change affect the results?
It should be possible to turn this option on at the command line in the following way:
python decipher.py -lm
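A sketch of the four preprocessing steps listed above, assuming the A1 corpus is available as one raw text file (the file name is a placeholder):

# Sketch: preprocess raw English text into the 29-symbol alphabet.
import nltk  # requires the 'punkt' tokenizer models

ALPHABET = set('abcdefghijklmnopqrstuvwxyz ,.')

with open('a1-corpus.txt') as f:  # placeholder file name
    raw = f.read()

cleaned = []
for sent in nltk.sent_tokenize(raw):                  # 1. sentence segmentation
    sent = sent.lower()                               # 2. lower-casing
    sent = ''.join(c for c in sent if c in ALPHABET)  # 3. keep only the 29 symbols
    sent = sent.strip()                               # 4. trim leading/trailing space
    if sent:
        cleaned.append(sent)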
For this step, you should save and submit either the preprocessed A1 corpus or your counts thereof,
so that the TA can reproduce your results. In addition, it should be possible to turn on both Laplace
smoothing and the improved language modelling.
Report and submission requirements
Experiment on the three ciphers, reporting accuracy for each of the settings in a table. Write a brief (max.
half-page) report on the results, noting whether each change was successful in improving performance.
Were there any surprises or unexpected results? Do you achieve close to perfect accuracy? If not, why
not? Try to explain and speculate on the reasons for these results.
What To Submit
Your submission must be made through MyCourses, and must consist of the following three files:
1. The written portion of Questions 1, 2 (except the grammar), and 3 as a .pdf file called ‘a2-written.pdf’.
2. The grammar file of Question 2, as a plaintext .txt file called ‘french-grammar.txt’.
3. For the programming part of Question 3, you should submit your source code and any other files
necessary to regenerate your results as described above as a single .zip file called ‘a2-q3.zip’.

ASSIGNMENT 3 COMP 550

Question 1: Lambda Calculus and Compositional Semantics (30 points)
a) Simplify the following lambda calculus expressions by applying beta-reduction as much as possible.
Show each step of the derivation.
• (λx.x x) (λy. y x) z
• (λuvw.wvu) aa (λpq.q)
• ((λv.v v) (λu.u)) ((λv.v)(λv.w))
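As a reminder of the mechanics (this is not one of the assigned expressions): (λx.λy.x) a b → (λy.a) b → a, where each beta-reduction step substitutes the argument for the bound variable in the function body.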
b) Augment the grammar given in Lecture 14 to account for the quantifier no, including the corresponding
lexical rule and semantic attachment. Show a derivation of the sentence No student hates COMP-550.
(You’ll also need to add an entry for the verb hates.) Use explicit event variables, and be sure to show
the intermediate lambda expressions at each node of the parse tree.
c) Show how to construct an underspecified representation of the sentence No student wants an exam
using the Cooper storage scheme presented in class. Show how to recover both interpretations of the
sentence (and explain what the interpretations are).
Question 2: Lesk’s Algorithm (40 points)
Implement and apply Lesk’s algorithm to the publicly available data set of SemEval 2013 Shared Task
#12 (Navigli and Jurgens, 2013), using NLTK’s interface to WordNet v3.0 as your lexical resource.
(Be sure you are using WordNet v3.0!) The relevant files are available on the course website. Starter
code is also provided to help you load the data. More information on the data set can be found at
https://www.cs.york.ac.uk/semeval-2013/task12/.
The provided code will load all of the cases that you are to resolve, along with their sentential context.
Apply word tokenization and lemmatization (you have code to do this from A1) as necessary, and remove
stop words.
As a first step, compare the following two methods for WSD:
1. The most frequent sense baseline: this is the sense listed first (i.e., sense #1) for the word in
WordNet
2. NLTK’s implementation of Lesk’s algorithm (nltk.wsd.lesk)
Use accuracy as the evaluation measure. There is sometimes more than one correct sense annotated in
the key. If that is the case, you may consider an automatic system correct if it resolves the word to any
one of those senses. What do you observe about the results?
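A sketch of the two baseline predictions for a single instance; lemma and context_tokens are our placeholder names, not names from the starter code:

# Sketch: the two baseline WSD methods for one target word.
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

def most_frequent_sense(lemma):
    # WordNet orders a word's synsets by sense frequency; take the first.
    synsets = wn.synsets(lemma)
    return synsets[0] if synsets else None

def nltk_lesk_sense(context_tokens, lemma):
    return lesk(context_tokens, lemma)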
Next, develop a third method that combines distributional information about the frequency of word
senses, and the standard Lesk’s algorithm. Make and justify decisions about any other parameters to
the algorithm, such as what exactly to include in the sense and context representations, how to compute
overlap, and how to trade off the distributional and the Lesk signal, with the use of the development
set, which the starter code will load for you. You may use any heuristic, probabilistic model, or other
statistical method that we have discussed in class in order to combine these two sources of information.
It is beyond the scope of the assignment to use external corpora or lexical resources (e.g., thesauri,
or WordNet in other languages, etc.), so do not do so. Given these constraints, feel free to use your
creativity to find ways to improve performance!
Some issues and points to watch out for:
• The gold standard key presents solutions using lemma sense keys, which are distinct from the
synset numbers that we have seen in class. You will need to convert between them to perform the
evaluation (a conversion sketch follows this list). This webpage explains what lemma sense keys
are: https://wordnet.princeton.edu/man/senseidx.5WN.html
• The data set contains multi-word phrases, which should be resolved as one entity (e.g., latin america).
Make sure that you are converting between underscores and spaces correctly, and check that you
are dealing with upper- vs lower-case appropriately.
• We are using instances with id beginning with d001 as the dev set, and the remaining cases as the
test set, for simplicity. This is different from the setting in the original SemEval evaluation, so the
results are not directly comparable.
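As promised above, a sketch of the sense-key conversion with NLTK:

# Sketch: map a gold-standard lemma sense key to its WordNet synset.
# Sense keys have the form lemma%ss_type:lex_filenum:lex_id:head:head_id;
# see the senseidx page linked above.
from nltk.corpus import wordnet as wn

def key_to_synset(sense_key):
    return wn.lemma_from_key(sense_key).synset()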
Discuss the results of your experiments with the three models. Also include a discussion of the successes
and difficulties faced by the models. Include sample output, some analysis, and suggestions for improvements. The entire report, including the description of your model, must be no longer than two pages.
Going beyond this length will result in a deduction of marks.
Your grade will depend on whether you adequately followed the guidelines above, whether you followed
standard model design and experimental procedure during the development of your method, and on the
quality of your report (both linguistic, and content).
Question 3: Reading Assignment — Compositional Distributional Semantics (30 points)
Read the following paper:
Jeff Mitchell and Mirella Lapata. Vector-based Models of Semantic Composition. ACL 2008.
http://aclweb.org/anthology/P/P08/P08-1028.pdf
Write a max. one-page (c. 500 words) discussion on this paper, including the following points:
1. A brief summary of the contents of the paper, including the theoretical framework and the experiments.
2. An evaluation and synthesis of what you learned in the paper. What are the advantages and
limitations of this work? How does it relate to the topics that we have discussed in class on
unsupervised learning and frame semantics?
3. Three questions related to the paper. These can be clarification questions, or questions about
potential extensions of the paper, or its relationship to other work.
What To Submit
Electronically: Submit a .pdf containing the written answers to Questions 1 and 3 as well as the report
part of Question 2, called ‘a3-written.pdf’. For the programming part of Question 2, you should submit one
zip file called ‘a3-q2.zip’ with your source code to MyCourses under Assignment 3.

ASSIGNMENT 4 COMP 550

Question 1: Multi-document Summarization (60 points)
This question asks you to implement a simple, but surprisingly effective algorithm for multi-document
summarization, SumBasic:
Ani Nenkova and Lucy Vanderwende. The Impact of Frequency on Summarization. Microsoft Research,
Redmond, Washington, Tech. Rep. MSR-TR-2005-101. 2005. https://www.cs.bgu.ac.il/~elhadad/
nlp09/sumbasic.pdf
a) Use a news aggregator tool such as Google News to find four clusters of articles on the same event or
topic. Each cluster should contain at least three articles, and each article should be of sufficient length
to generate an interesting summary from (at least 3–4 paragraphs).
You should clean the article text by removing all hyperlinks, formatting, titles and other items that are
not the textual body of the articles. Use any method to do this (including by hand). You may have to
deal with non-ASCII characters. You can handle them any way you like, including just replacing them
by a similar-looking ASCII character. Save your input into text files called docA-B.txt, where A is a
positive integer corresponding to the cluster number, and B is another positive integer corresponding to
the article number within that cluster. For example doc1-2.txt is the second article in the first cluster.
Put all of your documents inside a subfolder called /docs.
b) Implement SumBasic, as it is described in the lecture notes, in order to generate 100-word summaries
for each of your document clusters. Compare these two versions of the system:
1. orig: The original version, including the non-redundancy update of the word scores.
2. simplified: A simplified version of the system that holds the word scores constant and does not
incorporate the non-redundancy update.
Compare these versions against a third method, leading, which takes the leading sentences of one of
the articles, up until the word length limit is reached. You may decide on how to select the article
arbitrarily.
You should apply the standard preprocessing steps on your input documents, including sentence segmentation, lemmatization, ignoring stopwords and case distinctions. The main method that should run
your code should be in a file called sumbasic.py. Your code should be run using the following command
structure:
python sumbasic.py <method_name> <input_files>
And it should print the output summary to standard output.
For example, running
python ./sumbasic.py simplified ./docs/doc1-*.txt > simplified-1.txt
should run the simplified version of the summarizer on the first cluster, writing the output to a text file
called simplified-1.txt.
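For orientation, here is a compact sketch of the orig selection loop. It scores sentences by average word probability, which is one common reading of the algorithm; follow the lecture notes where they differ. sents is a list of preprocessed, non-empty token lists:

# Sketch: core SumBasic loop with the non-redundancy update.
from __future__ import division  # true division if run under Python 2
from collections import Counter

def sumbasic(sents, word_limit=100):
    counts = Counter(w for sent in sents for w in sent)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    summary, length = [], 0
    remaining = list(sents)
    while remaining and length < word_limit:
        # Pick the sentence with the highest average word probability.
        best = max(remaining, key=lambda s: sum(prob[w] for w in s) / len(s))
        summary.append(best)
        length += len(best)
        remaining.remove(best)
        # Non-redundancy update: square the probability of each used word.
        for w in set(best):
            prob[w] = prob[w] ** 2
    return summary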
c) Discuss the quality of each of the three methods. Does the non-redundancy update work as expected?
How are the methods successful or not successful? How would you order the summary sentences with
the SumBasic methods, or another extractive summarization approach? Be sure to cover all aspects of
summary quality that we discussed in class.
Question 2: Reading Assignment — Abstractive Summarization (40 points)
Read the following paper:
Trevor Cohn and Mirella Lapata. Sentence Compression Beyond Word Deletion. COLING 2008. http:
//www.aclweb.org/anthology/C08-1018.pdf
Write a max. one-page (c. 500 words) discussion on this paper, including the following points:
1. A brief summary of the contents of the paper, including the theoretical framework and the experiments.
2. The limitations of the approach. What do you suggest to address these limitations?
3. Relate this paper to the following concepts that we have discussed throughout the term: context-free
grammar, natural language generation and extractive summarization.
4. Three questions related to the paper. These can be clarification questions, or questions about
potential extensions of the paper, or its relationship to other work.
What To Submit
Before the deadline: You are encouraged to do a preliminary pass of the reading before November 30th,
when we will discuss the paper in class.
Electronically: Submit the written portions of the assignment in a single pdf file called ‘a4-written.pdf’.
For the programming part of Question 1, you should submit one zip file called ‘a4-q1.zip’ with your source
code, input document clusters, and output summaries. Both should be submitted to MyCourses under
Assignment 4.