COMS W4705 Homework 1: Building a Trigram Language Model solution




5/5 - (3 votes)

In this assignment you will work on a lexical substitution task, using both WordNet and pretrained Word2Vec word embeddings. This task was first proposed as a shared task
at SemEval 2007 Task 10. In this task, the goal is to find lexical substitutes for individual
target words in context. For example, given the following sentence:
“Anyway , my pants are getting tighter every day .” The goal is to propose an alternative
word for tight, such that the meaning of the sentence is preserved. Such a substitute could
be constricting, small or uncomfortable.
In the sentence
“If your money is tight don’t cut corners .” the substitute small would not fit, and instead
possible substitutes include scarce, sparse, constricted. You will implement a number of basic
approaches to this problem. You also have the option to improve your solution and enter
your approach into a course-wide competition (see part 6 below). Participating in the
competition does not affect grading.
Prequisites: Installing neccessary packages
The standard way to access WordNet in Python is now NLTK, the Natural Language Toolkit.
NLTK contains a number of useful resources, such as POS taggers and parsers, as well as
access to several text corpora and other data sets. In this assignment you will mostly use its
WordNet interface. To install NLTK, please follow the setup instructions here. If you use a
package manager (for example, most Linux distributions or macports or homebrew on mac)
you might want to install the package for NLTK instead. For example (in macports)
$ sudo port search nltk
py-nltk @3.0.4 (python, textproc)
Natural Language Toolkit
py27-nltk @3.0.4 (python, textproc)
Natural Language Toolkit
py34-nltk @3.0.4 (python, textproc)
Natural Language Toolkit
Found 3 ports.
$ sud port install py34-nltk
Once you have installed NLTK, you need to download the WordNet data. Run a Python
intepreter and then
$ python
Python 3.6.1 |Anaconda 4.4.0 (x86_64)| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type “help”, “copyright”, “credits” or “license” for more information.
>>> import nltk
This will open up a new window that lets you select add-on packages (data and models) for
NLTK. Switch to the corpora tab and select the “wordnet” package. While you are here, also
install the English stopword list in the “stopwords” package.
If you have trouble installing the data, please take a look at the documentation here. Next,
test your WordNet installation:
>>> from nltk.corpus import wordnet as wn
>>> wn.lemmas(‘break’, pos=’n’) # Retrieve all lexemes for the noun ‘break’
[Lemma(‘interruption.n.02.break’), Lemma(‘break.n.02.break’), Lemma(‘fault.n.04.break’), Lemma(‘
eak’), Lemma(‘respite.n.02.break’), Lemma(‘breakage.n.03.break’), Lemma(‘pause.n.01.break’), Lemma(‘fracture.n.
01.break’), Lemma(‘break.n.09.break’), Lemma(‘break.n.10.break’), Lemma(‘break.n.11.break’), Lemma(‘break.n.12
.break’), Lemma(‘break.n.13.break’), Lemma(‘break.n.14.break’), Lemma(‘open_frame.n.01.break’), Lemma(‘break.
>>> l1 = wn.lemmas(‘break’, pos=’n’)[0]
>>> s1 = l1.synset() # get the synset for the first lexeme
>>> s1
>>> s1.lemmas() # Get all lexemes in that synset
[Lemma(‘interruption.n.02.interruption’), Lemma(‘interruption.n.02.break’)]
>>> s1.lemmas()[0].name() # Get the word of the first lexeme
>>> s1.definition()
‘some abrupt occurrence that interrupts an ongoing activity’
>>> s1.examples()
[‘the telephone is an annoying interruption’, ‘there was a break in the action when a player was hurt’]
>>> s1.hypernyms()
>>> s1.hyponyms()
[Synset(‘dislocation.n.01’), Synset(‘eclipse.n.01’), Synset(‘punctuation.n.01’), Synset(‘suspension.n.04’)]
>>> l1.count() # Occurence frequency of this sense of ‘break’ in the SemCor corpus.
Gensim is a vector space modeling package for Python. While gensim includes a complete
implementation of word2vec (among other approaches), we will use it only to load existing
word embeddings. To install gensim, try
pip install gensim
Or use your package manager. You can find more detailed installation instructions here. In
most cases, installing gensim will automatically install numpy and scipy as dependency.
These are numerical and scientific computing packages for Python. If not, you can take a
look at the installation instructions
pre-trained Word2Vec embeddings
Finally, download the pre-trained word embeddings for this project here GoogleNewsvectors-negative300.bin.gz (Warning: 1.5GB file). These embeddings were trained using a
modified skip-gram architecture on 100B words of Google News text, with a context window
of +-5. The word embeddings have 300 dimensions.
You can test your gensim installation by loading the word vectors as follows.
>>> import gensim
>>> model = gensim.models.KeyedVectors.load_word2vec_format(‘./GoogleNews-vectors-negative300.bin.gz’, bin
This will take a minute or so. You can then obtain the vector representation for individual
words like this:
>>> v1 = model.wv[‘computer’]
>>> v1
array([ 1.07421875e-01, -2.01171875e-01, 1.23046875e-01,
2.11914062e-01, -9.13085938e-02, 2.16796875e-01,
-1.31835938e-01, 8.30078125e-02, 2.02148438e-01,
4.78515625e-02, 3.66210938e-02, -2.45361328e-02,
2.39257812e-02, -1.60156250e-01, -2.61230469e-02,
9.71679688e-02, -6.34765625e-02, 1.84570312e-01,
1.70898438e-01, -1.63085938e-01, -1.09375000e-01,
>>> len(v1)
You can also use gensim to compute the cosine similarity between two word vectors:
>>> model.similarity(‘computer’,’calculator’)
>>> model.similarity(‘computer’,’toaster’)
>>> model.similarity(‘computer’,’dog’)
>>> model.similarity(‘computer’,’run’)
Alternatively, you can use numpy to compute cosine similarity yourself. Recall that cosine
distance is defined as: cos(u,v) = (u · v) / (|u| |v|)
>>> import numpy as np
>>> def cos(v1,v2):
… return,v2) / (np.linalg.norm(v1)*np.linalg.norm(v2))
>>> cos(model.wv[‘computer’],model.wv[‘calculator’])
Also not necessary for this assignment, you can find some additional information about basic
usage of numpy here.
Getting Started
Please download the files and scaffolding needed for the lexical substitution project The archive contains the following files:
• lexsub_trial.xml – input trial data containing 300 sentences with a single target word
• gold.trial – gold annotations for the trial data (substitues for each word suggested by 5
• – an XML parser that reads lexsub_trial.xml into Python objects.
• – this is the main scaffolding code you will complete
• – the scoring script provided for the SemEval 2007 lexical substitution task.
You will complete the file and you should not have to touch any of the other
files. You should, however, take a look at lexsub_trial.xml and The function
read_lexsub_xml(*sources) in reads the xml data and returns an iterator over
Context objects. Each Context object corresponds to one target token in context. The
instance variables of Context are as follows:
• cid – running ID of this instance in the input file (needed to produce the correct output
for the scoring script).
• word_form – the form of the target word in the sentence (for example ‘tighter’).
• lemma – the lemma of the target word (for example ‘tight’).
• pos – this can be either ‘n’ for noun, ‘v’ for verb, ‘a’, for adjective, or ‘r’ for adverb.
• left_context – a list of tokens that appear to the left of the target word. For example
[‘Anyway’, ‘,’, ‘my’, ‘pants’, ‘are’, ‘getting’]
• right_context – a list of tokens that appear to the right of the target word. For example
Take a look at the main section of to see how to iterate through the Contexts
in an input file. Running the program with a .xml annotation file as its parameter will just
print out a representation for each context object.
$ python lexsub_trial.xml
<Context_1/bright.a During the siege , George Robertson had appointed Shuja-ul-Mulk , who was a *bright* boy o nly 12 years old and the youngest surviving son of Aman-ul-Mulk , as the ruler of Chitral .>
<Context_2/bright.a The actual field is not much different than that of a 40mm , only it is smaller and quite a bit no ticeably *brighter* , which is probably the main benefit .>

Next, take a look at the file The main section of that file loads the XML file,
calls a predictor method on each context, and then print output suitable for the SemEval
scoring script. The purpose of the predictor methods is to select an appropriate lexical
substitute for the word in context. The method that is currently being called
is smurf_predictor(context). This method simply suggests the word smurf as a substitute for all
target words. You can run and redirect the output to a file.
$ python lexsub_trial.xml > smurf.predict
$ head smurf.predict # print the first 10 lines of the file
bright.a 1 :: smurf
bright.a 2 :: smurf
bright.a 3 :: smurf
bright.a 4 :: smurf
bright.a 5 :: smurf
bright.a 6 :: smurf
bright.a 7 :: smurf
bright.a 8 :: smurf
bright.a 9 :: smurf
bright.a 10 :: smurf
The output indicates that the predictor selected the word smurf for the adjective bright in
context 1, etc. You can then run the scoring script (which is written in perl) on the predict
$ perl smurf.predict gold.trial
Total = 298, attempted = 298
precision = 0.000, recall = 0.000
Total with mode 206 attempted 206
precision = 0.000, recall = 0.000
Unsurprisingly, the smurf predictor does not perform well. Some clarifications:
• The return value of the predictor methods is a single string containing a lemma. The
word does not have to be inflected in the same way as the original word form that is
• The original SemEval task allows multiple predictions and contains an “out of 10”
evaluation (accounting for the fact that this task is difficult for human annotators too).
For this assignment, we are limiting ourselves to predicting only a single substitute.
Part 1: Candidate Synonyms from WordNet (18 pts)
Write the function get_candidates(lemma, pos) that takes a lemma and part of speech
(‘a’,’n’,’v’,’r’) as parameters and returns a set of possible substitutes. To do this, look up the
lemma and part of speech in WordNet and retrieve all synsets that the lemma appears in.
Then obtain all lemmas that appear in any of these synsets. For example,
>>> get_candidates(‘slow’,’a’)
{‘deadening’, ‘tiresome’, ‘sluggish’, ‘dense’, ‘tedious’, ‘irksome’, ‘boring’, ‘wearisome’, ‘obtuse’, ‘dim’, ‘dumb’, ‘dull’, ‘
Make sure that the output does not contain the input lemma itself. The output can contain
multiword expressions such as “turn around”. WordNet will represent such lemmas as
“turn_around”, so you need to replace the _.
Part 2: WordNet Frequency Baseline (18 pts)
Write the function wn_frequency_predictor(context) that takes a context object as input and
predicts the possible synonym with the highest total occurence frequency (according to
WordNet). Note that you have to sum up the occurence counts for all senses of the word if
the word and the target appear together in multiple synsets. You can use the get_candidates
method or just duplicate the code for finding candidate synonyms (this is possibly more
convenient). Using this simple baseline should give you about 10% precision and recall. Take
a look at the output to see what kinds of mistakes the system makes.
Part 3: Simple Lesk Algorithm (18 pts)
Implement the function wn_simple_lesk_predictor(context). This function uses Word Sense
Disambiguation (WSD) to select a synset for the target word. It should then return the most
frequent synonym from that synset as a substitute. To perform WSD, implement the simple
Lesk algorithm. Look at all possible synsets that the target word apperas in. Compute the
overlap between the definition of the synset and the context of the target word. You may
want to remove stopwords (function words that don’t tell you anything about a word’s
semantics). You can load the list of English stopwords in NLTK like this:
stop_words = stopwords.words(‘english’)
The main problem with the Lesk algorithm is that the definition and the context do not
provide enough text to get any overlap in most cases. You should therefore add the following
to the definition:
• All examples for the synset.
• The definition and all examples for all hypernyms of the synset.
Even with these extensions, the Lesk algorithm will often not produce any overlap. If this is
the case (or if there is a tie), you should select the most frequent synset (i.e. the Synset with
which the target word forms the most frequent lexeme, according to WordNet). Then select
the most frequent lexeme from that synset as the result. One sub-task that you need to solve
is to tokenize and normalize the definitions and exmaples in WordNet. You could either look
up various tokenization methods in NLTK or use the tokenize(s) method provided with the
code. In my experiments, the simple lesk algorithm did not outperform the WordNet
frequency baseline.
Part 4: Most Similar Synonym (18 pts)
You will now implement approaches based on Word2Vec embeddings. These will be
implemented as methods in the class Word2VecSubst. The reason these are methods is that
the Word2VecSubst instance can store the word2vec model as an instance variable. The
constructor for the class Word2VecSubst already includes code to load the model. You may
need to change the value of W2VMODEL_FILENAME to point to the correct file.
Write the method predict_nearest(context) that should first obtain a set of possible
synonyms from WordNet (either using the method from part 1 or you can rewrite this code
as you see fit), and then return the synonym that is most similar to the target word,
according to the Word2Vec embeddings. In my experiments, this approach worked slightly
better than the WordNet Frequency baseline and resulted in a precision and recall of about
Part 5: Context and Word Embeddings (18 pts)
In this part, you will implement the method predict_nearest_with_context(context). One
problem of the approach in part 4 is that it does not take the context into account. Like the
model in part 2, it ignores word sense. There are many approaches to model context in
distributional semantic models. For now, we will do something very simple. First create a
single vector for the target word and its context by summing together the vectors for all
words in the sentence, obtaining a single sentence vector. Then measure the similarity of the
potential synonyms to this sentence vector. This works better if you remove stop-words and
limit the context to +-5 words around the target word. In my experiments, this approach
resulted in a precison and recall of about 12%.
Part 6: Other ideas? (and competition) (10 pts)
By now you should have realized that the lexical substitution task is far from trivial. In this
part, you should implement your own refinements to the approaches proposed above, or
even a completely different approach. Any small improvement will do to get credit for part
6. When you submit the project, running should run your best predictor for
this problem (i.e. you can name that predictor function anything you like).