## Description

Neural Language Model

In this assignment, we will train a multimodal log bilinear language model. In particular, we will

deal with a dataset which contains data of two modalities, i.e., image and text. An instance of

the dataset consists of an image and several associated sentences. Each sentence is a so-called

caption of the image which describe its content. The overall goal of the neural language model is

to generate the caption given an image. Note that a caption (sentence) is generated word by word

conditioned on both the image and a fixed size context. The context of the word just means a

fixed-size contiguous sequence of words ahead of it.

To understand the model, we first clarify the notations. An image is represented as a feature

vector x 2 RM. A sentence is a sequence of words where each word w in the vocabulary is

represented as a D-dimensional real valued vector rw 2 RD. Let R denote the K ⇥ D matrix of

word representation vectors where K is the vocabulary size.

We now describe the multimodal log bilinear language model which is slightly di↵erent from the

one described in section 4.1 of [4]. Given the image feature, we first predict the word representation

as follows,

ˆr =

n

X1

i=1

C(i)

rwi

!

+ C(m)

g (Jx + h), (1)

where n is the context size and is a hyperparameter of the model. rwi is the representation of i-th

word, i.e., wi-th row of matrix R. J and h are weight matrix and bias vector respectively. g is

the Rectified Linear Unit (ReLU) [3] function. {C(i)

|i = 1, …, n 1} are D ⇥ D context matrices

1https://markus.teach.cs.toronto.edu/csc321-2017-01

2http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/syllabus.pdf

1

CSC321 Winter 2017 Programming Assignment 2

of text modality. C(m) is a D ⇥ M context matrix of image modality. The conditional probability

P(wn|w1,…,wn1, x) is then computed as

P(wn|w1,…,wn1, x) = exp(ˆr>ri + bi)

PK

j=1 exp(ˆr>rj + bj )

, (2)

where bi is a bias vector associated with the i-th word. The overall set of parameters to be learned

is summarized as {J, C(i)

, C(m)

, h, bi, R|i = 1,…,n}.

Based on the above conditional probability, we use beam search to generate the caption. To

briefly explain how it works, we assume the vocabulary size and beam width are V and N respectively. For the first word of the caption, we sort the vocabulary in descending order based on their

conditional probabilities and take the first N as candidates. For the second one, we compute the

conditional probabilities given the first word is one of the N candidates from last step. In this way,

we will have V ⇥ N possible sequences of length 2. We then take the first N sequences of which

the probabilities are top N. This procedure is repeatedly applied until the generated sequences are

of desired length.

We train the model using the cross-entropy criterion, which is equivalent to maximizing the

probability it assigns to the targets in the training set. Hopefully it will also learn to make sensible

predictions for sequences it hasn’t seen before.

Dataset

Download and extract the archive from the CSC321 homework page http://www.cs.toronto.edu/

~rgrosse/courses/csc321_2017/. The dataset inside the archive is a randomly sampled subset

of the Microsoft COCO dataset (http://mscoco.org/). It contains 1000, 250 and 250 instances

for training, validation and testing respectively.

For each image, we provide the hidden representation of last fully connected layer of VGG-16

as the feature vector which is of size 4096. Feature vectors are stored in:

/data/train.npy

/data/val.npy

/data/test.npy

The corresponding image lists are stored in:

/data/train_img_id.txt

/data/val_img_id.txt

/data/test_img_id.txt

The associated sentences are stored in:

/data/sentences_coco_train.json

/data/sentences_coco_val.json

We also provide initial word embedding, i.e., initial values of R, in the file /data/word_embeddings.p.

The size of word embedding is 100. The vocabulary consists of 858 words.

2

CSC321 Winter 2017 Programming Assignment 2

Questions

Visualize Word Embedding (10%)

To get the feel of what word embeddings are, you can use t-SNE [2] to visualize these high dimensional vectors in a 2-dimensional space. We already provide the implementation in the file

visual_word_embed.py. By running it, you can find how the words are distributed. In this question, you are asked to pick up some subset of the whole vocabulary and explain what you have

found inside the visualization (Hint: you can select words such as nouns and verbs which can easily

form clusters in the visualization). Pick up several pairs of words, e.g., (man, woman), (man, sugar)

and then compute the cosine similarity between word embeddings of each pair. Explain what you

have found from these similarity scores.

Implement Forward Pass of Model (40%)

In this question, you will be asked to implement the forward pass of neural language model. Specifically, looking into the mlbl.py, you should fill in the blanks of the member function forward based

on your understanding of the model. In the code file your implementation should follow after the

comment as below.

########################################################################

# You should implement forward pass here!

########################################################################

In terms of programming, forward pass only involves basic NumPy operations. Since we will

use autograd for the backward pass of the model, you will have to use the customized NumPy

of autograd for the forward pass. You should be careful with the restrictions of the customized

NumPy operations. For example, use np.dot(A, B) instead of A.dot(B). Please refer to the

tutorial of autograd (https://github.com/HIPS/autograd/blob/master/docs/tutorial.md) for

more details.

Implement Backward Pass of Model (20%)

In this question, you will be asked to implement the backward pass of neural language model.

Looking into the mlbl.py, you should fill in the blanks of the member function backward based on

your understanding of the model. You will use grad function of autograd for this question. Please

refer to the document of autograd for detailed instructions. In the code file your implementation

should follow after the comment as below.

########################################################################

# You should implement backward pass here!

########################################################################

Train and Evaluate Model (30%)

Once you correctly implemented the forward and backward passes, you can train the model by

simply running run_trainer.py. During training, the program will periodically open a webpage

which displays 2 example validation images and the corresponding generated captions. All hyperparameters, e.g., learning rate, momentum, are set in the file trainer.py. You can play around

with di↵erent values of hyperparameters to see what happens.

3

CSC321 Winter 2017 Programming Assignment 2

We use BLEU [1] score to quantitatively evaluate our generated caption against the ground

truth caption. By training the model for around 15 to 20 minutes, you should be able to see some

sensible captions and get around 6.0 bleu score on the validation set. Note the generated word

’unk’ means unknown which is a dummy element added into the vocabulary.

To test the model, you can just run run_tester.py. Since we only use a subset of the whole

dataset and model is fairly simple, the performance is far from perfect. However, it should be able

to generate some reasonable sentences.

In this question, you will be asked to:

• Try di↵erent choices of learning rate and momentum and compare the corresponding validation performance. You should at least experiment with two di↵erent values of learning

rate and momentum respectively (totally 4 settings including the one already in the code).

Show the curve of BLEU scores on the validation set during training (Hint: you can change

prog[’_bleu’] in the trainer.py to modify how often the program computes the BLEU

score). Explain why one set of hyperparameters might work better than another.

• Choose the best model based on the validation performance. Test the model and report the

BLEU score on the testing set.

• To understand what the model has learned, you can run run_interpreter.py to obtain the

top-K probable candidates out of the whole vocabulary for each word in the sentence, where

K is the beam width. You can tweak the beam width and explain what you found from the

candidate words. For example, you may find that word ’am’ follows word ’I’ at a very high

probability.

• Look through the results of good or bad captions and trying to explain why they occurred.

What To Submit

For reference, here is everything you need to hand in:

• A PDF file a2-writeup.pdf, typeset using LATEX, containing your answers to the conceptual

questions.

• Your implementation of mlbl.py.

References

[1] Papineni, K., Roukos, S., Ward, T. and Zhu, W.J., 2002, July. BLEU: a method for automatic

evaluation of machine translation. In Proceedings of the 40th annual meeting on association for

computational linguistics (pp. 311-318). Association for Computational Linguistics.

[2] Maaten, L.V.D. and Hinton, G.E., 2008. Visualizing data using t-SNE. Journal of Machine

Learning Research, 9(Nov), pp.2579-2605.

[3] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

[4] Kiros, R., Salakhutdinov, R. and Zemel, R.S., 2014, June. Multimodal Neural Language Models.

In International conference on machine learning (Vol. 14, pp. 595-603).

4

CSC321 Winter 2017 Programming Assignment 2

[5] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollr, P. and Zitnick,

C.L., 2014, September. Microsoft coco: Common objects in context. In European Conference

on Computer Vision (pp. 740-755). Springer International Publishing.

5