## Description

In this assignment you will train an RNN to synthesize English text character

by character. You will train a vanilla RNN with outputs, as described in

lecture 9, using the text from the book The Goblet of Fire by J.K. Rowling.

The variation of SGD you will use for the optimization will be AdaGrad.

The final version of your code should contain these major components:

• Preparing Data: Read in the training data, determine the number

of unique characters in the text and set up mapping functions – one

mapping each character to a unique index and another mapping each

index to a character.

• Back-propagation: The forward and the backward pass of the backpropagation algorithm for a vanilla RNN to efficiently compute the

gradients.

• AdaGrad updating your RNN’s parameters.

• Synthesizing text from your RNN: Given a learnt set of parameters for the RNN, a default initial hidden state h0 and an initial input

vector, x0, from which to bootstrap from then you will write a function

to generate a sequence of text.

## Background 1: A Vanilla RNN

The mathematical details of the RNN you will implement are as follows.

Given a sequence of input vectors, x1, . . . , xτ , where each xt has size d × 1

and an initial hidden state h0, the RNN outputs at each time-step t a vector

of probabilities, pt (K × 1), for each possible character and a hidden state

ht+1 for size m × 1.

That is

for t = 1, 2, . . . , τ

at = W ht−1 + Uxt + b (1)

ht = tanh(at) (2)

ot = V ht + c (3)

pt = SoftMax(ot) (4)

The loss for a single labelled sequence is the sum of the cross-entropy loss

for each

L(x1:τ , y1:τ , Θ) = Xτ

t=1

lt = −

Xτ

t=1

log(y

T

t pt) (5)

1

where Θ = {b, c, W, U, V }, x1:τ = {x1, . . . , xτ } and y1:τ is defined similarly.

The equations for the gradient computations of the back-propagation algorithm for such an RNN are given in Lecture 9. Note in the lecture notes the

bias vectors have been omitted. It is left as an exercise for you to compute

the gradient w.r.t. the two bias vectors.

## Background 2: AdaGrad algorithm for optimization

In this assignment you will implement the variant of SGD called AdaGrad.

To refresh your memory the update steps for AdaGrad are defined as

mθ,t0 = mθ,t0−1 + g

2

t

0 (6)

θt

0+1 = θt

0 −

η

√mθ,t0 +

gt

0 (7)

where

• θ is a generic place holder for the parameter vector/matrix under

consideration,

• t

0

refers to the iteration of the SGD update (not to be confused with the

t used to denote the input and output vectors of the labelled training

sequence),

• gt

0 is the gradient vector ∂L

∂θ

and

• in an abuse of notation the operations of division, raising to the power

of two and square root are applied to each entry of the vector/matrix

independently.

(You could also try RMSProp updates where

mθ,t0 = γmθ,t0−1 + (1 − γ) g

2

t

0 (8)

θt

0+1 = θt

0 −

η

√mθ,t0 +

gt

0 (9)

Common values settings for γ and η are .9 and .001 respectively. However, I

found for this update step there are more problems with exploding gradients

and you have to be more careful with how you clip your gradients.)

## Exercise 1: Implement and train a vanilla RNN

In the following I will sketch the different parts you will need to write to

complete the assignment. Note it is a guideline. You can, of course, have a

different design, but you should read the outline to help inform how different

parameters and design choices are made.

0.1 Read in the data

First you need to read in the training data from the text file of The Goblet

of Fire available for download at the Canvas webpage. To save you some

time here is code that will read in the contents of this text file.

book fname = ’data/Goblet.txt’;

fid = fopen(book fname,’r’);

book data = fscanf(fid,’%c’);

fclose(fid);

All the characters of the book are now in the vector book data. To get a

vector containing the unique characters in book data apply the Matlab function unique. Once you have this list, which we will denote by book chars,

then its length K corresponds to the dimensionality of the output (input)

vector of your RNN.

To allow you to easily go between a character and its one-hot encoding and

in the other direction you should initialize map containers of the form:

char to ind = containers.Map(’KeyType’,’char’,’ValueType’,’int32’);

ind to char = containers.Map(’KeyType’,’int32’,’ValueType’,’char’);

Then for char to ind you should fill in the characters in your alphabet as its

keys and create an integer for its value (keep things simple and use where the

character appears in the vector book chars as its value). And similarly for

ind to char fill in the integers 1 to K as its keys and assign the appropriate

character value for each integer.

You will use these map containers when

you convert a sequence of characters into a sequence of vectors of one-hot

encodings and then when you convert a synthesized sequence of one-hot

encodings back into a sequence of characters.

0.2 Set hyper-parameters & initialize the RNN’s parameters

The one hyper-parameter you need to define the RNN’s architecture is the

dimensionality of its hidden state m. For this assignment you should set

m=100.

The other hyper-parameters you need to set are those associated

with training and these are the learning rate eta and the length of the

input sequences (seq length) you use during training. Here are the default

settings for this assignment eta=.1 and seq length=25.

In my code I found it easiest to store the parameters of the model in an

object called RNN. I initialized the bias vectors RNN.b and RNN.c to be zero

vectors of length m×1 and K×1. Note for this task the dimensionality of

the input and output vectors are the same. While the weight matrices are

randomly initialized as

RNN.U = randn(m, K)*sig;

RNN.W = randn(m, m)*sig;

RNN.V = randn(K, m)*sig;

where I set sig = .01.

0.3 Synthesize text from your randomly initialized RNN

Before you begin training your RNN, you should write a function that will

synthesize a sequence of characters using the current parameter values in

your RNN.

Besides RNN, it will take as input a vector h0 (the hidden state

at time 0), another vector x0 which will represent the first (dummy) input

vector to your RNN (it can be some character like afull-stop), and an integer

n denoting the length of the sequence you want to generate. In the body of

the function you will write code to implement the equations (1-4). There is

just one major difference – you have to generate the next input vector xnext

from the current input vector x.

At each time step t when you generate a

vector of probabilities for the labels, you then have to sample a label (i.e. an

integer) from this discrete probability distribution. This sample will then

be the (t + 1)th character in your sequence and will be the input vector for

the next time-step of your RNN.

Here is one way to randomly select a character based on the output probability scores p:

cp = cumsum(p);

a = rand;

ixs = find(cp-a >0);

ii = ixs(1);

First you compute the vector containing the cumulative sum of the probabilities. Then you generate a random draw, a, from a uniform distribution

in the range 0 to 1. Next you find the index 1≤ii≤K such that cp(ii-1)

≤ a ≤ cp(ii) where we assume for notational convenience cp(0)=0.

You

should store each index you sample for 1≤t≤n and let your function output

the matrix Y (size K×n) where Y is the one-hot encoding of each sampled

character. Given Y you can then use the map container ind to char to

convert it to a sequence of characters and view what text your RNN has

generated.

0.4 Implement the forward & backward pass of back-prop

Next up is writing the code to compute the gradients of the loss w.r.t. the

parameters of the model. While you write this code, you should use the first

seq length characters of book data as your labelled sequence for debugging

that is

X chars = book data(1:seq length);

Y chars = book data(2:seq length+1);

Note the label for an input character is the next character in the book.

Once you have X chars and Y chars, you then have to convert them to the

matrices X and Y containing the one-hot encoding of the characters of the

sequence.

Both X and Y have size K×seq length and each column of the

respective matrices corresponds to an input vector and its target output

vector. You should also set h0 to the zero vector. Given this labelled

sequence and initial hidden state you are in a position to write and call a

function that performs the forward-pass of the back-prop algorithm.

This

function should apply the equations (1-4) to the input data just described

and return the loss and also the final and intermediary output vectors at

each time step needed by the backward-pass of the algorithm.

Once you have computed the forward-pass then the next step is to write the

code for the backward pass of the back-prop algorithm. Here you should

implement the equations given in Lecture 9.

One piece of advice is that you should store each of the computed gradients

in an object (with the same names as for your RNN object), such as grads.W

etc. This will allow you to write more streamlined code to check your analytical gradients against their numerical counterparts and to implement the

AdaGrad updates.

This is because you can access the names within an object easily, so for instance, to perform a vanilla SGD update step for all the

parameters the code would look something like:

for f = fieldnames(RNN)’

RNN.(f{1}) = RNN.(f{1}) – eta * grads.(f{1});

end

After you have written the code to compute the forward and backward pass,

you then have to, as per usual check your gradient computations numerically.

On the Canvas website I have provided a Matlab function that computes the

gradients numerically. The function assumes you have stored the parameters

of your RNN in an object called RNN and also it calls a function ComputeLoss

that computes the loss for an input sequence of length 25 stored in a matrix X

of size K×25 and corresponding ground truth output matrix, Y of size K×25,

that has the label for each input in the sequence.

The numerical gradient

computations also sets h0 to be zero for each gradient computation. In my

code I set the step size for the numerical computations to h=1e-4 and I get

a max relative error of around 8.5344e-06 (for RNN.V) when I set m=5. If

you increase the value of m then you will need to increase the value of sig

used in the parameter initialization to combat numerical precision issues.

Once you are sure your gradient computations are correct then you should

clip your gradients to avoid the exploding gradient problem. Something like

this should work:

for f = fieldnames(grads)’

grads.(f1) = max(min(grads.(f1), 5), -5);

end

0.5 Train your RNN using AdaGrad

You are now ready to write the high-level loop to train your RNN with the

text in book data. The general high-level approach will be as follows. Let e

(initialized to 1) be the integer that keeps track of where in the book you are.

At each iteration of the SGD training you should grab a labelled training

sequence of length seq length characters. Thus your sequence of input

characters corresponds to book data(e:e+seq length-1) and the labels for

this sequence is book data(e+1:e+seq length). You should convert these

sequence of characters into the matrices X and Y (the one-hot encoding

vectors of each character in the input and output sequences).

However, before you pass this labelled sequence into your forward and backward functions you also need to define hprev. If e=1 then hprev should be

the zero vector while if e>1 then hprev should be set to the last computed

hidden state by the forward pass in the previous iteration. Thus (hopefully)

you have a hprev that has stored the context of all the prior characters it

has seen so far in the book! Now you have all the inputs needed for the

forward and backward pass functions to compute the gradient. Once you

have computed the gradients then you can apply the AdaGrad update step

to all the parameters of your RNN.

Your forward pass function should also return the loss for the labelled training sequence. As we are implementing SGD the loss from one training

sequence to the next will vary alot and also it is too expensive to compute

the loss of the entire training data, it useful to keep track of a smoothed

version of the loss over the iterations with a weighted sum of the smoothed

loss and the current loss such as:

smooth loss = .999* smooth loss + .001 * loss;

You should print out smooth loss regularly (say after every 100th update

step) to see if the smoothed loss is, in general, reducing. What I found is

that learning is initially very fast and then it slows. After the 1st epoch

learning is much slower and you can see the smoothed low going up and

down according to which part of the novel is harder or easier to predict,

but at corresponding points in the novel there is a general trend for the

smoothed loss to get gradually smaller. (Note the lower values I saw for the

smooth loss (∼ 7th epoch of training, 300,000 update steps) were ∼39.

You will probably reach this loss value at a much earlier stage of training

though maybe not as consistently throughout the text.)

You should also synthesize text (of length around 200 characters) from your

RNN regularly (say after every 500th update step) during training (you can let

out a shout of hurrah when you see your first synthesized Harry, Hermione,

Dumbledore, or . . .).

This allows you to see if your training is doing something sensible. You can do this by calling your function where h0 is the same

hprev as used in the forward pass and x0 is X(:, 1) (the first character of

the labelled input sequence for the current iteration).

At the end of an update step you should then increase your the counter e

by seq length. If this results in e>length(book data)-seq length-1 then

you should reset e to be 1 and loop through the characters in the book

again. When you reset e you have completed one epoch of training. Also

when you reset e to 1 you should also reset hprev to the zero-vector.

To help you debug here are snapshots of text sequences I generated at the

different stages of my training:

iter = 1, smooth loss=109.236

C’:}3 By/KOFOeUOyb eD6GJ_yq ^oQIf.pFHWZaz(rPD2c,:CVjdAg);Q!?x3eqS9qyDme;g)L-XUo}t Cg }F.Bz3wEXO!Yqs aiZ -wZwfZ

)eLy}t/e)w-}b7MCt7Mu^GQ(/SFB”9nx

exEoA/ 3Hn7lxcsuQD/^SDKn;K/EsRKjEH b’Q:cR-wWzrp

dUs-zd!

iter = 1000, smooth loss=86.412

inntho hordcad ghhoke cohe Ho Haou oban? nuithicwtcmar thim Brrtorter ighatulf aid thane ci,. aog goed ait Pmhath ae he cetheag Wel land toinn PhTcouditer = 4000, smooth loss=60.063

olryy. He qalice men?”

bigpoud.

Souf’gh ait.”d ath asing. Har or ande, the fiop “”I. “Seang,.

.

bage; fomtef pood Dotly, Mv. Houndny.

“.”

horandy, bay I or.”Woom Mroighin Hasrotvart fowhertent. Hadry

iter = 30000, smooth loss=47.612:

iless youd ane he parefins that ware Dues are Ske he flrars out.. Then , Harry in to, tham ham was lly doue – feepemer lacply into meach that preen aiter = 363000, smooth loss=39.450996:

ore was and Encorks.

“Eut shave thinks his fist theaghed of Harry.

“Mr. If ack,” said Ron beroncy, no tischen want they strating Serming. “De Dragry,” said Ron. I would to wizards spmetwer scrough t

To complete the assignment:

To pass the assignment you need to upload:

1. The code for this assignment.

2. A brief pdf report with the following content:

i) State how you checked your analytic gradient computations and

whether you think that your gradient computations are bug free

for your RNN.

ii) Include a graph of the smooth loss function for a longish training

run (at least 2 epochs).

iii) Show the evolution of the text synthesized by your RNN during

training by including a sample of synthesized text (200 characters

long) before the first and before every 10,000th update steps when

you train for 100,000 update steps.

iv) A passage of length 1000 characters synthesized from your best

model (the one that achieved the lowest loss).

## Exercise 2: Optional for bonus points

1. Synthesize Donald Trump tweets instead of Harry Potter

It would be fun to take the same approach in the assignment and apply

it to generating tweets. Here we have the constraint that each sequence

can be at most 140 characters long. I haven’t done this myself but the

adjustments I would initially make to the assignment implementation

would be

• Add an end-of-tweet character to your set of possible characters.

• When you synthesize text have a hard limit of 140 characters.

• During training I would re-initialize hprev to its default value

after each training tweet.

• Maybe play around with the length of the sequence during training.

Here is a link to a repository of his tweets trump tweet data archive

where you can download them from different years and use them for

training.

Bonus Points Available: 3 points if you are able to generate tweets that

bare some resemblence to a Donald Trump tweet. This is, of course,

subjective but I will not be too harsh in my judgments. I think if your

RNN generates sentences with some of his favourite phrases bad, fake

news, loser, . . . that should be some indication that your RNN is on

its way to emulating the “Donald”.

To get the bonus point you must submit

(a) Your code.

(b) A pdf document reporting briefly what changes (beyond cosmetic

trivial ones) to the assignment implementation you made and

containing synthesized tweets at different stages during training.