## Description

### 1. n-gram models

Ex 1.

a) Load and tokenize the attached text `Plato_Republic.txt`. Convert all words to lower case so that tokens like 'The' and 'the' are grouped together. Compute the total number of words $N$ in the text and the number of unique words (the size of the vocabulary).
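A minimal sketch of a). A regex tokenizer stands in for `nltk.word_tokenize` (which additionally requires the 'punkt' data to be downloaded), and a short sample string stands in for the contents of `Plato_Republic.txt`:

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# In the exercise you would read the attached file instead:
# text = open("Plato_Republic.txt", encoding="utf-8").read()
text = "The city The JUSTICE of the city"  # tiny stand-in sample

tokens = tokenize(text)
N = len(tokens)          # total number of words
vocab = set(tokens)      # unique words (the vocabulary)
print(N, len(vocab))     # corpus size and vocabulary size
```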

b) Build a uni-gram model. Deduce the 5 most common words with at least 8 characters.

Hint: use the method `most_common` on an `nltk.FreqDist` object.
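For b), `collections.Counter` (of which `nltk.FreqDist` is a subclass) provides the `most_common` method mentioned in the hint; a sketch on stand-in tokens:

```python
from collections import Counter

# Stand-in tokens; in the exercise these come from the tokenized corpus.
tokens = ["philosophy", "justice", "the", "the", "philosophy",
          "guardians", "justice", "philosophy", "injustice"]

# nltk.FreqDist is a subclass of Counter, so it supports most_common too.
unigram = Counter(tokens)
long_words = Counter({w: c for w, c in unigram.items() if len(w) >= 8})
print(long_words.most_common(5))
```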

c) Build a bi-gram model and define a function that, given two words $(x_1, x_2)$, computes the probability

$$P(x_2 \mid x_1) = \frac{\#\{(x_1, x_2)\}}{\#\{x_1\}},$$

where $\#$ denotes the number of occurrences of the word (or pair of words) in the corpus.
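A sketch of the function for c), with counts from `collections.Counter` and a stand-in corpus:

```python
from collections import Counter

def make_bigram_prob(tokens):
    """Return a function p(x1, x2) estimating P(x2 | x1) by counting."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    def p(x1, x2):
        return bigram[(x1, x2)] / unigram[x1] if unigram[x1] else 0.0
    return p

# Stand-in corpus; in the exercise, use the tokenized text.
tokens = ["the", "just", "city", "the", "just", "soul", "the", "city"]
p = make_bigram_prob(tokens)
print(p("the", "just"))  # ("the", "just") occurs twice, "the" three times
```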

d) Deduce the so-called perplexity of the bi-gram model, defined as

$$PP = \left( \prod_{k=1}^{N-1} P(x_{k+1} \mid x_k) \right)^{-\frac{1}{N-1}},$$

where $N$ denotes the total number of words in the corpus.
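On a real corpus the product of $N-1$ probabilities underflows to zero in floating point, so a sketch for d) sums logarithms instead (every bigram in the product occurs in the corpus, so no term is zero):

```python
import math
from collections import Counter

def bigram_perplexity(tokens):
    """Bigram perplexity; summing logs avoids underflowing the product."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)
    log_prod = sum(math.log(bigram[(x1, x2)] / unigram[x1])
                   for x1, x2 in zip(tokens, tokens[1:]))
    return math.exp(-log_prod / (N - 1))

print(bigram_perplexity(["a", "b", "a", "b", "a"]))
```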

### 2. Recurrent Neural Networks

Ex 2.

The goal of this exercise is to experiment with a simple Recurrent Neural Network (RNN) model for predicting letters. We only consider the four letters "h", "e", "l" and "o", which we embed in $\mathbb{R}^4$:

$$\text{"h"} \to \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \quad \text{"e"} \to \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}, \quad \text{"l"} \to \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}, \quad \text{"o"} \to \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}.$$

We consider an RNN with hidden states $h_t \in \mathbb{R}^2$:

$$\begin{cases} h_t = \tanh(R h_{t-1} + A x_t) \\ y_t = B h_t \end{cases} \tag{1}$$

where $A \in M_{2,4}(\mathbb{R})$, $R \in M_{2,2}(\mathbb{R})$ and $B \in M_{4,2}(\mathbb{R})$ (e.g. $A$ is a $2 \times 4$ matrix).

a) Given the input "hello" (i.e. $x_1 = (1, 0, 0, 0), \dots, x_5 = (0, 0, 0, 1)$), the initial state $h_0 = (0, 0)$ and the matrices

$$A = \begin{pmatrix} 1 & -1 & -1/2 & 1/2 \\ 1 & 1 & -1/2 & -1 \end{pmatrix}, \quad R = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 1 \\ 1/2 & 1 \\ -1 & 0 \\ 0 & -1/2 \end{pmatrix},$$

find the outputs $y_1, \dots, y_5$ and deduce the predicted characters (see figure 1).
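The hand computation in a) can be checked numerically. A sketch of the forward pass of eq. (1) in numpy (`argmax` stands in for the softmax step of figure 1, which does not change which index wins):

```python
import numpy as np

# Matrices from part a)
A = np.array([[1., -1., -0.5,  0.5],
              [1.,  1., -0.5, -1. ]])
R = np.eye(2)
B = np.array([[ 1.,  1. ],
              [ 0.5, 1. ],
              [-1.,  0. ],
              [ 0., -0.5]])

letters = "helo"
onehot = {c: np.eye(4)[i] for i, c in enumerate(letters)}

h = np.zeros(2)
for c in "hello":
    h = np.tanh(R @ h + A @ onehot[c])  # hidden-state update, eq. (1)
    y = B @ h                           # decoded output y_t
    print(c, "->", letters[int(np.argmax(y))])
```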

b) Find matrices $A$, $R$, $B$ such that the predicted characters are "olleh".

[Figure 1 diagram: the five input letters "h", "e", "l", "l", "o" are embedded by $A$, the hidden state is propagated by $R$ (starting from $(0,0)$), and each hidden state is decoded by $B$ into a prediction.]

Figure 1: Predictions of a vanilla RNN. After encoding the letters (e.g. "h") into vectors (e.g. $x_1 = (1, 0, 0, 0)$), the network performs the operations described in eq. (1) to estimate a vector prediction (e.g. $y_1$). The 'letter' predicted is chosen as the index of the output with the largest value (i.e. the one-hot vector closest to the softmax of $y_1$).

Ex 3. [vanishing/exploding gradient]

We would like to illustrate one of the issues with vanilla RNNs, namely the vanishing or exploding gradient phenomenon. Rather than computing the gradient of the loss function, we are simply going to investigate how a small perturbation of the input $x_1$ affects the output $y_t$ (see figure 2).

[Figure 2 diagram: the vanilla RNN unrolled over time; a perturbation ($\pm 1$ direction) of $x_1$ propagates through repeated applications of $R$ before being decoded by $B$.]

Figure 2: To study how a perturbation of $x_1$ affects $y_t$, we suppose in this exercise that $x_2 = \dots = x_t = 0$ and $h_0 = 0$. Due to the iterations of the matrix $R$ in the estimation of $y_t$, the perturbation of $x_1$ can have a small or a large influence on $y_t$.

We consider a standard RNN defined by three matrices $A$, $R$, $B$ and $\sigma(x) = \tanh(x)$ (see figure 2).

a) Compute the differential $D_{h_{t-1}} h_t$, i.e. compute the differential of the function $h \mapsto \sigma(Rh + A x_t)$.

Deduce that:

$$\| D_{x_1} y_t \| \le \| B \| \cdot \left( \prod_{k=1}^{t} \left| \sigma'(R h_{k-1} + A x_k) \right|_\infty \right) \cdot \| R \|^{t-1} \cdot \| A \|, \tag{2}$$

where $\| \cdot \|$ denotes (any) matrix norm and $|\sigma'(h)|_\infty = \max(|\sigma'(h_1)|, \dots, |\sigma'(h_d)|)$, where $d$ is the dimension of the vector $h$.
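As a sketch of the chain-rule structure behind eq. (2) (writing $z_k = R h_{k-1} + A x_k$, so that $h_k = \sigma(z_k)$ componentwise):

```latex
D_{h_{t-1}} h_t = \operatorname{diag}\!\big(\sigma'(z_t)\big)\, R,
\qquad
D_{x_1} y_t
  = B \,\Big( \prod_{k=t}^{2} \operatorname{diag}\!\big(\sigma'(z_k)\big)\, R \Big)\,
    \operatorname{diag}\!\big(\sigma'(z_1)\big)\, A,
```

and taking norms, with $\|\operatorname{diag}(\sigma'(z_k))\| \le |\sigma'(z_k)|_\infty$, yields the $t$ factors $|\sigma'|_\infty$, the $t-1$ factors $\|R\|$, and the factors $\|A\|$ and $\|B\|$ of eq. (2).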

b) From now on, we take $t = 30$ and suppose $x, y, h \in \mathbb{R}^2$ with:

$$A = B = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad R = \begin{pmatrix} 1/2 & -1 \\ -1 & 1/2 \end{pmatrix}, \quad x_2 = x_3 = \dots = x_{30} = h_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

Denote by $x_1 = (0, 0)$ the input and by $y_{30}$ the output after $t = 30$ iterations. Similarly, denote by $x_1^\varepsilon = (\varepsilon, -\varepsilon)$ the perturbed input and by $y_{30}^\varepsilon$ the output after $t = 30$ iterations starting from $x_1^\varepsilon$.

Compute and plot (in log-log scale) the difference $\| y_{30} - y_{30}^\varepsilon \|$ for $\varepsilon \in (10^{-4}, \dots, 10^{-9})$. Relate the result to eq. (2).
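A minimal numpy sketch for b); the log-log plotting call is left as a comment so the sketch stays free of matplotlib. Note that $(1, -1)$ is an eigenvector of $R$ with eigenvalue $3/2$, which eq. (2) suggests will amplify the perturbation:

```python
import numpy as np

A = B = np.eye(2)
R = np.array([[0.5, -1.0],
              [-1.0, 0.5]])

def y_after(x1, t=30):
    """Output y_t of eq. (1) with x_2 = ... = x_t = 0 and h_0 = 0."""
    h = np.tanh(R @ np.zeros(2) + A @ x1)  # first step uses x_1
    for _ in range(t - 1):
        h = np.tanh(R @ h)                 # remaining inputs are zero
    return B @ h

eps = 10.0 ** -np.arange(4, 10)            # 1e-4 ... 1e-9
y0 = y_after(np.zeros(2))
diffs = [np.linalg.norm(y0 - y_after(np.array([e, -e]))) for e in eps]
for e, d in zip(eps, diffs):
    print(f"eps={e:.0e}  |y30 - y30^eps| = {d:.3e}")
# log-log plot (needs matplotlib): plt.loglog(eps, diffs)
```

Parts c) and Extra) reuse `y_after` with the other starting points.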

c) Proceed similarly to b) using $x_1 = (2, 1)$ and $x_1^\varepsilon = (2 + \varepsilon, 1 - \varepsilon)$. Why does the perturbation have a smaller effect in this case compared to b)?


Extra) Proceed similarly to b) using $x_1 = (0, 0)$ and $x_1^\varepsilon = (\varepsilon, \varepsilon)$. Why does the perturbation have a small effect here? In general, for a random perturbation, do you expect a small or large effect when $x_1 = (0, 0)$?
