APM 598: Homework 3 solution

\$30.00

Category:

Description

1 n-gram models
Ex 1.
a) Load and tokenize the text attached ’Plato_Republic.txt’.
Put all the words in lower case to regroup words like ’The’ and ’the’.
Compute the total number of words N in the text and the number of unique words
(size of the vocabulary).
b) Build a uni-gram. Deduce the 5 most common words with at least 8 characters.
Hint: use the method ’most_common’ on an object ’nltk.FreqDist’.
c) Build a bi-gram and define a function that given two words (x1, x2) compute the
probability:
P(x2|x1) = #{(x1, x2)}
#{x1}
where # denotes the number of occurences of the word (or pair of words) in the
corpus.
d) Deduce the so-called perplexity of the bi-gram model defined as:
P P =

Y
k=1..(N−1)
P(xk+1|xk)

− 1
N−1
where N denotes the total number of words in the corpus.
2 Recurrent Neural Networks
Ex 2.
The goal of this exercise is to experiment with a simple Recurrent Neural Network
(RNN) model for predicting letters. We only consider four letters ”h”, ”e”, ”l” and ”o”
that we embed in R
4
:
”h” →


1
0
0
0


, ”e” →


0
1
0
0


, ”l” →


0
0
1
0


, ”o” →


0
0
0
1


.
1
We consider a RNN with hidden states ht
in R
2
:
(
ht = tanh(Rht−1 + Axt)
yt = Bht
(1)
where A ∈ M2,4(R), R ∈ M2,2(R) and B ∈ M4,2(R) (e.g. A is a 2 × 4 matrix).
a) Given the input ”hello” (i.e. x1 = (1, 0, 0, 0), . . . , x5 = (0, 0, 0, 1)), the initial state
h0 = (0, 0) and the matrices:
A =

1 −1 −1/2 1/2
1 1 −1/2 −1
#
, R =

1 0
0 1 #
, B =

1 1
1/2 1
−1 0
0 −1/2

,
find the output y1, . . . , y5 and deduce the predicted characters (see figure 1).
b) Find matrices A, R, B such that the predicted characters are ”olleh”.
“h” “e” “l” “l” “o”
A
B
R
(0,0)
A
B
A
B
A
B
A
B
? ? ? ? ?
embedding
decoding
Figure 1: Predictions of a vanilla RNN. After encoding the letters (e.g. “h”) into vectors
(e.g. x1 = (1, 0, 0, 0)), the network performs the operations described in eq. (1) to
estimate a vector prediction (e.g. y1). The ’letter’ predicted is chosen as the index of the
output with the largest value (i.e. find the hot vector the closest to (softmax) of y1).
2
We would like to illustrate one of the issue with vanilla RNN, namely the vanishing
or exploding gradient phenomenon. Rather than computing the gradient of the loss
function, we simply are going to investigate how a small perturbation in the input x1 will
affect the output yt (see figure 2).
A
B B
B
R R
Vanilla RNN
+1
-1
Figure 2: To study how a perturbation of x1 affects yt
, we suppose in this exercise that
x2 = . . . xt = 0 and h0 = 0. Due to the iterations of the matrix R in the estimation of
yt
, the perturbation of x1 could have small or large influence on yt
.
We consider a standard RNN defined with three matrices A, R, B and σ(x) = tanh(x)
(see figure 2).
a) Compute the differential Dht−1ht
, i.e. compute the differential of the function
h → σ(Rh + Axt).
Deduce that:
kDx1 ytk ≤ kBk · Y
t
k=1

0
(Rhk−1 + Axk)|∞
!
· kRk
t−1
· kAk, (2)
where k.k denotes (any) matrix norm and |σ
0
(h)|∞ = max(|σ
0
(h1)|, . . . , |σ
0
(hd)|)
where d is the dimension of the vector h.
b) From now on, we take t = 30 and suppose x, y, h ∈ R
2 with:
A = B =

1 0
0 1 #
, R =

1
2 −1
−1
1
2
#
, x2 = x3 = · · · = x30 = h0 =

0
0
!
.
Denote x1 = (0, 0) and y30 the output after t = 30 iterations.
Similarly, denote the perturbation x
ε
1 = (ε, −ε) and y
ε
30 the output after t = 30
iterations starting from x
ε
1
.
Compute and plot (in log-log scale) the difference ky30−y
ε
30k for ε ∈ (10−4
, . . . , 10−9
).
Relate the result with eq. (2).
c) Proceed similarly as b) using x1 = (2, 1) and x
ε
1 = (2 + ε, 1 − ε).
Why does the perturbation have a small effect in this case compare to b)?
3
Extra) Proceed similarly as b) using x1 = (0, 0) and x
ε
1 = (ε, ε). Why is the perturbation
having a small effect? In general, from a random perturbation, do you expect a
small or large effect when x1 = (0, 0)?
4