AIGS/CSED515 Assignment 5: Generative Model and Reinforcement Learning solution


1. [VAE] We want to maximize the log-likelihood log pθ(x) of a model pθ(x) parameterized by θ. To this end, we introduce a joint distribution pθ(x, z) and an approximate posterior qφ(z | x) and reformulate the log-likelihood as:

\log p_\theta(x) = \log \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \qquad (1)
(a) [3pt] Obtain a lower bound on the log-likelihood in (1) using Jensen’s inequality, and decompose the bound into two parts, one of which must be the following KL divergence:

\mathrm{KL}\left(q_\phi(z \mid x),\, p_\theta(z)\right) \qquad (2)
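For orientation, the following is the standard derivation this part is asking for, stated only as a reminder of the target form (it uses nothing beyond the definitions above): applying Jensen’s inequality to (1) gives the evidence lower bound, which splits into a reconstruction term and the KL term in (2) via pθ(x, z) = pθ(x | z) pθ(z):

\log p_\theta(x) \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \;=\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] \;-\; \mathrm{KL}\left(q_\phi(z \mid x),\, p_\theta(z)\right).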
(b) [4pt] Let qφ(z | x) = N(µz|x, σ²z|x), where µz|x ∈ R and σz|x are the outputs of a neural network parameterized by φ given x. Also, let pθ(z) = N(µ, σ²) for given µ ∈ R and σ². Obtain the KL divergence in a closed form with µz|x, σz|x, µ and σ². When does the KL divergence become zero?
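For reference, the closed form between two univariate Gaussians N(µ1, σ1²) and N(µ2, σ2²) is a standard identity (not specific to this assignment); specializing µ1, σ1 to µz|x, σz|x and µ2, σ2 to µ, σ gives the requested expression:

\mathrm{KL}\left(\mathcal{N}(\mu_1, \sigma_1^2),\, \mathcal{N}(\mu_2, \sigma_2^2)\right) \;=\; \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2},

which is zero exactly when µ1 = µ2 and σ1 = σ2.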
(c) [3pt] Let pθ(x | z) = N(µx|z, σ²x|z), where µx|z and σx|z are the outputs of a neural network parameterized by θ given z. VAE is trained by gradient ascent to maximize the lower bound obtained in part (a) instead of (1). To perform back-propagation, we need to compute

\nabla_\phi\, \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \qquad (3)

It is intractable to compute this gradient exactly. In addition, even if we approximate the expectation via sampling, i.e.,

\nabla_\phi\, \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \;\approx\; \frac{1}{J}\sum_{j=1}^{J} \nabla_\phi \log \frac{p_\theta(x, z^{(j)})}{q_\phi(z^{(j)} \mid x)}

with z^{(j)} ∼ qφ(z^{(j)} | x), it is still intractable to take the gradient through the sampled z^{(j)}’s directly. To bypass this challenge, we can use the reparameterization trick: represent the random variable z ∼ N(µz|x, σ²z|x) as a function of µz|x, σ²z|x, and ε ∼ N(0, 1). Describe how to approximate (3) via sampling and this reparameterization.
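A reference form of the trick, sketched here as the standard construction rather than asserted as the intended solution: writing z = µz|x + σz|x ε with ε ∼ N(0, 1) moves the φ-dependence out of the sampling distribution, so (3) can be approximated as

\nabla_\phi\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,1)}\left[\log \frac{p_\theta(x,\, \mu_{z|x} + \sigma_{z|x}\varepsilon)}{q_\phi(\mu_{z|x} + \sigma_{z|x}\varepsilon \mid x)}\right] \;\approx\; \frac{1}{J}\sum_{j=1}^{J} \nabla_\phi \log \frac{p_\theta(x, z^{(j)})}{q_\phi(z^{(j)} \mid x)}, \qquad z^{(j)} = \mu_{z|x} + \sigma_{z|x}\,\varepsilon^{(j)},\ \varepsilon^{(j)} \sim \mathcal{N}(0, 1),

where the gradient now passes through µz|x and σz|x, which are deterministic functions of φ.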
(d) [2pt] Complete the reparameterize function in VAE.py.
def reparameterize(self, mu, logvar):
    std = torch.exp(…fill this…)
    eps = torch.randn_like(…fill this…)
    return eps.mul(std).add_(mu)
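As general background (not the graded fill-in for VAE.py), a self-contained reparameterization step in PyTorch typically looks like the sketch below; the helper name, tensor shapes, and the use of log-variance are assumptions chosen to match the signature above.

import torch

def reparameterize_example(mu, logvar):
    # Illustrative, hypothetical helper: given the mean and log-variance of
    # q_phi(z | x), draw z = mu + sigma * eps with eps ~ N(0, 1).
    std = torch.exp(0.5 * logvar)    # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(std)      # eps ~ N(0, 1), same shape and device as std
    return mu + eps * std            # differentiable with respect to mu and logvar

# Usage with a batch of four 2-dimensional latents (shapes are assumptions):
mu = torch.zeros(4, 2)
logvar = torch.zeros(4, 2)
z = reparameterize_example(mu, logvar)   # tensor of shape (4, 2)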
2. [MDP] Consider an MDP, which may give a short lesson on how we live or how we face this final exam, with four states {−1, 0, 1, 2}, at each of which two actions (+: try, −: give-up) are available, and the reward and state transition have no randomness. The following figure summarizes the reward function and state transitions:
[Figure: the four states −1, 0, 1, 2 arranged in a row, with one arrow per (state, action) pair; one arrow style marks the consequence of taking action + (try), the other the consequence of taking action − (give-up); the labels R(s, a) = 1 and R(s, a) = 3 mark the two rewarded transitions.]
The reward r(s, a) of taking action a at state s is non-zero only for (s, a) ∈ {(−1, −), (2, +)}, where r(−1, −) = 1 and r(2, +) = 3. The next state when taking action a at state s is the state to which the corresponding arrowhead in the figure points. (A generic value-iteration sketch is given after part (c) below.)
(a) [4pt] For discount factor γ = 0.10, find the optimal policy π*_short and compute the optimal value function V*(s; γ). Justify your answer.
(b) [4pt] For discount factor γ = 0.99, find the optimal policy π*_long and compute the optimal value function V*(s; γ). Justify your answer.
(c) [3pt] Compute the value function of π*_short in Problem 2a for discount factor γ = 0.99, i.e., compute V^{π*_short}(s; γ = 0.99), and discuss the impact of γ.
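As referenced above, here is a minimal value-iteration sketch for a deterministic finite MDP of this kind. It is a generic aid, not the graded solution: the state and action names and the two non-zero rewards come from the problem text, while the transition table NEXT is deliberately left as a placeholder because the arrow targets live in the figure and are not reproduced here.

# Minimal value-iteration sketch for a deterministic finite MDP (illustration only).
STATES = [-1, 0, 1, 2]
ACTIONS = ["+", "-"]                       # "+": try, "-": give-up

# Rewards as stated in the problem text: non-zero only at (-1, "-") and (2, "+").
REWARD = {(-1, "-"): 1.0, (2, "+"): 3.0}

# Deterministic next states: placeholder, to be filled in from the figure's arrows.
# Hypothetical example entry: NEXT[(0, "+")] = 1
NEXT = {}

def value_iteration(gamma, n_iters=2000):
    """Apply the Bellman optimality update V(s) <- max_a [r(s, a) + gamma * V(s')]."""
    V = {s: 0.0 for s in STATES}
    for _ in range(n_iters):
        V = {s: max(REWARD.get((s, a), 0.0) + gamma * V[NEXT[(s, a)]]
                    for a in ACTIONS)
             for s in STATES}
    return V

def greedy_policy(V, gamma):
    """Read off an optimal action for each state from the converged values."""
    return {s: max(ACTIONS,
                   key=lambda a: REWARD.get((s, a), 0.0) + gamma * V[NEXT[(s, a)]])
            for s in STATES}

# Once NEXT is filled in from the figure, parts (a) and (b) correspond to
# value_iteration(0.10) and value_iteration(0.99), with greedy_policy giving pi*.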