# Assignment 2 Course: Reinforcement Learning (CS6700) solution

\$25.00

Original Work ?
Category: You will Instantly receive a download link for .ZIP solution file upon Payment

5/5 - (1 vote)

## I. Theory exercises

Problem 1.
Consider an MDP with a finite-horizon. For this problem, derive a policy iteration algorithm.
In particular, provide the policy evaluation and policy improvement steps. Compare the computational requirements of policy iteration to that of DP algorithm, assuming horizon N, n
states per stage, and m actions in each state. (5 marks)

Problem 2.
Consider an episodic MDP with a finite state space X ∈ {1, · · · , n}, and a special state 0 as
the terminal state. Let T ∈ N be the random length of an episode. Consider a deterministic
and bounded reward function r : X → R, and assume zero reward at the terminal state.

For each x ∈ X , a fixed policy π determines a stochastic transition to a subsequent state
x
0 ∈ {X ∪ 0} with probability P(x
0
| x). Assume that the policy π is proper. We denote by
xt
the state at time t, where t = 0, 1, 2, · · · .

The cumulative discounted reward over an episode be R ∈ R is defined by
R =
T−1

t=0
γ
t
r(xt), γ ∈ (0, 1).

Define the value function J(·), and the variance V(·) of the cumulative discounted reward as
J(x) = E[R | x0 = x], V(x) = Var[R | x0 = x], ∀x ∈ X .

Also define the second moment M(·) of the cumulative discounted reward as
M(x) = E[R
2
| x0 = x], ∀x ∈ X .

Answer the following: (3+2 marks)
(a) ∀x ∈ X, express J(x) and M(x) in terms of Bellman equation (exclude the terminal state
0).

(b) Show that
∀x ∈ X, V(x) = ψ(x) + γ
2 ∑
x
0∈X
P(x
0
| x)V(x
0
), where
ψ(x) = γ
2

 ∑
x
0∈X
P(x
0
| x)J(x
0
)
2 −

x
0∈X
P(x
0
| x)J(x
0
)
!2

 .

Problem 3.
Let Rd be a d-dimensional Euclidean space with supremum norm kxk∞ = max
1≤i≤d
|xi
|, and
with ordering x  y if xi ≤ yi
, for all 1 ≤ i ≤ d. Let Id be a d-dimensional vector with all
elements equal to one.

Prove the following: (1+1+3 marks)
(a) Let f : Rd → Rd be a monotone contraction mapping with respect to the supremum
norm with contraction factor β, and let a be a scalar. Then, x  y + aId
implies f(x) 
f(y) + β|a|Id
.

(b) Let f : Rd → Rd be a mapping with property x  y + aId
implies f(x)  f(y) +
β|a|Id
, for all scalar a, and for some 0 ≤ β < 1. Then with respect to the supremum
norm, f is a monotone contraction with contraction factor β.

(c) Let f : Rd → Rd be a monotone contraction mapping with respect to the supremum
norm with contraction factor β, and fixed point x

. Then
x −
1
1 − β
k f(x) − xk∞ Id  f(x) −
β
1 − β
k f(x) − xk∞ Id  x

 f(x) + β
1 − β
k f(x) − xk∞ Id  x +
1
1 − β
k f(x) − xk∞ Id
.

II. Simulation exercises
The programming component of this assignment is available at AIcrowd platform:
https://www.aicrowd.com/challenges/iit-m-2021-assignment-2.
The total marks for this component is 20, and the grading will be done through the AIcrowd
interface.
The instructions on how to use the AIcrowd interface are available at
https://wiki.aicrowd.com/share/2e92c0cb-870a-4d56-b8dc-959a9723da54.