## Description

## I. Theory exercises

Problem 1.

Consider a finite-horizon MDP. Derive a policy iteration algorithm for this problem.

In particular, provide the policy evaluation and policy improvement steps. Compare the computational requirements of policy iteration to those of the DP algorithm, assuming a horizon of N stages, n states per stage, and m actions in each state. (5 marks)
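A minimal sketch of the kind of algorithm Problem 1 asks for, assuming a small randomly generated MDP (the sizes `N`, `n`, `m` and the tensors `P` and `r` below are illustrative placeholders, not part of the problem statement). Time-dependent policies are evaluated by a backward recursion and improved by a greedy one-step lookahead; this is one possible construction, not the expected solution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 5, 4, 3                             # horizon, states, actions (illustrative)
P = rng.dirichlet(np.ones(n), size=(n, m))    # P[x, a, x']: transition probabilities
r = rng.random((n, m))                        # r[x, a]: one-step rewards

def evaluate(policy):
    """Policy evaluation: backward recursion over the horizon for a
    time-dependent policy of shape (N, n)."""
    J = np.zeros((N + 1, n))                  # terminal values J[N] = 0
    for t in range(N - 1, -1, -1):
        for x in range(n):
            a = policy[t, x]
            J[t, x] = r[x, a] + P[x, a] @ J[t + 1]
    return J

def improve(J):
    """Policy improvement: greedy one-step lookahead at every stage."""
    Q = r[None, :, :] + np.einsum('xay,ty->txa', P, J[1:])
    return Q.argmax(axis=2)                   # shape (N, n)

policy = np.zeros((N, n), dtype=int)
while True:
    J = evaluate(policy)
    new_policy = improve(J)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
```

At convergence the policy is greedy with respect to its own stage values, so the loop terminates after finitely many improvements.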

Problem 2.

Consider an episodic MDP with a finite state space $\mathcal{X} = \{1, \cdots, n\}$, and a special state $0$ as the terminal state. Let $T \in \mathbb{N}$ be the random length of an episode. Consider a deterministic and bounded reward function $r : \mathcal{X} \to \mathbb{R}$, and assume zero reward at the terminal state. For each $x \in \mathcal{X}$, a fixed policy $\pi$ determines a stochastic transition to a subsequent state $x' \in \mathcal{X} \cup \{0\}$ with probability $P(x' \mid x)$. Assume that the policy $\pi$ is proper. We denote by $x_t$ the state at time $t$, where $t = 0, 1, 2, \cdots$.

The cumulative discounted reward over an episode, $R \in \mathbb{R}$, is defined by

$$R = \sum_{t=0}^{T-1} \gamma^t \, r(x_t), \quad \gamma \in (0, 1).$$

Define the value function $J(\cdot)$ and the variance $V(\cdot)$ of the cumulative discounted reward as

$$J(x) = \mathbb{E}[R \mid x_0 = x], \quad V(x) = \mathrm{Var}[R \mid x_0 = x], \quad \forall x \in \mathcal{X}.$$

Also define the second moment $M(\cdot)$ of the cumulative discounted reward as

$$M(x) = \mathbb{E}[R^2 \mid x_0 = x], \quad \forall x \in \mathcal{X}.$$

Answer the following: (3+2 marks)

(a) For all $x \in \mathcal{X}$, express $J(x)$ and $M(x)$ as Bellman equations (excluding the terminal state $0$).

(b) Show that, for all $x \in \mathcal{X}$,

$$V(x) = \psi(x) + \gamma^2 \sum_{x' \in \mathcal{X}} P(x' \mid x) \, V(x'), \quad \text{where}$$

$$\psi(x) = \gamma^2 \left[ \sum_{x' \in \mathcal{X}} P(x' \mid x) \, J(x')^2 - \left( \sum_{x' \in \mathcal{X}} P(x' \mid x) \, J(x') \right)^2 \right].$$
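The recursions in Problem 2 can be sanity-checked numerically (this is an illustration, not a proof). The sketch below uses a small proper chain chosen arbitrarily: it solves the linear equations for $J$ and $V$, then compares $V$ against a Monte Carlo estimate of the return variance. States $1, \ldots, n$ are 0-indexed and the terminal state is encoded as index `n`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 3, 0.9
# Rows: current state; columns 0..n-1: next state in X; last column: terminal.
P = np.array([[0.5, 0.2, 0.1, 0.2],
              [0.1, 0.4, 0.2, 0.3],
              [0.2, 0.1, 0.3, 0.4]])
r = np.array([1.0, -0.5, 2.0])        # deterministic rewards on X (arbitrary)
Pxx = P[:, :n]                        # transitions staying within X

# J from its Bellman equation, V from the recursion in part (b).
J = np.linalg.solve(np.eye(n) - gamma * Pxx, r)
EJ = Pxx @ J                          # E[J(x') | x]; the terminal state contributes 0
psi = gamma**2 * (Pxx @ J**2 - EJ**2)
V = np.linalg.solve(np.eye(n) - gamma**2 * Pxx, psi)

def episode(x):
    """Simulate one episode and return its discounted reward R."""
    total, disc = 0.0, 1.0
    while x < n:                      # index n encodes the terminal state
        total += disc * r[x]
        disc *= gamma
        x = rng.choice(n + 1, p=P[x])
    return total

samples = [episode(0) for _ in range(50000)]
print(V[0], np.var(samples))          # the two should roughly agree
```

Since the policy is proper, every episode terminates with probability one, so the simulation loop always exits.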

Problem 3.

Let $\mathbb{R}^d$ be the $d$-dimensional Euclidean space with supremum norm $\|x\|_\infty = \max_{1 \le i \le d} |x_i|$, and with ordering $x \preceq y$ if $x_i \le y_i$ for all $1 \le i \le d$. Let $I_d$ be the $d$-dimensional vector with all elements equal to one.

Prove the following: (1+1+3 marks)

(a) Let $f : \mathbb{R}^d \to \mathbb{R}^d$ be a monotone contraction mapping with respect to the supremum norm with contraction factor $\beta$, and let $a$ be a scalar. Then $x \preceq y + a I_d$ implies $f(x) \preceq f(y) + \beta |a| I_d$.

(b) Let $f : \mathbb{R}^d \to \mathbb{R}^d$ be a mapping with the property that $x \preceq y + a I_d$ implies $f(x) \preceq f(y) + \beta |a| I_d$, for all scalars $a$ and for some $0 \le \beta < 1$. Then, with respect to the supremum norm, $f$ is a monotone contraction with contraction factor $\beta$.

(c) Let $f : \mathbb{R}^d \to \mathbb{R}^d$ be a monotone contraction mapping with respect to the supremum norm with contraction factor $\beta$ and fixed point $x^*$. Then

$$x - \frac{1}{1-\beta}\|f(x) - x\|_\infty I_d \;\preceq\; f(x) - \frac{\beta}{1-\beta}\|f(x) - x\|_\infty I_d \;\preceq\; x^* \;\preceq\; f(x) + \frac{\beta}{1-\beta}\|f(x) - x\|_\infty I_d \;\preceq\; x + \frac{1}{1-\beta}\|f(x) - x\|_\infty I_d.$$
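Part (c) gives componentwise bounds on the fixed point computable from a single application of $f$. A numerical illustration (not a proof): the affine operator $f(x) = r + \beta P x$ with row-stochastic $P$ is a standard example of a monotone sup-norm contraction with factor $\beta$; all numbers below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
d, beta = 4, 0.8
P = rng.dirichlet(np.ones(d), size=d)     # row-stochastic matrix
r = rng.standard_normal(d)

def f(x):
    """Monotone sup-norm contraction with factor beta."""
    return r + beta * P @ x

x_star = np.linalg.solve(np.eye(d) - beta * P, r)   # exact fixed point

x = rng.standard_normal(d)                # an arbitrary point
err = np.max(np.abs(f(x) - x))            # ||f(x) - x||_inf

lo = f(x) - beta / (1 - beta) * err       # inner lower bound on x*
hi = f(x) + beta / (1 - beta) * err       # inner upper bound on x*
assert np.all(lo <= x_star) and np.all(x_star <= hi)
```

The inner bounds around $f(x)$ are tighter by a factor $\beta$ than the outer bounds around $x$, which is what makes one extra application of $f$ worthwhile.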

## II. Simulation exercises

The programming component of this assignment is available on the AIcrowd platform:

https://www.aicrowd.com/challenges/iit-m-2021-assignment-2.

The total marks for this component are 20, and the grading will be done through the AIcrowd interface.

The instructions on how to use the AIcrowd interface are available at

https://wiki.aicrowd.com/share/2e92c0cb-870a-4d56-b8dc-959a9723da54.