# Assignment 1 Solution: Reinforcement Learning (CS6700)


Problem 1.
At each time period, an animal chooses in which of 3 patches of land to forage. In patch 1,
the risk of predation is 0, the probability of finding food is 0, and the energy gain is 0. In
patch 2, the risk of predation is 0.004 in each period, the probability of finding food is 0.3,
and the energy gain is 2 if food is found. In patch 3, the predation risk is 0.02, the probability
of finding food is 0.5, and the energy gain is 3. Foraging in any patch uses 1 unit of energy
reserve. Energy reserves below 2 indicate death, and the animal's maximum energy capacity
is 4 units.

(a) Formulate this patch selection as a finite horizon MDP with the goal of maximizing probability of survival of the animal over 3 time periods.

(b) Solve this problem using the DP algorithm and write down the optimal policy for foraging.
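The DP in part (b) can be sketched in a few lines. The following is a minimal illustration under one reading of the model (energy levels 0 to 4, death when reserves fall below 2, probability of being alive as terminal reward), not necessarily the intended formulation:

```python
# Finite-horizon DP sketch for the patch-selection problem. Assumed model:
# energy levels 0..4, death when reserves fall below 2, foraging costs 1 unit.
# Each patch: (predation risk, probability of finding food, energy gain).
PATCHES = [(0.0, 0.0, 0), (0.004, 0.3, 2), (0.02, 0.5, 3)]
CAP, DEATH_BELOW, T = 4, 2, 3

def survival_dp():
    # J[x] = probability of surviving all remaining periods from energy x
    J = [1.0 if x >= DEATH_BELOW else 0.0 for x in range(CAP + 1)]
    policy = []  # policy[t][x] = optimal patch (1-indexed) at period t
    for _ in range(T):
        newJ = [0.0] * (CAP + 1)
        pi = [None] * (CAP + 1)
        for x in range(DEATH_BELOW, CAP + 1):
            best, arg = -1.0, None
            for i, (risk, p_food, gain) in enumerate(PATCHES, start=1):
                up = min(x - 1 + gain, CAP)   # energy after finding food
                down = x - 1                  # energy after finding none
                # survive predation, then weigh the two food outcomes
                val = (1 - risk) * (p_food * J[up] + (1 - p_food) * J[down])
                if val > best:
                    best, arg = val, i
            newJ[x], pi[x] = best, arg
        J = newJ
        policy.insert(0, pi)  # the backward pass fills periods in reverse
    return J, policy
```

Running `survival_dp()` returns the 3-period survival probabilities per starting energy level together with a per-period policy table.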

Problem 2.
A farmer produces x_k units of crops. He/she stores (1 − u_k) x_k units of the crop production,
where 0 ≤ u_k ≤ 1, and invests the remaining u_k x_k units to get more crops for the next year.

The next year's production x_{k+1} is given by

x_{k+1} = x_k + w_k u_k x_k,  k = 0, 1, …, N − 1,

where the scalars w_k are independent, identically distributed random variables whose distribution
depends on neither x_k nor u_k. Furthermore, E{w_k} = w̄ > 0. The problem is to find the
optimal investment policy that maximizes the total expected crops stored over N years:

E_{w_0, w_1, …, w_{N−1}} { x_N + Σ_{k=0}^{N−1} (1 − u_k) x_k }.

(a) Formulate this problem as an MDP.
(b) Characterize the optimal policy as best as you can.
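For part (b), the usual starting point is the backward recursion; the constants A_k below are introduced here only as an illustrative inductive guess, in the spirit of the classic treatment of this problem:

```latex
\begin{align*}
J_N(x) &= x, \\
J_k(x) &= \max_{0 \le u \le 1} \Big[ (1-u)x
          + \mathbb{E}\big[ J_{k+1}(x + w_k u x) \big] \Big].
\end{align*}
% Guessing J_k(x) = A_k x (linear in x, verified by induction) with A_N = 1:
%   A_k = \max_{0 \le u \le 1} \big[ (1-u) + A_{k+1}(1 + \bar{w}\,u) \big],
% which is linear in u, so the maximum is attained at an endpoint:
%   u_k = 1 (invest everything)  if A_{k+1}\bar{w} > 1,
%   u_k = 0 (store everything)   if A_{k+1}\bar{w} < 1.
```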

Problem 3.
Consider an SSP with optimal expected cost J*(s) and an optimal policy π*.

(a) A new action a′ becomes available in state s′. How can you determine whether π* is still
optimal in the modified SSP without re-solving the problem? If it is not, how can you
find a new optimal policy?
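One standard way to approach a check of this kind is to compare the new action's Q-value under the old optimal costs against the old value of the state. The function below is an illustrative sketch only; the names `J_star`, `c_new`, `P_new` and the use of NumPy are assumptions, not part of the assignment:

```python
import numpy as np

def new_action_keeps_pi_optimal(J_star, s_new, c_new, P_new, tol=1e-12):
    """Check whether the old optimal policy survives a new action a' in s'.

    J_star : optimal costs of the original SSP (one entry per state)
    s_new  : index of the state s' gaining the new action
    c_new  : expected one-stage cost of the new action a' in s'
    P_new  : transition probabilities out of s' under a'
    """
    q_new = c_new + P_new @ J_star  # Q(s', a') evaluated with the old J*
    # If the new action does not strictly improve on J*(s'), the old
    # Bellman optimality equation still holds, so pi* remains optimal.
    return q_new >= J_star[s_new] - tol
```

If the check fails, J* is no longer a fixed point of the modified Bellman operator, and one can warm-start policy iteration from π* rather than re-solving from scratch.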

(b) Suppose action a* is optimal in state s*, that is, π*(s*) = a*, and you find that the return in
state s* under action a* decreases by ∆. Provide an efficient way of determining whether
π* is still optimal and, if not, of finding a new optimal policy and its corresponding
expected cost.

Problem 4.
After a long disagreement, Kevin, Kanan, and Kannan agree to enter into a three-way
duel. The three shooters, Kevin, Kanan, and Kannan have shooting accuracy probabilities
0 < α < β < γ ≤ 1 respectively. Because of this disparity the shooters agree that Kevin
shall shoot first, followed by Kanan, followed by Kannan, this sequential rotation continuing
until only one man (the winner) remains standing.

When all three men are standing, the
active shooter must decide whom to shoot at, and he can only shoot at one opponent at
a time. Kevin is allowed to miss intentionally (i.e., by shooting into the air). Kanan and
Kannan are not allowed to miss intentionally.

Every shooter wants to maximize his probability of winning the game. The winner gets a reward of 1; in all other cases the reward is 0.
Now suppose you are Kevin, and you want to maximize your probability of winning the
game.

(a) Model your problem as a Markov decision process. Write down the state space, the action
space, the transition probabilities, and the single stage rewards.

(b) Write down the Bellman optimality equations.

(c) Let α = 0.3, β = 0.5, and γ = 0.6. Give the optimal values and the optimal policies for
each state.
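A handy building block for part (c) is the two-person duel that remains after one shooter falls: the shooter about to fire wins with a probability given by a one-line geometric recursion, W = p + (1 − p)(1 − q)W. The function below is an illustrative sketch (the name and interface are assumptions):

```python
def duel_win_prob(p, q):
    """Probability that the shooter about to fire (accuracy p) wins a
    two-person duel against an opponent with accuracy q, when both must
    shoot to kill on every turn.

    From W = p + (1 - p)(1 - q) W  =>  W = p / (p + q - p*q).
    """
    return p / (p + q - p * q)
```

For instance, `duel_win_prob(0.3, 0.6)` gives Kevin's chance of beating Kannan when Kevin fires first; these subgame values feed into the Bellman equations for the three-man states.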

Problem 5.
Consider an SSP problem where all policies are proper, i.e., the Bellman optimality operator
satisfies

||T J − T J′||_ξ ≤ ρ ||J − J′||_ξ,  ∀ J, J′ ∈ R^n,

for some 0 < ρ < 1.

Consider the following variant of the value iteration algorithm, which is given an input parameter e > 0. This algorithm, say e-VI, terminates after m iterations, where m satisfies

||J_{m+1} − J_m||_ξ < e(1 − ρ)/(2ρ).

In the above, J_{k+1} = T J_k for all k ≥ 1, and J_0 is initialized arbitrarily. Note that the e-VI algorithm
assumes knowledge of ρ and ξ.

(a) Recall that J_{m+1} is what we obtain after e-VI terminates. Show that

||J_{m+1} − J*||_ξ < e/2.

(b) Let π_e be the policy that we obtain from J_{m+1} as follows:

T_{π_e} J_{m+1} = T J_{m+1}.

Let J_{π_e} denote the expected cost of the policy π_e. Show that

||J_{π_e} − J*||_ξ < e.