## Description

Problem 1.

At each time period, an animal chooses in which of 3 patches of land to forage. In patch 1, the risk of predation is 0, the probability of finding food is 0, and its energy value is 0. In patch 2, the risk of predation is 0.004 in each period, the probability of finding food is 0.3, and the energy gain is 2 if food is found. In patch 3, the predation risk is 0.02, the probability of finding food is 0.5, and the energy gain is 3. Foraging in any patch uses 1 unit of energy reserve. Energy reserves below 2 indicate death, and the animal's maximum energy capacity is 4 units.

Answer the following: (1+2 marks)

(a) Formulate this patch-selection problem as a finite-horizon MDP with the goal of maximizing the probability of survival of the animal over 3 time periods.

(b) Solve this problem using the DP algorithm and write down the optimal policy for foraging.
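As a hint of how the backward recursion in (b) can be organized, here is a minimal Python sketch of finite-horizon DP for this patch-selection model. Two encoding choices here are assumptions, not part of the statement: death (reserves below 2) is checked after the foraging step and has value 0, and every surviving state at the horizon has terminal value 1.

```python
# Finite-horizon DP for the patch-selection problem (sketch).
# Assumption: survival value 1 at the horizon; death (reserves < 2) has value 0.

# (predation risk, prob. of finding food, energy gain) for patches 1-3
PATCHES = [(0.0, 0.0, 0), (0.004, 0.3, 2), (0.02, 0.5, 3)]
CAP, T = 4, 3  # maximum energy reserve, number of time periods

def solve():
    V = {x: 1.0 for x in range(2, CAP + 1)}  # terminal: alive with prob. 1
    policy = []                              # policy[t][x] = patch to forage in
    for _ in range(T):
        newV, pol = {}, {}
        for x in range(2, CAP + 1):
            best = (-1.0, None)
            for patch, (pred, pfood, gain) in enumerate(PATCHES, start=1):
                # survive predation, spend 1 unit, maybe gain food (capped at CAP)
                v_food = V.get(min(x - 1 + gain, CAP), 0.0)
                v_nofood = V.get(x - 1, 0.0)  # 0.0 if reserves drop below 2
                val = (1 - pred) * (pfood * v_food + (1 - pfood) * v_nofood)
                best = max(best, (val, patch))
            newV[x], pol[x] = best
        V = newV
        policy.insert(0, pol)
    return V, policy
```

For instance, in the last period an animal with reserves of 2 must gamble on patch 3 (patch 1 starves it, and patch 3's survival probability 0.98 × 0.5 beats patch 2's 0.996 × 0.3), while an animal with full reserves can rest safely in patch 1.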

Problem 2.

A farmer produces $x_k$ units of crops. He/she stores $(1 - u_k)x_k$ units of the crop production, where $0 \le u_k \le 1$, and invests the remaining $u_k x_k$ units to get more crops for the next year. The next year's production $x_{k+1}$ is given by

$$x_{k+1} = x_k + w_k u_k x_k, \qquad k = 0, 1, \ldots, N-1,$$

where the scalars $w_k$ are independent, identically distributed random variables whose probability distribution depends on neither $x_k$ nor $u_k$. Furthermore, $E\{w_k\} = \bar{w} > 0$. The problem is to find the optimal investment policy that maximizes the total expected crops stored over $N$ years:

$$E_{w_0, w_1, \ldots, w_{N-1}}\left\{ x_N + \sum_{k=0}^{N-1} (1 - u_k) x_k \right\}.$$

Answer the following: (1.5+2.5 marks)

(a) Formulate this problem as an MDP.

(b) Characterize the optimal policy as best as you can.
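For intuition on part (b): one can verify by backward induction that the value function is linear in $x_k$, say $V_k(x) = A_k x$, so the maximization over $u_k$ is linear in $u_k$ and attained at an endpoint, $u_k \in \{0, 1\}$. The sketch below is an illustration under that assumption (not the required derivation): it computes the coefficients $A_k$ and the resulting bang-bang policy for a given $\bar{w}$.

```python
# Backward induction for the farmer problem, assuming V_k(x) = A_k * x.
# Since dynamics and reward are linear in x and u, the best u_k is 0 or 1:
#   A_k = max( 1 + A_{k+1},            # u_k = 0: store everything
#              A_{k+1} * (1 + wbar) )  # u_k = 1: invest everything

def solve(N, wbar):
    A = 1.0          # A_N = 1: terminal term x_N
    policy = []      # policy[k] = optimal u_k
    for _ in range(N):
        store = 1.0 + A
        invest = A * (1.0 + wbar)
        if invest > store:
            policy.insert(0, 1.0)
            A = invest
        else:
            policy.insert(0, 0.0)
            A = store
    return A, policy
```

For example, with $\bar{w} = 0.5$ and $N = 3$ this yields the policy $(1, 0, 0)$: invest early while the remaining horizon is long enough for the investment to compound, and store in later years.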

Problem 3.

Consider an SSP with optimal expected cost $J^*(s)$ and an optimal policy $\pi^*$.

Answer the following: (2+2 marks)

(a) A new action $a'$ becomes available in state $s'$. How can you determine whether $\pi^*$ is still optimal in the modified SSP without re-solving the problem? If it is not, how can you find a new optimal policy?

(b) Suppose action $a^*$ is optimal in state $s^*$, that is, $\pi^*(s^*) = a^*$, and you find that the return in state $s^*$ under action $a^*$ decreases by $\Delta$. Provide an efficient way for determining whether $\pi^*$ is still optimal and, if not, for finding a new optimal policy and its corresponding expected cost.

Problem 4.

After a long disagreement, Kevin, Kanan, and Kannan agree to enter into a three-way duel. The three shooters Kevin, Kanan, and Kannan have shooting accuracy probabilities $0 < \alpha < \beta < \gamma \le 1$, respectively. Because of this disparity, the shooters agree that Kevin shall shoot first, followed by Kanan, followed by Kannan, this sequential rotation continuing until only one man (the winner) remains standing. When all three men are standing, the active shooter must decide whom to shoot at, and he can only shoot at one opponent at a time. Kevin is allowed to miss intentionally (i.e., by shooting into the air); Kanan and Kannan are not allowed to miss intentionally.

Every shooter wants to maximize his probability of winning the game. The winner gets a reward of 1, and in all other cases the reward is 0.

Now, consider you are Kevin, and you want to maximize your probability of winning the game.

Answer the following: (2+2+1 marks)

(a) Model your problem as a Markov decision process. Write down the state space, the action space, the transition probabilities, and the single-stage rewards.

(b) Write down the Bellman optimality equations.

(c) Let $\alpha = 0.3$, $\beta = 0.5$, and $\gamma = 0.6$. Give the optimal values and the optimal policies for each state.
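One building block you will likely need for part (c): once only two shooters remain, the duel is an alternating Bernoulli game with a closed-form value. If the player shooting next has accuracy $p$ and his opponent has accuracy $q$, his win probability $P$ satisfies $P = p + (1-p)(1-q)P$, giving $P = p/(p + q - pq)$. A small sanity check in Python (the helper name is ours):

```python
def first_shooter_win_prob(p, q):
    """Win probability of the shooter with accuracy p who shoots next
    in a two-person duel against an opponent with accuracy q.
    Derivation: P = p + (1 - p) * (1 - q) * P."""
    return p / (1.0 - (1.0 - p) * (1.0 - q))

# e.g. Kanan (beta = 0.5) shooting first against Kannan (gamma = 0.6)
print(first_shooter_win_prob(0.5, 0.6))  # 0.625
```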

Problem 5.

Consider an SSP problem where all policies are proper, i.e., the Bellman optimality operator satisfies

$$\|TJ - TJ'\|_\xi \le \rho \|J - J'\|_\xi, \qquad \forall J, J' \in \mathbb{R}^n,$$

for some $0 < \rho < 1$.

Consider the following variant of the value iteration algorithm, which is given an input parameter $\epsilon > 0$. This algorithm, say $\epsilon$-VI, terminates after $m$ iterations, where $m$ satisfies

$$\|J_{m+1} - J_m\|_\xi < \epsilon \, \frac{1 - \rho}{2\rho}.$$

In the above, $J_{k+1} = T J_k$, $\forall k \ge 0$, and $J_0$ is initialized arbitrarily. Note that the $\epsilon$-VI algorithm assumes knowledge of $\rho$ and $\xi$.
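As a concrete illustration of the stopping rule (separate from what you are asked to prove), the sketch below runs $\epsilon$-VI on a small discounted MDP, where $T$ is a $\rho$-contraction in the plain sup-norm (i.e., $\xi \equiv 1$, with the discount factor playing the role of $\rho$). The particular transition probabilities and costs are made up for the demonstration.

```python
# epsilon-VI on a toy 2-state, 2-action discounted MDP (sketch).
# Here T is a rho-contraction in the sup-norm, with rho = discount factor.

RHO = 0.9
P = [[[0.8, 0.2], [0.3, 0.7]],   # P[s][a][s'] transition probabilities
     [[0.5, 0.5], [0.9, 0.1]]]
C = [[1.0, 2.0], [0.5, 3.0]]     # C[s][a] one-stage costs

def bellman(J):
    """Apply the optimality operator T (minimization over actions)."""
    return [min(C[s][a] + RHO * sum(P[s][a][t] * J[t] for t in range(len(J)))
                for a in range(len(C[s])))
            for s in range(len(J))]

def epsilon_vi(eps):
    """Iterate J_{k+1} = T J_k until ||J_{m+1} - J_m|| < eps*(1-rho)/(2*rho)."""
    J = [0.0, 0.0]
    while True:
        Jn = bellman(J)
        if max(abs(a - b) for a, b in zip(Jn, J)) < eps * (1 - RHO) / (2 * RHO):
            return Jn
        J = Jn
```

By the bound you are asked to establish in part (a), the returned $J_{m+1}$ is within $\epsilon/2$ of $J^*$ in this norm.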

Answer the following: (3+2 marks)

(a) Recall that $J_{m+1}$ is what we obtain after $\epsilon$-VI terminates. Show that

$$\|J_{m+1} - J^*\|_\xi < \frac{\epsilon}{2}.$$

(b) Let $\pi_\epsilon$ be the policy that we obtain from $J_{m+1}$ as follows:

$$T_{\pi_\epsilon} J_{m+1} = T J_{m+1}.$$

Let $J_{\pi_\epsilon}$ denote the expected cost of the policy $\pi_\epsilon$. Show that

$$\|J_{\pi_\epsilon} - J^*\|_\xi < \epsilon.$$