# DDA 4230: Reinforcement learning Assignment 1 solution

\$25.00

Original Work ?

5/5 - (1 vote)

## Problem 1: Markov Decision Process [20 pts.]

Figure 1: MDP for Problem 1. States are represented by circles and actions by hexagons. p, q denotes the
transition probability and p, q ∈ [0, 1]. The reward is 10 for state s3, 1 for state s2 and 0 otherwise.

For this question, consider the infinite horizon (where time t = ∞) MDP represented by Figure 1 with
discount factor γ ∈ (0, 1).

1. List all the possible policies. [4 pts.]

2. Show the equations representing the optimal value functions for all states,
including V

(s0), V ∗
(s1), V ∗
(s2) and V

(s3). [6 pts.]
For example, for V

(s0), the representation is:
V

(s0) = max
a∈{a1,a2}
0 + γ
X
s

P (s

| s, a) V

(s

) = γ max {V

(s1), V ∗
(s2)} (1)
1

3. Is there a value for p such that for all γ ∈ (0, 1) and q ∈ [0, 1], π∗
(s0) = a2?
Explain. [5 pts.]

4. Is there a value for q such that for all γ ∈ (0, 1) and p ∈ [0, 1], π∗
(s0) = a1?
Explain. [5 pts.]

## Problem 2: Fixed Point [25 pts.]

Recall from lecture that the optimal value function can be written as:
V

(st) = max
a
E [r(st, a) + γV ∗
(st+1)|at = a] .

Let r(s, a) denotes the reward received of choosing action a on state s and P(s

|s, a) be the probability of
transiting to state s
′ when choosing action a on state s.

The bellman operator can be defined as B : R
S −→
R
S, where
(BV )(s) = max
a
r(s, a) + γ
X
s

P(s

|s, a)V (s

).

In this problem, we are going to show a few properties about the bellman operator. We’ll see that if we
know the transition function P(s

|s, a) and the reward function r(s, a), then repeating this operator on our
value function helps recover the optimal value function, in other words, we want to show that value iteration
will converge to a unique fixed point V regardless of the starting point.

An element V is a fixed point for
an operator B (in this case the Bellman operator) if performance of B on V returns V , i.e., BV = V .

1. Contraction Property [5 pts.]
Prove that the Bellman operator B is a contraction operator for γ ∈ (0, 1) with respect to the infinity norm
∥ · ∥∞ (you may want to search up the definition for this). Specifically, we want to prove that
∥BV − BV ′
∥∞ ≤ γ∥V − V

∥∞ (2)
for any two value functions V and V

, meaning if we apply it to two different value functions, the distance
between value functions (in the ∞ norm) shrinks after application of the operator to each element.

According to the above contraction property, there are some helpful lead-up proofs for you to obtain final
proof of the fixed point (i.e., optimal value function).

2.1. Let’s define ∥Vn+1 − Vn∥∞ = ∥BVn − BVn−1∥∞, please prove by induction that ∥Vn+1 −
Vn∥∞ ≤ γ
n∥V1 − V0∥∞.
2.2 Prove that for any c > 0, ∥Vn+c − Vn∥∞ ≤
γ
n
1−γ
∥V1 − V0∥∞

3. Fixed Point [5 pts.]

A Cauchy sequence is a sequence whose elements become arbitrarily close to each other as the sequence
progresses. Formally a sequence {an} in metric space X with distance metric d is a Cauchy sequence if given
an ϵ > 0 there exists k such that if m, n > k then d(am, an) < ϵ. Real Cauchy sequences are convergent.
Using this information about Cauchy sequences, argue that the sequence V0, V1, … is a Cauchy sequence and
is therefore convergent and must converge to some element V and this V is a fixed point.

4. Uniqueness [5 pts.]
Show that this fixed point is unique.

## Problem 3: Practice of bandit problem [55 pts.]

Consider a two-armed bandit problem, where each arm’s distribution is Bernoulli. Consider the following
three problem variants, with respective Bernoulli distribution parameters specified for each arm:
Problem Arm 1 Arm 2
P1 0.8 0.6
P2 0.8 0.7
P3 0.55 0.45
P4 0.5 0.5

Write a Python program to simulate each of the above bandit problems. Specifically, for each problem,
you need to do:
1. Choose the horizon n as 10000.
2. For each algorithm, repeat the experiment 100 times.
3. Store the number of times an algorithm plays the optimal arm, for each round t = 1, · · · , n.

4. Store the regret in each round t = 1, · · · , n.
5. Plot the percentage of optimal arm played and regret against the rounds t = 1, · · · , n,
6. For each plot, add standard error bars.
Do the above for the following bandit algorithms:

1. Explore-then-Commit (ETC) algorithm [20 pts.]
The explore-then-commit (ETC) algorithm with exploration parameter k chosen optimally so that the gapdependent regret is minimum (this choice for k would require information about underlying gap).

2. ETC algorithm with a heuristic [20 pts.]
The ETC algorithm with a heuristic choice for exploration parameter k. Try different values for k and
summarize your findings, say by tabulating regret for different k.

3. Interpretation and summarization [15 pts.]
Interpret the numerical results and submit your conclusions. In particular, discuss the following:
1. Explain the results obtained for ETC with optimal k and correlate the results to the theoretical findings.

2. Explain the results obtained for ETC with a heuristic choice for k. In particular, how does ETC with
a k that is far from the optimal, perform?

Note that you need to submit: 1) Source code, preferably one that is readable with some comments; 2)
Plots/tabulated results in a document (or you could submit printouts of plots); 3) Discussion of the results –
either hand-written or typed-up.