IN4320 Exercise 6: Reinforcement Learning solution


In this exercise you will apply various basic reinforcement learning methods to a toy example. Exercises
1-5 can be done based on the lecture slides; for the other questions, either use the resources provided
on Blackboard or literature you found on your own. Cite all the references you employed.
Exercise 1 (10 points) Prove that the return $R_t = \sum_{h=0}^{\infty} \gamma^h r_{t+h+1}$ is bounded for 0 ≤ γ < 1 and
for bounded rewards −10 ≤ r_{t+h+1} ≤ 10 ∀ h ∈ {0, 1, . . . , ∞}. Note: 0^0 = 1.
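One possible line of argument, sketched in LaTeX below: bound the absolute value of the return by a geometric series (this is only a sketch of how the proof could go, using the symbols defined above).

```latex
% Sketch: bound |R_t| via the geometric series, assuming |r_{t+h+1}| <= 10 and 0 <= gamma < 1.
\left| R_t \right|
  = \left| \sum_{h=0}^{\infty} \gamma^h r_{t+h+1} \right|
  \le \sum_{h=0}^{\infty} \gamma^h \left| r_{t+h+1} \right|
  \le 10 \sum_{h=0}^{\infty} \gamma^h
  = \frac{10}{1-\gamma} .
```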
Scenario Our robot has found a new job as a cleaning robot. The robot only has the actions “left” and
“right”. It works in a corridor with 6 states; the two end-states are terminal, i.e., the episode
ends immediately once the robot reaches them. The robot gets a reward of 1 when reaching the left state
(charger) and a reward of 5 when reaching the right state (trash bin).
[Figure: corridor with states s = 1, ..., 6 and actions a = −1 (left), a = 1 (right)]
S = {1, 2, 3, 4, 5, 6}    A = {−1, 1} = {left, right}

$$
s_{t+1} =
\begin{cases}
s_t & \text{if } s_t \text{ is terminal } (s_t = 1 \text{ or } s_t = 6) \\
s_t + a_t & \text{otherwise, where } a_t \in A
\end{cases}
$$

$$
r_{t+1} =
\begin{cases}
5 & \text{if } s_{t+1} = 6 \text{ and } s_t \neq 6 \\
1 & \text{if } s_{t+1} = 1 \text{ and } s_t \neq 1 \\
0 & \text{otherwise}
\end{cases}
$$
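For reference in the later exercises, here is a minimal Python sketch of this MDP; the names (STATES, ACTIONS, TERMINAL, step) are my own choices, not prescribed by the exercise.

```python
# Minimal sketch of the corridor MDP from the scenario; these helpers are
# reused in the sketches for the later exercises.
STATES = [1, 2, 3, 4, 5, 6]
ACTIONS = [-1, 1]                      # -1 = left, 1 = right
TERMINAL = {1, 6}

def step(s, a):
    """Return (next_state, reward) for taking action a in state s."""
    if s in TERMINAL:
        return s, 0.0                  # terminal states are absorbing
    s_next = s + a
    if s_next == 6:
        return s_next, 5.0             # reached the trash bin
    if s_next == 1:
        return s_next, 1.0             # reached the charger
    return s_next, 0.0
```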
Exercise 2 (10 points) Implement Q-iteration. For γ = 0.5 the optimal Q-function is given in the
table below. Show the values after each iteration. What is the optimal policy π*?

action \ state      1       2       3       4       5       6
left              0.0     1.0     0.5     0.625   1.25    0.0
right             0.0     0.625   1.25    2.5     5.0     0.0
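A compact Q-iteration sketch that reproduces such a table could look as follows, assuming the step function from the scenario sketch above; the iteration count is an arbitrary choice, since on this small MDP the values settle after a few sweeps.

```python
import numpy as np

def q_iteration(gamma=0.5, n_iter=20):
    """Sketch of synchronous Q-iteration on the deterministic corridor MDP."""
    Q = np.zeros((7, 2))               # index states 1..6, actions 0 = left, 1 = right
    for _ in range(n_iter):
        Q_new = np.zeros_like(Q)
        for s in STATES:
            if s in TERMINAL:
                continue               # Q stays 0 in the terminal states
            for ai, a in enumerate(ACTIONS):
                s_next, r = step(s, a)
                Q_new[s, ai] = r + gamma * Q[s_next].max()
        Q = Q_new
    return Q
```

The greedy policy with respect to the returned Q gives the answer to the question about π*.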
Exercise 3 (10 points) Show the optimal value functions Q* for γ = 0, γ = 0.1, γ = 0.9, and
γ = 1. Discuss the influence of the discount factor γ. Explain why γ = 1 will work here in contrast to
Exercise 1.
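The tables for the different discount factors can be generated by reusing the q_iteration sketch above, for example as follows (the iteration count of 50 is an assumption).

```python
# Hypothetical usage: rerun the Q-iteration sketch for the requested discount factors.
for gamma in [0.0, 0.1, 0.9, 1.0]:
    Q = q_iteration(gamma=gamma, n_iter=50)
    print(f"gamma = {gamma}:")
    print(np.round(Q[1:7].T, 3))       # rows: left, right; columns: states 1..6
```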
Exercise 4 (15 points) Implement Q-learning. For the rest of the exercises use γ = 0.5. Try
different values for the exploration rate ε and the learning rate α. Plot the difference (2-norm) between
the value function estimated by Q-learning and the true value function (table above) over the number
of interactions with the system. Provide plots for different values of ε and α. Describe and explain
the differences in behavior.
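A tabular Q-learning sketch with ε-greedy exploration, tracking the requested 2-norm error per interaction, could look as follows; Q_TRUE is copied from the Exercise 2 table, while the parameter defaults and the start-state distribution (a uniformly chosen non-terminal state) are assumptions to be varied.

```python
import random
import numpy as np

# True Q-function for gamma = 0.5, copied from the Exercise 2 table
# (rows: states 1..6, columns: left, right).
Q_TRUE = np.array([[0.0, 0.0],
                   [1.0, 0.625],
                   [0.5, 1.25],
                   [0.625, 2.5],
                   [1.25, 5.0],
                   [0.0, 0.0]])

def q_learning(gamma=0.5, alpha=0.1, eps=0.1, n_episodes=500, seed=0):
    """Sketch of tabular Q-learning with epsilon-greedy exploration on the corridor MDP."""
    rng = random.Random(seed)
    Q = np.zeros((7, 2))                       # states 1..6, actions 0 = left, 1 = right
    errors = []                                # 2-norm to Q_TRUE after every interaction
    for _ in range(n_episodes):
        s = rng.choice([2, 3, 4, 5])           # assumed start distribution: non-terminal states
        while s not in TERMINAL:
            ai = rng.randrange(2) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r = step(s, ACTIONS[ai])
            td_target = r + gamma * Q[s_next].max()
            Q[s, ai] += alpha * (td_target - Q[s, ai])
            errors.append(np.linalg.norm(Q[1:7] - Q_TRUE))
            s = s_next
    return Q, errors
```

Plotting errors against its index gives the error-vs-interactions curve; repeating the run for several (ε, α) pairs gives the requested comparison plots.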
Exercise 5 (15 points) Now the robot is partially broken and stays in the same state with a
probability of 30%; otherwise it works correctly. Test this scenario with Q-iteration and Q-learning. What
happens? Can you still use the same Q-learning parameters?
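One way to model the partially broken robot is a thin wrapper around the deterministic step function above (the 30% sticking probability comes from the exercise; the function name is mine). Q-learning can sample from it directly, while Q-iteration would additionally have to average its backup over the stick/move outcomes.

```python
def step_broken(s, a, p_stick=0.3, rng=random):
    """Corridor dynamics where the robot stays put with probability p_stick."""
    if s in TERMINAL:
        return s, 0.0
    if rng.random() < p_stick:
        return s, 0.0                  # robot got stuck: no movement, no reward
    return step(s, a)                  # otherwise behave like the intact robot
```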
Exercise 6 (40 points) Somebody tried to fix the robot. At least it is not stopping any longer,
but now we have a different problem: rather than moving s′ = s + a, the robot now moves
s′ = s + a + N(0, 0.01), where N(0, 0.01) is a Gaussian distribution with mean µ = 0 and variance σ² = 0.01.
Now we need to consider continuous states and non-deterministic transitions. The terminal states are
reached for s < 1.5 and s ≥ 5.5, respectively. Implement value function approximation with radial
basis functions (we still have 2 discrete actions) for Q-learning or another algorithm of your choice.
Provide pseudo-code. How many basis functions do you need? Do the widths have an influence on
the performance? What happens to the learning performance compared to the other approaches?
Illustrate with plots and discuss.
Hints: As you cannot use Q-iteration any longer to find the ground truth, you will need to show
how quickly the learning converges and how the performance evolves (e.g., expected return). Start
with 6 basis functions centered at {1, 2, 3, 4, 5, 6} and test your implementation with σ² = 0. Often
normalizing the RBF network¹ is a good idea.
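A possible sketch for this continuous setting, following the hint: normalized Gaussian RBFs centered at {1, ..., 6}, one linear weight vector per action, and a semi-gradient Q-learning update. The width, learning rate, and episode count are assumptions to be tuned; ACTIONS is reused from the scenario sketch above.

```python
import numpy as np

CENTERS = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

def rbf_features(s, width=0.5):
    """Normalized Gaussian RBF features for a scalar state s (width is a tuning assumption)."""
    phi = np.exp(-0.5 * ((s - CENTERS) / width) ** 2)
    return phi / phi.sum()

def rbf_q_learning(gamma=0.5, alpha=0.1, eps=0.1, noise_var=0.01,
                   n_episodes=2000, width=0.5, seed=0):
    """Sketch of Q-learning with a linear RBF approximator, one weight vector per action."""
    rng = np.random.default_rng(seed)
    w = np.zeros((2, len(CENTERS)))            # rows: left, right
    for _ in range(n_episodes):
        s = rng.uniform(1.5, 5.5)              # start somewhere inside the corridor
        while 1.5 <= s < 5.5:                  # terminal: s < 1.5 or s >= 5.5
            phi = rbf_features(s, width)
            ai = rng.integers(2) if rng.random() < eps else int(np.argmax(w @ phi))
            s_next = s + ACTIONS[ai] + rng.normal(0.0, np.sqrt(noise_var))
            if s_next >= 5.5:
                r, q_next = 5.0, 0.0
            elif s_next < 1.5:
                r, q_next = 1.0, 0.0
            else:
                r, q_next = 0.0, (w @ rbf_features(s_next, width)).max()
            td_error = r + gamma * q_next - w[ai] @ phi
            w[ai] += alpha * td_error * phi     # semi-gradient update of the chosen action's weights
            s = s_next
    return w
```

Setting noise_var = 0 reproduces the deterministic test case suggested in the hint; varying width shows the influence of the RBF widths.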
Exercise 7 (max 40 points) Bonus points²: implement one additional RL algorithm (e.g., policy
iteration, TD(λ), SARSA, actor-critic, policy search) or an additional variant of the above algorithms
(e.g., state discretization, a different function approximator, eligibility traces). Provide pseudo-code.
Discuss the advantages and disadvantages of the chosen approach.
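If, for instance, SARSA is chosen as the additional algorithm, a tabular sketch differs from the Q-learning sketch above only in the bootstrap target, which uses the action actually taken next (on-policy); it reuses the helpers and imports from the earlier sketches.

```python
def sarsa(gamma=0.5, alpha=0.1, eps=0.1, n_episodes=500, seed=0):
    """Sketch of tabular SARSA (on-policy TD control) on the corridor MDP."""
    rng = random.Random(seed)
    Q = np.zeros((7, 2))                       # states 1..6, actions 0 = left, 1 = right

    def eps_greedy(s):
        return rng.randrange(2) if rng.random() < eps else int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = rng.choice([2, 3, 4, 5])           # assumed start distribution: non-terminal states
        ai = eps_greedy(s)
        while s not in TERMINAL:
            s_next, r = step(s, ACTIONS[ai])
            ai_next = eps_greedy(s_next)       # action actually taken next (on-policy target)
            Q[s, ai] += alpha * (r + gamma * Q[s_next, ai_next] - Q[s, ai])
            s, ai = s_next, ai_next
    return Q
```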
¹ See, e.g., http://en.wikipedia.org/wiki/Radial_basis_function_network#Normalized
² You can get a maximum of 100 points (which corresponds to a grade of 10) for Exercises 1-6. Points you missed in Exercises 1-6 can be compensated for by Exercise 7. The overall maximum grade is 10.