# ECE 232E Project 3: Reinforcement learning and Inverse Reinforcement learning solution

\$29.99

Category:

## Description

1 Introduction
Reinforcement Learning (RL) is the task of learning from interaction to achieve
a goal. The learner and the decision maker is called the agent. The thing
it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the
environment responding to those actions by presenting rewards and new states.
In the first part of the project, we will learn the optimal policy of an agent
navigating in a 2-D environment. We will implement the Value iteration algorithm to learn the optimal policy.
Inverse Reinforcement Learning (IRL) is the task of extracting an expert’s reward function by observing the optimal policy of the expert. In the second
part of the project, we will explore the application of IRL in the context of
apprenticeship learning.
2 Reinforcement learning (RL)
The two main objects in Reinforcement learning are:
• Agent
• Environment
In this project, we will learn the optimal policy of a single agent navigating in
a 2-D environment.
2.1 Environment
In this project, we assume that the environment of the agent is modeled by a
Markov Decision Process (MDP). In a MDP, agents occupy a state of the
environment and perform actions to change the state they are in. After taking
an action, they are given some representation of the new state and some reward
value associated with the new state.
1
An MDP formally is a tuple (S, A,Pa
ss0 , Ra
ss0 , ) where:
• S is a set of states
• A is a set of actions
• Pa
ss0 is a set of transition probabilities, where Pa
ss0 is the probability of
transitioning from state s 2 S to state s0 2 S after taking action a 2 A
– Pa
ss0 = P(st+1 = s0
|st = s, at = a)
• Given any current state and action, s and a, together with any next state,
s0
, the expected value of the next reward is Ra
ss0
– Ra
ss0 = E(rt+1|st = s, at = a, st+1 = s0
)
• 2 [0, 1) is the discount factor, and it is used to compute the present
value of future reward
– If is close to 1 then the future rewards are discounted less
– If is close to 0 then the future rewards are discounted more
In the next few subsections, we will discuss the parameters that will be used to
generate the environment for the project.
2.1.1 State space
In this project, we consider the state space to be a 2-D square grid with 100
states. The 2-D square grid along with the numbering of the states is shown in
figure 1
Figure 1: 2-D square grid with state numbering
2.1.2 Action set
In this project, we consider the action set(A) to contain the 4 following actions:
• Move Right
• Move Left
2
• Move Up
• Move Down
The 4 types of actions are displayed in figure 2
Figure 2: 4 types of action
From the above figure, we can see that the agent can take 4 actions from
the state marked with a dot.
2.1.3 Transition probabilities
In this project, we define the transition probabilities in the following manner:
1. If state s0 and s are not neighboring states in the 2-D grid, then
P(st+1 = s0
|st = s, at = a)=0
s0 and s are neighbors in the 2-D grid if you can move to s0 from s by
taking an action a from the action set A. We will consider a state s to be
a neighbor of itself. For example, from figure 1 we can observe that states
1 and 11 are neighbors (we can transition from 1 to 11 by moving right)
but states 1 and 12 are not neighbors.
2. Each action corresponds to a movement in the intended direction with
probability 1 w, but has a probability of w of moving in a random
direction instead due to wind. To illustrate this, let’s consider the states
shown in figure 3
3
Figure 3: Inner grid states (Non-boundary states)
The transition probabilities for the non-boundary states shown in figure
3 are given below:
P(st+1 = 43|st = 44, at =”)=1 w +
w
4
P(st+1 = 34|st = 44, at =”) = w
4
P(st+1 = 54|st = 44, at =”) = w
4
P(st+1 = 45|st = 44, at =”) = w
4
From the above calculation it can be observed that if the agent is at a nonboundary state then it has 4 neighbors excluding itself and the probability
w is uniformly distributed over the 4 neighbors. Also, if the agent is at
a non-boundary state then it transitions to a new state after taking an
action (P(st+1 = 44|st = 44, at =”) = 0)
3. If the agent is at one of the four corner states (0,9,90,99), the agent stays
at the current state if it takes an action to move o↵ the grid or is blown
o↵ the grid by wind. The actions can be divided into two categories:
• Action to move o↵ the grid
• Action to stay in the grid
To illustrate this, let’s consider the states shown in figure 4
4
Figure 4: Corner states
The transition probabilities for taking an action to move o↵ the grid are
given below:
P(st+1 = 10|st = 0, at =”) = w
4
P(st+1 = 1|st = 0, at =”) = w
4
P(st+1 = 0|st = 0, at =”)=1 w +
w
4 +
w
4
The transition probabilities for taking an action to stay in the grid are
given below:
P(st+1 = 10|st = 0, at =!)=1 w +
w
4
P(st+1 = 1|st = 0, at =!) = w
4
P(st+1 = 0|st = 0, at =!) = w
4 +
w
4
At a corner state, you can be blown o↵ the grid in two directions. As a
result, we have P(st+1 = 0|st = 0, at =!) = w
4 + w
4 since we can be blown
o↵ the grid in two directions and in both the cases we stay at the current
state.
4. If the agent is at one of the edge states, the agent stays at the current
state if it takes an action to move o↵ the grid or is blown o↵ the grid by
wind. The actions can be divided into two categories:
• Action to move o↵ the grid
• Action to stay in the grid
To illustrate this, let’s consider the states shown in figure 5
5
Figure 5: Edge states
The transition probabilities for taking an action to move o↵ the grid are
given below:
P(st+1 = 0|st = 1, at = ) = w
4
P(st+1 = 11|st = 1, at = ) = w
4
P(st+1 = 2|st = 1, at = ) = w
4
P(st+1 = 1|st = 1, at = )=1 w +
w
4
The transition probabilities for taking an action to stay in the grid are
given below:
P(st+1 = 0|st = 1, at =”)=1 w +
w
4
P(st+1 = 11|st = 1, at =”) = w
4
P(st+1 = 2|st = 1, at =”) = w
4
P(st+1 = 1|st = 1, at =”) = w
4
At an edge state, you can be blown o↵ the grid in one direction. As a
result, we have P(st+1 = 1|st = 1, at =”) = w
4 since we can be blown o↵
the grid in one direction and in that case we stay at the current state.
The main di↵erence between a corner state and an edge state is that a corner
state has 2 neighbors and an edge state has 3 neighbors.
2.1.4 Reward function
To simplify the project, we will assume that the reward function is independent
of the current state (s) and the action that you take at the current state (a).
To be specific, reward function only depends on the state that you transition to
(s0
). With this simplification, we have
Ra
ss0 = R(s0
)
6
In this project, we will learn the optimal policy of an agent for two di↵erent
reward functions:
• Reward function 1
• Reward function 2
The two di↵erent reward functions are displayed in figures 6 and 7 respectively
Figure 6: Reward function 1
Figure 7: Reward function 2
Question 1: (10 points) For visualization purpose, generate heat maps of
Reward function 1 and Reward function 2. For the heat maps, make sure you
display the coloring scale. You will have 2 plots for this question
For solving question 1, you might find the following function useful:
https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pcolor.html
7
3 Optimal policy learning using RL algorithms
In this part of the project, we will use reinforcement learning (RL) algorithm
to find the optimal policy. The main steps in RL algorithm are:
• Find optimal state-value or action-value
• Use the optimal state-value or action-value to determine the deterministic
optimal policy
There are a couple of RL algorithms, but we will use the Value iteration algorithm since it was discussed in detail in the lecture. We will skip the derivation
of the algorithm here because it was covered in the lecture (for the derivation
details please refer to the lecture slides on Reinforcement learning). We will just
reproduce the algorithm below for the ease of implementation:
1: procedure Value Iteration(Pa
ss0 , Ra
ss0 , S, A, ):
2: for all s 2 S do . Initialization
3: V (s) 0
4: end for
5: 1
6: while > ✏ do . Estimation
7: 0
8: for all s 2 S do
9: v V (s);
10: V (s) max
a2A
P
s02S Pa
ss0 [Ra
ss0 + V (s0
)];
11: max(, |v V (s)|);
12: end for
13: end while
14: for all s 2 S do . Computation
15: ⇡(s) arg max
a2A
P
s02S Pa
ss0 [Ra
ss0 + V (s0
)];
16: end for
17: end procedure return ⇡
Question 2: (40 points) Create the environment of the agent using the information provided in section 2. To be specific, create the MDP by setting up
the state-space, action set, transition probabilities, discount factor, and reward
function. For creating the environment, use the following set of parameters:
• Number of states = 100 (state space is a 10 by 10 square grid as displayed
in figure 1)
• Number of actions = 4 (set of possible actions is displayed in figure 2)
• w = 0.1
• Discount factor = 0.8
• Reward function 1
8
After you have created the environment, then write an optimal state-value function that takes as input the environment of the agent and outputs the optimal
value of each state in the grid. For the optimal state-value function, you have
to implement the Initialization (lines 2-4) and Estimation (lines 5-13) steps of
the Value Iteration algorithm. For the estimation step, use ✏ = 0.01. For visualization purpose, you should generate a figure similar to that of figure 1 but
with the number of state replaced by the optimal value of that state. In this
question, you should have 1 plot.
Question 3: (5 points) Generate a heat map of the optimal state values across
the 2-D grid. For generating the heat map, you can use the same function provided in the hint earlier (see the hint after question 1).
Question 4: (15 points) Explain the distribution of the optimal state values
across the 2-D grid. (Hint: Use the figure generated in question 3 to explain)
Question 5: (30 points) Implement the computation step of the value iteration
algorithm (lines 14-17) to compute the optimal policy of the agent navigating
the 2-D state-space. For visualization purpose, you should generate a figure
similar to that of figure 1 but with the number of state replaced by the optimal
action at that state. The optimal actions should be displayed using arrows.
Does the optimal policy of the agent match your intuition? Please provide a
brief explanation. Is it possible for the agent to compute the optimal action to
take at each state by observing the optimal values of it’s neighboring states? In
this question, you should have 1 plot.
Question 6: (10 points) Modify the environment of the agent by replacing Reward function 1 with Reward function 2. Use the optimal state-value function
implemented in question 2 to compute the optimal value of each state in the
grid. For visualization purpose, you should generate a figure similar to that
of figure 1 but with the number of state replaced by the optimal value of that
state. In this question, you should have 1 plot.
Question 7: (10 points) Generate a heat map of the optimal state values (found
in question 6) across the 2-D grid. For generating the heat map, you can use
the same function provided in the hint earlier.
Question 8: (20 points) Explain the distribution of the optimal state values
across the 2-D grid. (Hint: Use the figure generated in question 7 to explain)
Question 9: (20 points) Implement the computation step of the value iteration
algorithm (lines 14-17) to compute the optimal policy of the agent navigating
the 2-D state-space. For visualization purpose, you should generate a figure
similar to that of figure 1 but with the number of state replaced by the optimal
action at that state. The optimal actions should be displayed using arrows.
Does the optimal policy of the agent match your intuition? Please provide a
brief explanation. In this question, you should have 1 plot.
9
4 Inverse Reinforcement learning (IRL)
Inverse Reinforcement learning (IRL) is the task of learning an expert’s reward
function by observing the optimal behavior of the expert. The motivation for
IRL comes from apprenticeship learning. In apprenticeship learning, the goal of
the agent is to learn a policy by observing the behavior of an expert. This task
can be accomplished in two ways:
1. Learn the policy directly from expert behavior
2. Learn the expert’s reward function and use it to generate the optimal
policy
The second way is preferred because the reward function provides a much more
parsimonious description of behavior. Reward function, rather than the policy,
is the most succinct, robust, and transferable definition of the task. Therefore, extracting the reward function of an expert would help design more robust
agents.
In this part of the project, we will use IRL algorithm to extract the reward
function. We will use the optimal policy computed in the previous section as
the expert behavior and use the algorithm to extract the reward function of
the expert. Then, we will use the extracted reward function to compute the
optimal policy of the agent. We will compare the optimal policy of the agent
to the optimal policy of the expert and use some similarity metric between the
two to measure the performance of the IRL algorithm.
4.1 IRL algorithm
For finite state spaces, there are a couple of IRL algorithms for extracting the
reward function:
• Linear Programming (LP) formulation
• Maximum Entropy formulation
Since we covered LP formulation in the lecture and it is the simplest IRL algorithm, so we will use the LP formulation in this project. We will skip the
derivation of the algorithm (for details on the derivation please refer to the
lecture slides) here. The LP formulation of the IRL is given by equation 1
maximize
R,ti,ui
P|S|
i=1(ti ui)
subject to [(Pa1 (i) Pa(i))(I Pa1 )1R] ti, 8a 2 A \ a1, 8i
(Pa1 Pa)(I Pa1 )1R ⌫ 0, 8a 2 A \ a1
u R u
|Ri|  Rmax, i = 1, 2, ··· , |S|
(1)
In the LP given by equation 1, R is the reward vector (R(i) = R(si)), Pa is
the transition probability matrix, is the adjustable penalty coecient, and
ti’s and ui’s are the extra optimization variables (please note that u(i) = ui).
Use the maximum absolute value of the ground truth reward as Rmax. For the
10
ease of implementation, we can recast the LP in equation 1 into an equivalent
form given by equation 2 using block matrices.
maximize x cT x
subject to Dx 0, 8a 2 A \ a1
(2)
Question 10: (10 points) Express c, x, D in terms of R, Pa, Pa1 , ti, u, and
Rmax
4.2 Performance measure
In this project, we use a very simple measure to evaluate the performance of the
IRL algorithm. Before we state the performance measure, let’s introduce some
notation:
• OA(s): Optimal action of the agent at state s
• OE(s): Optimal action of the expert at state s

m(s) = (
1, OA(s) = OE(s)
0, else
Then with the above notation, accuracy is given by equation 3
Accuracy =
P
s2S m(s)
|S| (3)
Since we are using the optimal policy found in the previous section as the expert
behavior, so we will use the optimal policy found in the previous section to fill
the OE(s) values. Please note that these values will be di↵erent depending on
whether we used Reward Function 1 or Reward Function 2 to create the environment.
To compute OA(s), we will solve the linear program given by equation 2 to
extract the reward function of the expert. For solving the linear program you
can use the LP solver in python (from cvxopt import solvers and then use
solvers.lp). Then, we will use the extracted reward function to compute the optimal policy of the agent using the value iteration algorithm you implemented
in the previous section. The optimal policy of the agent found in this manner
will be used to fill the OA(s) values. Please note that these values will depend
on the adjustable penalty coecient . We will tune to maximize the accuracy.
Question 11: (30 points) Sweep from 0 to 5 to get 500 evenly spaced values for . For each value of compute OA(s) by following the process described
above. For this problem, use the optimal policy of the agent found in question
5 to fill in the OE(s) values. Then use equation 3 to compute the accuracy of
the IRL algorithm for this value of . You need to repeat the above process
for all 500 values of to get 500 data points. Plot (x-axis) against Accuracy
(y-axis). In this question, you should have 1 plot.
11
Question 12: (5 points) Use the plot in question 11 to compute the value of
for which accuracy is maximum. For future reference we will denote this
value as (1)
max
Question 13: (15 points) For (1)
max, generate heat maps of the ground truth
reward and the extracted reward. Please note that the ground truth reward
is the Reward function 1 and the extracted reward is computed by solving the
linear program given by equation 2 with the parameter set to (1)
max. In this
question, you should have 2 plots.
Question 14: (10 points) Use the extracted reward function computed in question 13, to compute the optimal values of the states in the 2-D grid. For computing the optimal values you need to use the optimal state-value function that
you wrote in question 2. For visualization purpose, generate a heat map of
the optimal state values across the 2-D grid (similar to the figure generated in
question 3). In this question, you should have 1 plot.
Question 15: (10 points) Compare the heat maps of Question 3 and Question 14 and provide a brief explanation on their similarities and di↵erences.
Question 16: (10 points) Use the extracted reward function found in question 13
to compute the optimal policy of the agent. For computing the optimal policy
of the agent you need to use the function that you wrote in question 5. For visualization purpose, you should generate a figure similar to that of figure 1 but
with the number of state replaced by the optimal action at that state. The actions should be displayed using arrows. In this question, you should have 1 plot.
Question 17: (10 points) Compare the figures of Question 5 and Question 16
and provide a brief explanation on their similarities and di↵erences.
Question 18: (30 points) Sweep from 0 to 5 to get 500 evenly spaced values for . For each value of compute OA(s) by following the process described
above. For this problem, use the optimal policy of the agent found in question
9 to fill in the OE(s) values. Then use equation 3 to compute the accuracy of
the IRL algorithm for this value of . You need to repeat the above process
for all 500 values of to get 500 data points. Plot (x-axis) against Accuracy
(y-axis). In this question, you should have 1 plot.
Question 19: (5 points) Use the plot in question 18 to compute the value of
for which accuracy is maximum. For future reference we will denote this
value as (2)
max
Question 20: (15 points) For (2)
max, generate heat maps of the ground truth
reward and the extracted reward. Please note that the ground truth reward
is the Reward function 2 and the extracted reward is computed by solving the
linear program given by equation 2 with the parameter set to (2)
max. In this
question, you should have 2 plots.
Question 21: (10 points) Use the extracted reward function computed in ques12
tion 20, to compute the optimal values of the states in the 2-D grid. For computing the optimal values you need to use the optimal state-value function that
you wrote in question 2. For visualization purpose, generate a heat map of
the optimal state values across the 2-D grid (similar to the figure generated in
question 7). In this question, you should have 1 plot.
Question 22: (10 points) Compare the heat maps of Question 7 and Question 21 and provide a brief explanation on their similarities and di↵erences.
Question 23: (10 points) Use the extracted reward function found in question 20
to compute the optimal policy of the agent. For computing the optimal policy
of the agent you need to use the function that you wrote in question 9. For visualization purpose, you should generate a figure similar to that of figure 1 but
with the number of state replaced by the optimal action at that state. The actions should be displayed using arrows. In this question, you should have 1 plot.
Question 24: (10 points) Compare the figures of Question 9 and Question 23
and provide a brief explanation on their similarities and di↵erences.
Question 25: (50 points) From the figure in question 23, you should observe
that the optimal policy of the agent has two major discrepancies. Please identify and provide the causes for these two discrepancies. One of the discrepancy
can be fixed easily by a slight modification to the value iteration algorithm.
Perform this modification and re-run the modified value iteration algorithm to
compute the optimal policy of the agent. Also, recompute the maximum accuracy after this modification. Is there a change in maximum accuracy? The
second discrepancy is harder to fix and is a limitation of the simple IRL algorithm. If you can provide a solution to the second discrepancy then we will give
you a bonus of 50 points.
5 Submission
Please submit a zip file containing your codes and report to CCLE.
The zip file should be named as ”Project2 UID1 … UIDn.zip” where
UIDx are student ID numbers of team members.
13