## Description

1 Introduction

Reinforcement Learning (RL) is the task of learning from interaction to achieve

a goal. The learner and the decision maker is called the agent. The thing

it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the

environment responding to those actions by presenting rewards and new states.

In the first part of the project, we will learn the optimal policy of an agent

navigating in a 2-D environment. We will implement the Value iteration algorithm to learn the optimal policy.

Inverse Reinforcement Learning (IRL) is the task of extracting an expert’s reward function by observing the optimal policy of the expert. In the second

part of the project, we will explore the application of IRL in the context of

apprenticeship learning.

2 Reinforcement learning (RL)

The two main objects in Reinforcement learning are:

• Agent

• Environment

In this project, we will learn the optimal policy of a single agent navigating in

a 2-D environment.

2.1 Environment

In this project, we assume that the environment of the agent is modeled by a

Markov Decision Process (MDP). In a MDP, agents occupy a state of the

environment and perform actions to change the state they are in. After taking

an action, they are given some representation of the new state and some reward

value associated with the new state.

1

An MDP formally is a tuple (S, A,Pa

ss0 , Ra

ss0 , ) where:

• S is a set of states

• A is a set of actions

• Pa

ss0 is a set of transition probabilities, where Pa

ss0 is the probability of

transitioning from state s 2 S to state s0 2 S after taking action a 2 A

– Pa

ss0 = P(st+1 = s0

|st = s, at = a)

• Given any current state and action, s and a, together with any next state,

s0

, the expected value of the next reward is Ra

ss0

– Ra

ss0 = E(rt+1|st = s, at = a, st+1 = s0

)

• 2 [0, 1) is the discount factor, and it is used to compute the present

value of future reward

– If is close to 1 then the future rewards are discounted less

– If is close to 0 then the future rewards are discounted more

In the next few subsections, we will discuss the parameters that will be used to

generate the environment for the project.

2.1.1 State space

In this project, we consider the state space to be a 2-D square grid with 100

states. The 2-D square grid along with the numbering of the states is shown in

figure 1

Figure 1: 2-D square grid with state numbering

2.1.2 Action set

In this project, we consider the action set(A) to contain the 4 following actions:

• Move Right

• Move Left

2

• Move Up

• Move Down

The 4 types of actions are displayed in figure 2

Figure 2: 4 types of action

From the above figure, we can see that the agent can take 4 actions from

the state marked with a dot.

2.1.3 Transition probabilities

In this project, we define the transition probabilities in the following manner:

1. If state s0 and s are not neighboring states in the 2-D grid, then

P(st+1 = s0

|st = s, at = a)=0

s0 and s are neighbors in the 2-D grid if you can move to s0 from s by

taking an action a from the action set A. We will consider a state s to be

a neighbor of itself. For example, from figure 1 we can observe that states

1 and 11 are neighbors (we can transition from 1 to 11 by moving right)

but states 1 and 12 are not neighbors.

2. Each action corresponds to a movement in the intended direction with

probability 1 w, but has a probability of w of moving in a random

direction instead due to wind. To illustrate this, let’s consider the states

shown in figure 3

3

Figure 3: Inner grid states (Non-boundary states)

The transition probabilities for the non-boundary states shown in figure

3 are given below:

P(st+1 = 43|st = 44, at =”)=1 w +

w

4

P(st+1 = 34|st = 44, at =”) = w

4

P(st+1 = 54|st = 44, at =”) = w

4

P(st+1 = 45|st = 44, at =”) = w

4

From the above calculation it can be observed that if the agent is at a nonboundary state then it has 4 neighbors excluding itself and the probability

w is uniformly distributed over the 4 neighbors. Also, if the agent is at

a non-boundary state then it transitions to a new state after taking an

action (P(st+1 = 44|st = 44, at =”) = 0)

3. If the agent is at one of the four corner states (0,9,90,99), the agent stays

at the current state if it takes an action to move o↵ the grid or is blown

o↵ the grid by wind. The actions can be divided into two categories:

• Action to move o↵ the grid

• Action to stay in the grid

To illustrate this, let’s consider the states shown in figure 4

4

Figure 4: Corner states

The transition probabilities for taking an action to move o↵ the grid are

given below:

P(st+1 = 10|st = 0, at =”) = w

4

P(st+1 = 1|st = 0, at =”) = w

4

P(st+1 = 0|st = 0, at =”)=1 w +

w

4 +

w

4

The transition probabilities for taking an action to stay in the grid are

given below:

P(st+1 = 10|st = 0, at =!)=1 w +

w

4

P(st+1 = 1|st = 0, at =!) = w

4

P(st+1 = 0|st = 0, at =!) = w

4 +

w

4

At a corner state, you can be blown o↵ the grid in two directions. As a

result, we have P(st+1 = 0|st = 0, at =!) = w

4 + w

4 since we can be blown

o↵ the grid in two directions and in both the cases we stay at the current

state.

4. If the agent is at one of the edge states, the agent stays at the current

state if it takes an action to move o↵ the grid or is blown o↵ the grid by

wind. The actions can be divided into two categories:

• Action to move o↵ the grid

• Action to stay in the grid

To illustrate this, let’s consider the states shown in figure 5

5

Figure 5: Edge states

The transition probabilities for taking an action to move o↵ the grid are

given below:

P(st+1 = 0|st = 1, at = ) = w

4

P(st+1 = 11|st = 1, at = ) = w

4

P(st+1 = 2|st = 1, at = ) = w

4

P(st+1 = 1|st = 1, at = )=1 w +

w

4

The transition probabilities for taking an action to stay in the grid are

given below:

P(st+1 = 0|st = 1, at =”)=1 w +

w

4

P(st+1 = 11|st = 1, at =”) = w

4

P(st+1 = 2|st = 1, at =”) = w

4

P(st+1 = 1|st = 1, at =”) = w

4

At an edge state, you can be blown o↵ the grid in one direction. As a

result, we have P(st+1 = 1|st = 1, at =”) = w

4 since we can be blown o↵

the grid in one direction and in that case we stay at the current state.

The main di↵erence between a corner state and an edge state is that a corner

state has 2 neighbors and an edge state has 3 neighbors.

2.1.4 Reward function

To simplify the project, we will assume that the reward function is independent

of the current state (s) and the action that you take at the current state (a).

To be specific, reward function only depends on the state that you transition to

(s0

). With this simplification, we have

Ra

ss0 = R(s0

)

6

In this project, we will learn the optimal policy of an agent for two di↵erent

reward functions:

• Reward function 1

• Reward function 2

The two di↵erent reward functions are displayed in figures 6 and 7 respectively

Figure 6: Reward function 1

Figure 7: Reward function 2

Question 1: (10 points) For visualization purpose, generate heat maps of

Reward function 1 and Reward function 2. For the heat maps, make sure you

display the coloring scale. You will have 2 plots for this question

For solving question 1, you might find the following function useful:

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.pcolor.html

7

3 Optimal policy learning using RL algorithms

In this part of the project, we will use reinforcement learning (RL) algorithm

to find the optimal policy. The main steps in RL algorithm are:

• Find optimal state-value or action-value

• Use the optimal state-value or action-value to determine the deterministic

optimal policy

There are a couple of RL algorithms, but we will use the Value iteration algorithm since it was discussed in detail in the lecture. We will skip the derivation

of the algorithm here because it was covered in the lecture (for the derivation

details please refer to the lecture slides on Reinforcement learning). We will just

reproduce the algorithm below for the ease of implementation:

1: procedure Value Iteration(Pa

ss0 , Ra

ss0 , S, A, ):

2: for all s 2 S do . Initialization

3: V (s) 0

4: end for

5: 1

6: while > ✏ do . Estimation

7: 0

8: for all s 2 S do

9: v V (s);

10: V (s) max

a2A

P

s02S Pa

ss0 [Ra

ss0 + V (s0

)];

11: max(, |v V (s)|);

12: end for

13: end while

14: for all s 2 S do . Computation

15: ⇡(s) arg max

a2A

P

s02S Pa

ss0 [Ra

ss0 + V (s0

)];

16: end for

17: end procedure return ⇡

Question 2: (40 points) Create the environment of the agent using the information provided in section 2. To be specific, create the MDP by setting up

the state-space, action set, transition probabilities, discount factor, and reward

function. For creating the environment, use the following set of parameters:

• Number of states = 100 (state space is a 10 by 10 square grid as displayed

in figure 1)

• Number of actions = 4 (set of possible actions is displayed in figure 2)

• w = 0.1

• Discount factor = 0.8

• Reward function 1

8

After you have created the environment, then write an optimal state-value function that takes as input the environment of the agent and outputs the optimal

value of each state in the grid. For the optimal state-value function, you have

to implement the Initialization (lines 2-4) and Estimation (lines 5-13) steps of

the Value Iteration algorithm. For the estimation step, use ✏ = 0.01. For visualization purpose, you should generate a figure similar to that of figure 1 but

with the number of state replaced by the optimal value of that state. In this

question, you should have 1 plot.

Question 3: (5 points) Generate a heat map of the optimal state values across

the 2-D grid. For generating the heat map, you can use the same function provided in the hint earlier (see the hint after question 1).

Question 4: (15 points) Explain the distribution of the optimal state values

across the 2-D grid. (Hint: Use the figure generated in question 3 to explain)

Question 5: (30 points) Implement the computation step of the value iteration

algorithm (lines 14-17) to compute the optimal policy of the agent navigating

the 2-D state-space. For visualization purpose, you should generate a figure

similar to that of figure 1 but with the number of state replaced by the optimal

action at that state. The optimal actions should be displayed using arrows.

Does the optimal policy of the agent match your intuition? Please provide a

brief explanation. Is it possible for the agent to compute the optimal action to

take at each state by observing the optimal values of it’s neighboring states? In

this question, you should have 1 plot.

Question 6: (10 points) Modify the environment of the agent by replacing Reward function 1 with Reward function 2. Use the optimal state-value function

implemented in question 2 to compute the optimal value of each state in the

grid. For visualization purpose, you should generate a figure similar to that

of figure 1 but with the number of state replaced by the optimal value of that

state. In this question, you should have 1 plot.

Question 7: (10 points) Generate a heat map of the optimal state values (found

in question 6) across the 2-D grid. For generating the heat map, you can use

the same function provided in the hint earlier.

Question 8: (20 points) Explain the distribution of the optimal state values

across the 2-D grid. (Hint: Use the figure generated in question 7 to explain)

Question 9: (20 points) Implement the computation step of the value iteration

algorithm (lines 14-17) to compute the optimal policy of the agent navigating

the 2-D state-space. For visualization purpose, you should generate a figure

similar to that of figure 1 but with the number of state replaced by the optimal

action at that state. The optimal actions should be displayed using arrows.

Does the optimal policy of the agent match your intuition? Please provide a

brief explanation. In this question, you should have 1 plot.

9

4 Inverse Reinforcement learning (IRL)

Inverse Reinforcement learning (IRL) is the task of learning an expert’s reward

function by observing the optimal behavior of the expert. The motivation for

IRL comes from apprenticeship learning. In apprenticeship learning, the goal of

the agent is to learn a policy by observing the behavior of an expert. This task

can be accomplished in two ways:

1. Learn the policy directly from expert behavior

2. Learn the expert’s reward function and use it to generate the optimal

policy

The second way is preferred because the reward function provides a much more

parsimonious description of behavior. Reward function, rather than the policy,

is the most succinct, robust, and transferable definition of the task. Therefore, extracting the reward function of an expert would help design more robust

agents.

In this part of the project, we will use IRL algorithm to extract the reward

function. We will use the optimal policy computed in the previous section as

the expert behavior and use the algorithm to extract the reward function of

the expert. Then, we will use the extracted reward function to compute the

optimal policy of the agent. We will compare the optimal policy of the agent

to the optimal policy of the expert and use some similarity metric between the

two to measure the performance of the IRL algorithm.

4.1 IRL algorithm

For finite state spaces, there are a couple of IRL algorithms for extracting the

reward function:

• Linear Programming (LP) formulation

• Maximum Entropy formulation

Since we covered LP formulation in the lecture and it is the simplest IRL algorithm, so we will use the LP formulation in this project. We will skip the

derivation of the algorithm (for details on the derivation please refer to the

lecture slides) here. The LP formulation of the IRL is given by equation 1

maximize

R,ti,ui

P|S|

i=1(ti ui)

subject to [(Pa1 (i) Pa(i))(I Pa1 )1R] ti, 8a 2 A \ a1, 8i

(Pa1 Pa)(I Pa1 )1R ⌫ 0, 8a 2 A \ a1

u R u

|Ri| Rmax, i = 1, 2, ··· , |S|

(1)

In the LP given by equation 1, R is the reward vector (R(i) = R(si)), Pa is

the transition probability matrix, is the adjustable penalty coecient, and

ti’s and ui’s are the extra optimization variables (please note that u(i) = ui).

Use the maximum absolute value of the ground truth reward as Rmax. For the

10

ease of implementation, we can recast the LP in equation 1 into an equivalent

form given by equation 2 using block matrices.

maximize x cT x

subject to Dx 0, 8a 2 A \ a1

(2)

Question 10: (10 points) Express c, x, D in terms of R, Pa, Pa1 , ti, u, and

Rmax

4.2 Performance measure

In this project, we use a very simple measure to evaluate the performance of the

IRL algorithm. Before we state the performance measure, let’s introduce some

notation:

• OA(s): Optimal action of the agent at state s

• OE(s): Optimal action of the expert at state s

•

m(s) = (

1, OA(s) = OE(s)

0, else

Then with the above notation, accuracy is given by equation 3

Accuracy =

P

s2S m(s)

|S| (3)

Since we are using the optimal policy found in the previous section as the expert

behavior, so we will use the optimal policy found in the previous section to fill

the OE(s) values. Please note that these values will be di↵erent depending on

whether we used Reward Function 1 or Reward Function 2 to create the environment.

To compute OA(s), we will solve the linear program given by equation 2 to

extract the reward function of the expert. For solving the linear program you

can use the LP solver in python (from cvxopt import solvers and then use

solvers.lp). Then, we will use the extracted reward function to compute the optimal policy of the agent using the value iteration algorithm you implemented

in the previous section. The optimal policy of the agent found in this manner

will be used to fill the OA(s) values. Please note that these values will depend

on the adjustable penalty coecient . We will tune to maximize the accuracy.

Question 11: (30 points) Sweep from 0 to 5 to get 500 evenly spaced values for . For each value of compute OA(s) by following the process described

above. For this problem, use the optimal policy of the agent found in question

5 to fill in the OE(s) values. Then use equation 3 to compute the accuracy of

the IRL algorithm for this value of . You need to repeat the above process

for all 500 values of to get 500 data points. Plot (x-axis) against Accuracy

(y-axis). In this question, you should have 1 plot.

11

Question 12: (5 points) Use the plot in question 11 to compute the value of

for which accuracy is maximum. For future reference we will denote this

value as (1)

max. Please report (1)

max

Question 13: (15 points) For (1)

max, generate heat maps of the ground truth

reward and the extracted reward. Please note that the ground truth reward

is the Reward function 1 and the extracted reward is computed by solving the

linear program given by equation 2 with the parameter set to (1)

max. In this

question, you should have 2 plots.

Question 14: (10 points) Use the extracted reward function computed in question 13, to compute the optimal values of the states in the 2-D grid. For computing the optimal values you need to use the optimal state-value function that

you wrote in question 2. For visualization purpose, generate a heat map of

the optimal state values across the 2-D grid (similar to the figure generated in

question 3). In this question, you should have 1 plot.

Question 15: (10 points) Compare the heat maps of Question 3 and Question 14 and provide a brief explanation on their similarities and di↵erences.

Question 16: (10 points) Use the extracted reward function found in question 13

to compute the optimal policy of the agent. For computing the optimal policy

of the agent you need to use the function that you wrote in question 5. For visualization purpose, you should generate a figure similar to that of figure 1 but

with the number of state replaced by the optimal action at that state. The actions should be displayed using arrows. In this question, you should have 1 plot.

Question 17: (10 points) Compare the figures of Question 5 and Question 16

and provide a brief explanation on their similarities and di↵erences.

Question 18: (30 points) Sweep from 0 to 5 to get 500 evenly spaced values for . For each value of compute OA(s) by following the process described

above. For this problem, use the optimal policy of the agent found in question

9 to fill in the OE(s) values. Then use equation 3 to compute the accuracy of

the IRL algorithm for this value of . You need to repeat the above process

for all 500 values of to get 500 data points. Plot (x-axis) against Accuracy

(y-axis). In this question, you should have 1 plot.

Question 19: (5 points) Use the plot in question 18 to compute the value of

for which accuracy is maximum. For future reference we will denote this

value as (2)

max. Please report (2)

max

Question 20: (15 points) For (2)

max, generate heat maps of the ground truth

reward and the extracted reward. Please note that the ground truth reward

is the Reward function 2 and the extracted reward is computed by solving the

linear program given by equation 2 with the parameter set to (2)

max. In this

question, you should have 2 plots.

Question 21: (10 points) Use the extracted reward function computed in ques12

tion 20, to compute the optimal values of the states in the 2-D grid. For computing the optimal values you need to use the optimal state-value function that

you wrote in question 2. For visualization purpose, generate a heat map of

the optimal state values across the 2-D grid (similar to the figure generated in

question 7). In this question, you should have 1 plot.

Question 22: (10 points) Compare the heat maps of Question 7 and Question 21 and provide a brief explanation on their similarities and di↵erences.

Question 23: (10 points) Use the extracted reward function found in question 20

to compute the optimal policy of the agent. For computing the optimal policy

of the agent you need to use the function that you wrote in question 9. For visualization purpose, you should generate a figure similar to that of figure 1 but

with the number of state replaced by the optimal action at that state. The actions should be displayed using arrows. In this question, you should have 1 plot.

Question 24: (10 points) Compare the figures of Question 9 and Question 23

and provide a brief explanation on their similarities and di↵erences.

Question 25: (50 points) From the figure in question 23, you should observe

that the optimal policy of the agent has two major discrepancies. Please identify and provide the causes for these two discrepancies. One of the discrepancy

can be fixed easily by a slight modification to the value iteration algorithm.

Perform this modification and re-run the modified value iteration algorithm to

compute the optimal policy of the agent. Also, recompute the maximum accuracy after this modification. Is there a change in maximum accuracy? The

second discrepancy is harder to fix and is a limitation of the simple IRL algorithm. If you can provide a solution to the second discrepancy then we will give

you a bonus of 50 points.

5 Submission

Please submit a zip file containing your codes and report to CCLE.

The zip file should be named as ”Project2 UID1 … UIDn.zip” where

UIDx are student ID numbers of team members.

13