## Description

A Markov Decision Process (MDP) is defined by the tuple (S, A, R, P, γ), where S is the set of possible states, A is the set of possible actions, R is the distribution of the reward given a (state, action) pair, P is the transition probability, and γ is a discount factor. In this homework, we work with a simple 5 × 5 grid game (see below) in which the upper-left corner (i.e. state 1-1) and the lower-right corner (i.e. state 5-5) are terminal states. There are therefore 5 × 5 = 25 states (i.e. |S| = 25), each denoted s(i, j), where i and j index the i-th row and j-th column, respectively. The four possible actions are {right, left, up, down}. We set a negative reward r = −5 for each transition, a discount factor γ = 0.7, and a uniform initial-state probability p(s₀) = 1/25. The goal is to reach one of the terminal states (smiling states) in the least number of actions; in other words, we aim to find the optimal policy π* that maximizes the cumulative discounted reward. You may initialize the Q table randomly.

## Q1 [Value iteration algorithm]

Using the Bellman equation, implement the value iteration algorithm to iteratively update the Q table until convergence.
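One possible shape for this question is sketched below: a random initial Q table is swept repeatedly with the Bellman optimality backup Q(s, a) ← r + γ·max_a′ Q(s′, a′) until the largest change falls below a tolerance. The environment details mirror the description, but all names and the convergence threshold are my own choices.

```python
import random

# Minimal value-iteration sketch for the 5x5 grid: terminals at (1, 1) and
# (5, 5), reward -5 per move, gamma = 0.7, deterministic transitions.
N, GAMMA, R = 5, 0.7, -5
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
TERMINALS = {(1, 1), (5, 5)}
STATES = [(i, j) for i in range(1, N + 1) for j in range(1, N + 1)]

def next_state(s, a):
    i, j = s
    di, dj = ACTIONS[a]
    ni, nj = i + di, j + dj
    return (ni, nj) if 1 <= ni <= N and 1 <= nj <= N else s

# Random initial Q table; terminal entries pinned at zero.
Q = {(s, a): random.random() for s in STATES for a in ACTIONS}
for s in TERMINALS:
    for a in ACTIONS:
        Q[(s, a)] = 0.0

# Sweep with the Bellman optimality backup until the Q table stops changing.
delta = 1.0
while delta > 1e-6:
    delta = 0.0
    for s in STATES:
        if s in TERMINALS:
            continue
        for a in ACTIONS:
            s2 = next_state(s, a)
            target = R + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
            delta = max(delta, abs(target - Q[(s, a)]))
            Q[(s, a)] = target
```

After convergence, the Q-value of moving left from (1, 2) is −5 (one step to the terminal), and from (1, 3) it is −5 + 0.7·(−5) = −8.5.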

## Q2 [Deep Q-learning]

This question is the same as Question 1, except that the Q table is replaced by a simple four-layer (i.e. three hidden layers and one output layer) fully-connected deep neural network that approximates it. The three hidden layers contain 32, 8, and 16 neurons, respectively, and each uses a ReLU activation function; the output layer is linear.
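The specified architecture could look like the following PyTorch sketch. The state encoding is my assumption (a one-hot vector over the 25 states); the assignment fixes only the layer sizes, ReLU activations, and linear output.

```python
import torch
import torch.nn as nn

# Sketch of the specified network: hidden layers of 32, 8, and 16 units with
# ReLU, and a linear output head producing one Q-value per action.
class QNetwork(nn.Module):
    def __init__(self, n_states: int = 25, n_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, 32), nn.ReLU(),
            nn.Linear(32, 8), nn.ReLU(),
            nn.Linear(8, 16), nn.ReLU(),
            nn.Linear(16, n_actions),   # linear output: raw Q-values
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def one_hot(s: int, n: int = 25) -> torch.Tensor:
    """Assumed state encoding: one-hot over the 25 grid cells."""
    v = torch.zeros(n)
    v[s] = 1.0
    return v
```

A forward pass on a batch of one state then yields a tensor of shape (1, 4), one entry per action.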

## Q3 [Deep Q-learning with experience replay, bonus question]

This question is the same as Question 2, except that you need to implement Deep Q-learning with experience replay. Note that if you finish this bonus question, the points you obtain in this HW will be doubled.
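The core new piece for this question is a replay buffer: transitions are stored as they are collected and uniform random minibatches are sampled for each update, which decorrelates consecutive training examples. A minimal sketch, with names of my own choosing:

```python
import random
from collections import deque

# Minimal experience-replay buffer: stores (state, action, reward,
# next_state, done) transitions and samples uniform random minibatches.
class ReplayBuffer:
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        # Transpose into columns: states, actions, rewards, next_states, dones
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)
```

In the training loop, each environment step would push one transition, and once the buffer holds enough samples, each gradient update would draw a minibatch via `sample`.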

Questions 1 and 2 are both programming assignments. Please use PyTorch for the implementation of the second one, and submit your code along with the initial and final Q tables for both questions.
