CMPSCI 689: Mini Project 3 solution


Question 1 (25 points)
Consider a Markov decision process (MDP) defined by M = (S, A, P, R, γ), where S is a finite set of states, A is a set of actions, P is the transition matrix where P(s'|s, a) gives the distribution over future states s' conditioned on doing action a in state s, and R is the expected reward where R(s, a) is the reward for doing action a in state s.
Part a: (10 points) Assume that π : S → A is a fixed policy mapping states to actions for which we desire to compute its associated value function V^π. Bellman's equation tells us that V^π is the fixed point of the Bellman operator

    T^π(V) = R^π + γ P^π V,

so that T^π(V^π) = V^π. Here, R^π = R(s, π(s)) and P^π = P(s'|s, π(s)). However, when S is very large, it is preferable to compute an approximation V̂^π = Φw, where Φ is a basis matrix in which each column is a basis function (e.g., a radial basis function, a polynomial, or some equivalent approximator). We can try to find the weights w for the composition of the Bellman operator T^π and the orthogonal projector Π_Φ onto the column space of the basis matrix Φ. In particular, show that the solution to the following minimization problem

    w_FP = argmin_w ||Π_Φ T^π(Φw) − Φw||

is given by the weights

    w_FP = (Φ^T (I − γ P^π) Φ)^{-1} Φ^T R^π.
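For reference, one route to this result (a sketch, assuming Φ has full column rank and that the projection is the plain unweighted one, Π_Φ = Φ(Φ^T Φ)^{-1} Φ^T) is to require that V̂^π = Φw be a fixed point of the composed operator Π_Φ T^π:

    \Phi w = \Pi_\Phi T^\pi(\Phi w) = \Phi(\Phi^T \Phi)^{-1} \Phi^T \bigl( R^\pi + \gamma P^\pi \Phi w \bigr).

Left-multiplying by Φ^T and cancelling Φ^T Φ (Φ^T Φ)^{-1} gives

    \Phi^T \Phi w = \Phi^T R^\pi + \gamma \Phi^T P^\pi \Phi w
    \quad\Longrightarrow\quad
    \Phi^T (I - \gamma P^\pi) \Phi\, w = \Phi^T R^\pi,

which is solved by the stated w_FP whenever Φ^T (I − γ P^π) Φ is invertible.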
Part b: (15 points) Rather than computing the fixed point of the composition of the orthogonal projector and the Bellman operator, a simpler approach is to solve the fixed-point equation T^π(Φw) = Φw in the least-squares sense. Show that the least-squares projection is defined by V̂ = Φ w_LS, where

    w_LS = (Φ^T (I − γ P^π)^T (I − γ P^π) Φ)^{-1} Φ^T (I − γ P^π)^T R^π.
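One way to see the form of w_LS (again a sketch, assuming (I − γ P^π)Φ has full column rank) is to write the Bellman residual of V̂ = Φw and solve the resulting ordinary least-squares problem:

    T^\pi(\Phi w) - \Phi w = R^\pi - (I - \gamma P^\pi) \Phi w = R^\pi - A w,
    \qquad A := (I - \gamma P^\pi) \Phi.

Minimizing ||R^π − A w||^2 over w gives the normal equations A^T A w = A^T R^π, i.e.

    w_{LS} = (A^T A)^{-1} A^T R^\pi
           = \bigl( \Phi^T (I - \gamma P^\pi)^T (I - \gamma P^\pi) \Phi \bigr)^{-1} \Phi^T (I - \gamma P^\pi)^T R^\pi.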
Question 2 (25 points)
The goal of this question is to read the paper Playing Atari with Deep Reinforcement Learning
(download it from https://arxiv.org/pdf/1312.5602v1.pdf) and answer the following questions.
Part a (10 points): Derive the gradient-based Q-learning algorithm given in the paper as Equation 3 (a sketch of the loss and its gradient appears after Part d below).
Part b (5 points): What is the difference between the deep learning approach to Atari game playing and the earlier feedforward neural net Q-learning technique for playing backgammon?
Part c (5 points): Why does DeepMind use experience replay in their Atari game learner?
Part d (5 points): Comment on the graphs shown in Figure 2. What does each curve show, and are there differences in the performance on Breakout vs. Seaquest?
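A starting point for Part a (a sketch in the paper's notation; check it against Equations 2 and 3 of the PDF) is the sequence of squared-error losses built from the bootstrapped target y_i, where ρ(·) is the behaviour distribution, E is the emulator, and the previous parameters θ_{i−1} are held fixed:

    L_i(\theta_i) = \mathbb{E}_{s,a\sim\rho(\cdot)}\bigl[(y_i - Q(s,a;\theta_i))^2\bigr],
    \qquad
    y_i = \mathbb{E}_{s'\sim\mathcal{E}}\bigl[r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \,\big|\, s,a\bigr].

Differentiating with respect to θ_i while treating y_i as a constant, and absorbing the constant factor into the gradient step, yields the TD-error-times-gradient form reported as Equation 3:

    \nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a\sim\rho(\cdot);\,s'\sim\mathcal{E}}\Bigl[\bigl(r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i)\bigr)\,\nabla_{\theta_i} Q(s,a;\theta_i)\Bigr].

The stochastic-gradient version used in the algorithm replaces these expectations with single sampled transitions (s, a, r, s').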
Question 3: Programming Project (50 points)
This question asks you to test a deep reinforcement learner that trains a simple 2D robot on a spatial task requiring it to navigate through the environment and pick up rewarding objects. The code is written entirely in JavaScript and can be run in a browser from the web site https://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html. The goal of this task is to experiment with various parameters that specify the architecture of the deep network as well as the learning algorithm used (Q-learning). The parameters for this task are given as:
var opt = {};  // options object for the ConvNetJS deep Q-learning agent
// SGD trainer options for the value network.
var tdtrainer_options = {learning_rate:0.001, momentum:0.0, batch_size:64, l2_decay:0.01};
opt.experience_size = 30000;        // size of the experience replay memory
opt.start_learn_threshold = 1000;   // number of stored transitions before learning begins
opt.gamma = 0.7;                    // discount factor
opt.learning_steps_total = 200000;  // steps over which epsilon is annealed down to epsilon_min
opt.learning_steps_burnin = 3000;   // initial steps during which actions are chosen at random
opt.epsilon_min = 0.05;             // exploration rate at the end of annealing
opt.epsilon_test_time = 0.05;       // exploration rate used at test time
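If you prefer to modify the demo's source directly rather than its in-page options box, the values above are wired into the ConvNetJS deep Q-learning agent roughly as follows. This is a minimal sketch: num_inputs, num_actions, and the hidden-layer sizes are placeholders, not the demo's exact values.

// Sketch: building a deepqlearn.Brain from the options above (requires convnetjs.js and deepqlearn.js).
var layer_defs = [];
layer_defs.push({type:'input', out_sx:1, out_sy:1, out_depth:num_inputs}); // num_inputs: placeholder state size
layer_defs.push({type:'fc', num_neurons:50, activation:'relu'});
layer_defs.push({type:'fc', num_neurons:50, activation:'relu'});
layer_defs.push({type:'regression', num_neurons:num_actions});            // one Q-value output per action
opt.layer_defs = layer_defs;
opt.tdtrainer_options = tdtrainer_options;

var brain = new deepqlearn.Brain(num_inputs, num_actions, opt);

// Typical agent loop: choose an action, act in the world, then learn from the observed reward.
var action = brain.forward(state_array); // epsilon-greedy action for the current state vector
// ... apply the action in the environment and observe the reward ...
brain.backward(reward);                  // store the transition and perform a TD update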
Part a (10 points): Vary the discount factor γ from 0.7 to 0.99, and measure the improvement or lack of improvement as a function of the discount factor. You can measure performance using the average reward. (A sketch of one way to organize these parameter sweeps appears after Part d.)
Part b (10 points): Vary the learning rate and momentum, and plot the improvement or lack of improvement as a function of these parameters. You can measure performance using the average reward.
Part c (10 points): Vary the experience size, and measure the improvement or lack of improvement as a function of the experience size. You can measure performance using the average reward.
Part d (10 points): Show examples of learned weights from your trained neural network.
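For Parts a through c, one way to organize the sweeps is sketched below. The helpers reloadDemoWithOptions, runForSteps, and getAverageReward are hypothetical placeholders for however you restart the demo and read off its average-reward plot (in practice you can simply edit the options textarea on the demo page, press its reload button, let the agent train, and record the reported average reward):

// Hypothetical harness for the discount-factor sweep of Part a; adapt the same
// pattern for learning rate/momentum (Part b) and experience size (Part c).
var gammas = [0.7, 0.8, 0.9, 0.95, 0.99];
var results = [];
gammas.forEach(function(g) {
  opt.gamma = g;
  reloadDemoWithOptions(opt);            // hypothetical: restart the agent with the new options
  runForSteps(opt.learning_steps_total); // hypothetical: let training run to completion
  results.push({gamma: g, avg_reward: getAverageReward()}); // hypothetical readout of average reward
});
console.log(results); // tabulate or plot average reward vs. gamma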