CMPUT366 Project p1: Blackjack (in Python)





In this assignment you will gain experience programming a learning agent and its
interaction with an MDP. We will use a blackjack MDP similar to that described in
Examples 5.1 and 5.3 in the book.
One difference is that the dealer policy has changed: instead of sticking on 17 or higher,
the dealer now sticks on 16 or higher. The average-return numbers given later in
this document are for the original dealer, so your results should differ slightly.
Another difference is that we introduce a special additional state corresponding to the
beginning of a blackjack game. There is only one action available in that state, and it
results in a transition to one of the regular states treated in the book. We include this
initial state so that you can learn its value, which will be the overall value for playing the
game. The initial state is numbered 0, and the regular states are numbered 1-180. (All
actions are the same in state 0.)
There are 180 regular states because they correspond to a 3-tuple: (playerSum,
dealerShowing, usableAce), with the player sum between 12 and 20, the dealer
showing between 1 and 10, and usableAce being binary. That is, a second difference
from the blackjack MDP in the book is that here the player sum portion of the state is
never 21. On 21 the player is assumed to stick and is not even given a chance to
choose an action, just as on player sums less than 12 the player is assumed to hit.
Thus, an episode’s state sequence consists of 0 followed by some states numbered
between 1 and 180, then finally the terminal state represented as a state numbered –1.
The two actions, permitted in all non-terminal states, are 0 for Stick and 1 for Hit.
Your task is to implement the one-step Expected Sarsa algorithm applied to this
problem. Basically, you have to make a Python implementation of the boxed algorithm in
Figure 6.10 (Q-learning) of the Sutton and Barto textbook and then modify it to change
from Q-learning to Expected Sarsa.
We provide the blackjack MDP in the form of a single file (in the dropbox),
which you will download and place in the directory in which you are working. Do not
change this file. You should then have access to the three functions:
• blackjack.init(), which takes no arguments and returns the initial state (0).
This method starts a new game of blackjack
• blackjack.sample(S,A) –> (R,S’), which returns a tuple of two integers, the
first of which is the sample reward and the second of which is the sample next
state, from nonterminal state S given that action A is taken. Arrival in the terminal
state is indicated by the next state being –1. In our version of blackjack, there are
exactly two actions (0 and 1, for Stick and Hit) possible in all nonterminal states.
• blackjack.printPolicy(policy), which takes a deterministic function from
{1,…,180} to {0,1} specifying the action to take in each non-terminal state and
prints out a representation of the corresponding policy.
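The three-function interface above can be exercised with a short episode loop. Since the actual blackjack module is only available from the course dropbox, the sketch below substitutes a hypothetical stand-in with the same signatures (its dynamics are made up for illustration; only the interface matches the real module):

```python
import random

# Hypothetical stand-in for the provided blackjack module: same interface
# (init, sample, printPolicy), but with invented dynamics.
class FakeBlackjack:
    def init(self):
        # Start a new game; the initial state is always 0.
        return 0

    def sample(self, S, A):
        # Return (reward, next_state); next state -1 means terminal.
        if A == 0:                       # Stick: game ends with loss/draw/win.
            return (random.choice([-1, 0, 1]), -1)
        if random.random() < 0.3:        # Hit: sometimes bust...
            return (-1, -1)
        return (0, random.randint(1, 180))  # ...otherwise a regular state.

    def printPolicy(self, policy):
        # Print the action for each regular state, 20 states per row.
        for s in range(1, 181):
            print(policy(s), end="\n" if s % 20 == 0 else " ")

blackjack = FakeBlackjack()

def run_episode(policy):
    """Run one episode under `policy`; return the undiscounted return."""
    S = blackjack.init()
    G = 0
    while S != -1:
        A = policy(S)
        R, S = blackjack.sample(S, A)
        G += R
    return G

G = run_episode(lambda s: random.randint(0, 1))  # equiprobable-random policy
```

With the real module in place of the stand-in, the same `run_episode` loop is all that part 1 needs.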
Here are some Python hints that may be useful in doing this project: 1) there are
functions available called randint, max, and argmax; 2) you can make a 10×10 2-
dimensional array X of small random numbers with X = 0.00001*rand(10,10); 3) you
can assign variables x, y, and z to the parts of a tuple by x,y,z = tuple, where tuple is a
tuple of three elements; and 4) there is nothing wrong with global variables and simply
putting your main code in the file to be executed without bundling it up into a function.
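The hints above can be illustrated in a few lines, assuming `rand` and `argmax` come from numpy (the handout does not say which library they are from, so that is an assumption here):

```python
import numpy as np

# Hint 2: a 10x10 array of small random values (assuming numpy.random.rand).
X = 0.00001 * np.random.rand(10, 10)

# Hint 1: argmax over one row picks the greedy action for that state.
greedy_action = int(np.argmax(X[3]))

# Hint 3: unpack a tuple directly into variables, e.g. a (reward, next
# state) pair like the one sample(S, A) returns.
R, S_next = (0, 17)
```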
The assignment has three parts:
1. First implement the equiprobable-random policy, run a number of episodes, and observe
the returns from the initial state (assuming γ=1). These returns should all be either -1,
0, or +1, with an average of about -0.3. If they are not, then you are probably
doing something wrong. Create this code by modifying the provided file.
2. Now copy your file to produce a new file. In it,
implement Expected Sarsa with both the behavior and target policies ε-greedy, with
two epsilons, εμ and επ. Set α=0.001 and initialize the action values to
small random values. As a first check, simply run for many episodes with εμ = 1.0 and
measure the average return as you did in part 1. Obviously, you should get the same
average return as in part 1. Now set εμ = επ = 0.01 and run for perhaps 1,000,000
episodes observing the average return every 10,000 episodes, which should increase
over episodes as learning progresses up to about -0.085 or better. Of course, this is
the performance of the επ-greedy policy and you should be able to do better if you
deterministically follow your learned policy. After learning, print out your learned policy
using printPolicy. One way to assess the quality of your policy is how similar it
looks to that given in the textbook. A better way is to try it. Run your deterministic
learned policy (the greedy policy with εμ = 0) for an additional 1,000,000 (or
10,000,000 if you are going for the extra credit) episodes without learning (e.g., with
α=0) and report the average return.
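The one change from Q-learning is the backup target: instead of the max over next-state action values, Expected Sarsa uses their expectation under the target policy. A minimal sketch of one update, assuming an ε-greedy target policy over the two actions (the function and variable names are my own, not from the provided module):

```python
import numpy as np

N_STATES, N_ACTIONS = 181, 2   # states 0..180; actions 0 (Stick), 1 (Hit)
alpha, eps_pi = 0.001, 0.01

# Small random initial action values, as the assignment suggests.
Q = 0.00001 * np.random.rand(N_STATES, N_ACTIONS)

def expected_q(Q, s, eps):
    """Expected value of Q[s, .] under the eps-greedy target policy."""
    greedy = np.argmax(Q[s])
    probs = np.full(N_ACTIONS, eps / N_ACTIONS)  # exploration share
    probs[greedy] += 1.0 - eps                   # greedy share
    return float(np.dot(probs, Q[s]))

def update(S, A, R, S_next):
    """One Expected Sarsa backup; the terminal next state (-1) has value 0."""
    target = R if S_next == -1 else R + expected_q(Q, S_next, eps_pi)
    Q[S, A] += alpha * (target - Q[S, A])
```

Replacing `expected_q(Q, S_next, eps_pi)` with `np.max(Q[S_next])` would recover the Q-learning update from Figure 6.10.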
3. Now experiment with the settings of α, εμ, επ, and the number of episodes to find a
setting that reliably produces a better policy than that obtained in part 2. Report the
settings, the final policy (by printPolicy) and the reliable performance level
obtained by running deterministically as described at the end of part 2.
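One simple way to organize the part-3 search is a loop over candidate settings, keeping whichever scores best under deterministic evaluation. The sketch below uses a made-up stand-in for `train_and_evaluate` so the loop is runnable; in the real code that function would run Expected Sarsa with the given settings and then measure the average return of the greedy policy without learning:

```python
import itertools

def train_and_evaluate(alpha, eps_mu, eps_pi, n_episodes):
    # Stand-in with an invented score surface, for illustration only.
    # The real version would train and then evaluate deterministically.
    return -abs(alpha - 0.005) - abs(eps_mu - 0.05) - abs(eps_pi - 0.01)

best = None
for alpha, eps_mu, eps_pi in itertools.product(
        [0.001, 0.005, 0.01],   # step sizes to try
        [0.01, 0.05, 0.1],      # behavior-policy epsilons
        [0.0, 0.01]):           # target-policy epsilons
    score = train_and_evaluate(alpha, eps_mu, eps_pi, n_episodes=1_000_000)
    if best is None or score > best[0]:
        best = (score, alpha, eps_mu, eps_pi)
```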
You should turn in a zip file through eClass that contains three files. The first two are
your code files from parts 1 and 2. The third is a PDF file, results.pdf, with
a) the final policy and its deterministic performance level from part 2, and b) the best
setting (values of α, εμ, επ, and the number of episodes) that you were able to find in
part 3, along with the policy (by printPolicy) and its performance level (by running
deterministically as described at the end of part 2).
Extra credit will be given to the three students or teams that find the best policies by the
due date. First, second, and third place will receive extra credit equal to 10%, 7%, and
4% of the points available on this project. To be eligible for the extra credit, you must run
your final policy for 10,000,000 episodes, then compute and report the average return. If
there are ties, then something creative will be done.