## Description

Problem 1: Value Iteration

In this problem, you will perform the value iteration updates manually on a very basic game

just to solidify your intuitions about solving MDPs. The set of possible states in this game

is -2, -1, 0, 1, 2. You start at state 0, and if you reach either -2 or 2, the game ends. At each

state, you can take one of two actions: -1, +1.

If you’re in state s and choose -1:

You have an 80% chance of reaching the state s1.

You have a 20% chance of reaching the state s+1.

If you’re in state s and choose +1:

You have a 30% chance of reaching the state s+1.

You have a 70% chance of reaching the state s1.

If your action results in transitioning to state -2, then you receive a reward of 20. If your

action results in transitioning to state 2, then your reward is 100. Otherwise, your reward

is -5. Assume the discount factor is 1.

(a) Give the value of Vopt(s) for each state s after 0, 1, and 2 iterations of value iteration.

Iteration 0 just initializes all the values of V to 0. Terminal states do not have any

optimal policies and take on a value of 0.

Using the following equation for determining the optimal value if take action a from

state s:

Qopt(s, a) = P

s

0 T(s, a, s0

)(Reward(s, a, s0

) + γVopt(s

0

))

Also, optimal value from state s:

Vopt = 0 if reach the end state

Vopt = maxa⊂Actions(s)Qopt(s, a) for everything else

1

Therefore: Vopt(−2) = Vopt(2) = 0

(b) What is the resulting optimal policy πopt for all non-terminal states?

Problem 2: Transforming MDPs

Let’s implement value iteration to compute the optimal policy on an arbitrary MDP. Later,

we’ll create the specific MDP for Blackjack.

(a) See modified counterexampleMDP

(b) Suppose we have an acyclic MDP. We could run value iteration, which would require

multiple iterations. Briefly explain a more efficient algorithm that only requires one

pass over all the (s, a, s0

) triples

(c) Suppose we have an MDP with states States a discount factor γ < 1, but we have an MDP solver that only can solve MDPs with discount 1. How can leverage the MDP solver to solve the original MDP? Problem 4: Learning to Play Blackjack So far, we’ve seen how MDP algorithms can take an MDP which describes the full dynamics of the game and return an optimal policy. But suppose you go into a casino, and no one tells you the rewards or the transitions. We will see how reinforcement learning can allow you to play the game and learn the rules at the same time! (a) See code (b) Call simulate using your algorithm and the identityFeatureExtractor() on the MDP smallMDP, with 30000 trials. Compare the policy learned in this case to the policy learned by value iteration. Don’t forget to set the explorationProb of your Q-learning algorithm to 0 after learning the policy. How do the two policies compare (i.e., for how many states do they produce a different action)? Now run simulate() on largeMDP. How does the policy learned in this case compare to the policy learned by value iteration? What went wrong? c) See Code (d) Now let’s explore the way in which value iteration responds to a change in the rules of the MDP. Run value iteration on originalMDP to compute an optimal policy. Then apply your policy to newThresholdMDP by calling simulate with FixedRLAlgorithm, instantiated using your computed policy. What reward do you get? What happens if you run Q learning on newThresholdMDP instead? Explain.