Reinforcement learning is an area of machine learning concerned with how software agents
ought to take actions in an environment so as to maximize some notion of cumulative reward.
This homework programming assignment consists of 2 parts.
In the first part, you are tasked
with solving a simple MDP using value iteration algorithm. In the second part, you are asked to
train a tic-tac-toe player using Q-learning.
Please use Python 3.6 or Python 3.7 to implement your homework assignment.
1. Homework Description
1.1 Task 1: 2D Maze (30 pts)
Your agent is trapped in a 2D Maze, and your are helping your agent to determine the policy to
get out of the maze.
There are obstacles in the maze that your agent should avoid. If your agent crashes into an
obstacle, 100 Health Points of your agent will be deducted (HP -100). Each time your agent
moves, 1 HP will also be deducted (HP -1). When your agent arrives at the destination, your
agent will receive 100 HP (HP +100). (Hint: think of HP as reward)
However, there is also uncertainty in the agent’s navigation. The agent will go in the correct
direction 70% of the time (10% in each other direction, including borders).
The goal is to compute the best policy given the maze using value iteration.
○ The lowercase letter x represents an obstacle
○ Period . represents destination
○ ^ v < > represents NORTH, SOUTH, WEST, EAST respectively
● Utilities here are just for demonstration, do NOT round in your algorithm
● It is guaranteed that there is NO tie in all test cases
Recall the bellman equation to calculate the expected utility starting in s and acting optimally:
And the value iteration algorithm:
Please use exact same values
● Discount factor gamma = 0.9
● Maximum error allowed epsilon = 0.1
● Initial conditions U0 = 0.0
Input and Output Format
The input file will be formatted as follows:
followed by number_of_obstacles lines of x, y #location of obstacles
x, y #location of destination
For the above mentioned example, the input is:
And your output should be :
NO blank line at the end of the file, \n for line break
You are given blackbox51.txt, blackbox52.txt for development, and your code will only be
graded with hidden test cases blackbox53.txt and blackbox54.txt
The program should be run in the following way:
python3 GridMDP.py input_file output_file
For example, when we grade with hidden blackbox53, we will run:
python3 GridMDP.py blackbox53.txt blackbox53-policy.txt
The score will be calculated as:
Correctness (Exact Same) blackbox53 (10pts) blackbox54 (20pts)
Correct 10 20
Wrong 0 0
Hint: Linux diff command or python filecmp.cmp() function can be used to compare two files
1.2 Task 2: Tic-Tac-Toe (70 pts)
In this part, you are asked to solve a real-life problem Tic-tac-toe. If you don’t remember it,
familiarize yourself with the rules (https://en.wikipedia.org/wiki/Tic-tac-toe )
You are given 6 files:
● Board.py: tic-tac-toe board, 3 by 3 gridx
[[1 1 0]
[0 0 0] —->
[0 2 2]]
● RandomPlayer.pyc, SmartPlayer.pyc, PerfectPlayer.pyc (cpython-3.6):
think of them as 3 blackboxes. You do not need to know how things work inside of each
player, but it may be helpful to know how they behave. See next section for more details
● QLearner.py: Q-Learning Player to be implemented
● TicTacToe.py: where all players will be called to play tic-tac-toe games and where your
QLearner will be trained and tested, i.e. grading script, simply run python3 TicTacToe.py
Your task is to complete QLearner.py. In QLearner.py, method move() and learn()
must be implemented, and the number of games GAME_NUM needed for training the Q-Learner
must be set. Any other helper functions can be added inside the QLearner class.
To see how these two methods and the variable GAME_NUM will be used, please see
TicTacToe.py for more details.
You are not supposed to modify Board.py, TicTacToe.py as well as other Players. The
only python file you need to edit is QLearner.py.
● RandomPlayer: moves randomly
● SmartPlayer: somehow better than RandomPlayer, but cannot beat PerfectPlayer
● PerfectPlayer: never lose
If we let these 3 players play against each other, game results will look like this:
Recall the formula:
Q(s,a) ← (1 – ⍺) Q(s,a) + ⍺ (R(s) + 𝛾 maxa’ Q(s’,a’))
You are encouraged to make any improvement to your Q-Learner in terms of speed and
performance to get prepared for the upcoming competition.
Hint: The rewards will only be assigned for the last action taken by the agent
You are free to choose values for all parameters, i.e. reward for WIN,DRAW,LOSE, learning rate,
discount factor and initial conditions.
This time, the grading script (TicTacToe.py) is given to you. After execution, the game results will
be printed out. See Figure 2.2
In summary section, you can see the Win/Draw rate against each opponent as well as the final
grade for task 2. The task 2 score is calculated as:
RandomPlayer (25pts) SmartPlayer (25pts) PerfectPlayer (20pts)
>= 95% 25 25 20
< 95% and >= 85% 15 * Rate 15 * Rate 10 * Rate
< 85% 0 0 0
If your player is not based on Q-Learning (e.g. rule-based), you will lose ALL points for task 2.
In your implementation, please do not use any existing machine learning library call. You
must implement the algorithm yourself. Please develop your code yourself and do not copy from
other students or from the Internet.
Your code will be autograded for technical correctness. Please name your file correctly, or you
will wreak havoc on the autograder. The maximum running time is 3 minutes for each task.
2.1 Submit your code to Vocareum
● Submit GridMDP.py, QLearner.py on Vocareum
● The program will terminate and fail if it exceeds the 3 minutes time limit.
● After the submission script/grading script finishes, you can view your submission report
immediately to see if your code works, while the grading report will be released after the
deadline of the homework.
● You don’t have to keep the page open while the scripts are running.
You do NOT need to submit your code or report to Blackboard. Your score on Vocareum will be
your final score. The submission instructions and examples are provided in Appendix A.
Appendix A: Vocareum Submission
This time your total score on Vocareum will be your final grade for assignment 5, so it is
important for you to fully understand how to submit your work and how final grades will be
There are 2 parts on Vocareum as well. Two parts are independent so that you can work on and
test them separately.
As usual, upload your GridMDP.py to the work directory.
After you submit the code, you will be able to see the submission report. The submission
script will only check blackbox51 and blackbox52.
Passed means you are correct on that blackbox
Otherwise, you will see Failed
After the due time, the grading script will check your code using blackbox53 and
blackbox54 to generate score for task1.
Only QLearner.py needs to be submitted.
All starter code (including Board.py, *Player.py and TicTacToe.py) are the SAME on
Vocareum, and you do not need to submit them.
If everything works fine, you should be able to see the submission report either in terminal or
under Details tab.
It is really important for you to know that the grade shown in the submission report is NOT your
final grade. To be fair, the grading script (exact SAME as submission script) will run
your Q-Learner.py again to generate the final score after the deadline.
After grades published, you can check your grades on Vocareum and the total score here will
be your final grade for this assignment.