Description
Question 1 [40 Points] (Multi-armed Bandit) In this programming question, we implement a simple
environment for the basic multi-armed bandit example of Chapter 1. Consider the example in Chapter
1 with a slight modification: we have two companies, namely A and B, that can be selected on each
day. We plan to work for T days, and on each day we decide to work at one of the companies. Company
i ∈ {A, B} pays a random payment R_i at the end of the day, described by the following formula:
R_i = max{50, X_i},
where X_i ∼ N(µ_i, σ_i²). We intend to implement these companies in a class called environment().
1. Write a class that models the behavior of this environment. As attributes, this class gets the mean
values and standard deviations of companies A and B, i.e., muA, sigA, muB, sigB, with mu and
sig referring to mean and standard deviation, respectively. The class further has a method that
takes a string company, which is either 'A' or 'B', as input and returns a random payment for
that company.
class environment():
    def __init__(self, muA, sigA, muB, sigB):
        pass

    def pay(self, company):
        # company == 'A' or 'B'
        pass
You can implement this part by completing the corresponding part in the assignment’s notebook.
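For reference, here is a minimal sketch of one possible completion, assuming the payments are drawn
with numpy (the attribute name params is our own choice):

import numpy as np

class environment():
    def __init__(self, muA, sigA, muB, sigB):
        # store the mean and standard deviation of each company
        self.params = {'A': (muA, sigA), 'B': (muB, sigB)}

    def pay(self, company):
        # company == 'A' or 'B'
        mu, sig = self.params[company]
        # X_i ~ N(mu_i, sig_i^2); the payment is floored at 50
        return max(50, np.random.normal(mu, sig))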
Next, we implement two policies:
• Random Policy: On each day, we choose one of the companies uniformly at random.
• ϵ-greedy Policy: For the first ⌊ϵT⌋ days we follow the random policy. At the end of this training period,
we look at the average payment of each company. We then proceed as follows:
(i) Starting on day ⌊ϵT⌋ + 1, we choose the company whose average payment over the first ⌊ϵT⌋ days
was higher.
(ii) At the end of each day, we update the average payment of the company that we worked at on
that day.
(iii) On the next day, we choose the company whose average payment up to that day, i.e., after the
update of step (ii), is higher.
In this assignment we set ϵ = 0.05.
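The averages in steps (ii) and (iii) can be maintained with an incremental sample-mean update, which
avoids recomputing the mean from scratch every day. A minimal sketch, assuming dictionaries avg and
count (both names are our own) that track each company's running average payment and number of visits:

def update_average(avg, count, company, r):
    # incremental sample mean: new_avg = avg + (r - avg) / n
    count[company] += 1
    avg[company] += (r - avg[company]) / count[company]

This is an O(1)-per-day update and is equivalent to averaging all payments received from that company so far.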
2. Write a function that takes the number of days NumDays (an integer) and the policy Policy (either
the string 'Random' or the string 'Greedy') as input and returns two lists of size NumDays:
(a) a list Payment whose entry i = 0, …, NumDays-1 denotes the average payment received by the
agent up to day i+1.
(b) a list Record whose entry i = 0, …, NumDays-1 denotes the company at which the agent
worked on day i+1. To represent this list with integers, denote company A by 1 and
company B by 0 in the list.
The function uses the environment implemented in Part 1, with some values for the means and standard
deviations, to generate the payment of each day.
def AvgPay(NumDays, Policy):
    # env = environment(some values for means and standard deviations)
    pass
    # return Payment, Record
You can implement this part by completing the corresponding part in the assignment’s notebook.
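One possible completion, as a sketch: it assumes ϵ = 0.05 and the environment parameters from Part 3,
and the helper names avg, count, and explore are our own:

import numpy as np

def AvgPay(NumDays, Policy, eps=0.05):
    env = environment(600, 100, 500, 200)   # parameter values from Part 3
    avg = {'A': 0.0, 'B': 0.0}              # running average payment per company
    count = {'A': 0, 'B': 0}                # days worked at each company
    Payment, Record = [], []
    total = 0.0
    explore = int(np.floor(eps * NumDays))  # length of the training period
    for day in range(NumDays):
        if Policy == 'Random' or day < explore:
            company = np.random.choice(['A', 'B'])
        else:
            # greedy choice: the company with the higher average payment so far
            company = 'A' if avg['A'] >= avg['B'] else 'B'
        r = env.pay(company)
        count[company] += 1
        avg[company] += (r - avg[company]) / count[company]
        total += r
        Payment.append(total / (day + 1))   # average payment up to day (day+1)
        Record.append(1 if company == 'A' else 0)
    return Payment, Record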
3. Set the values for the environment as follows:
muA, sigA, muB, sigB = 600, 100, 500, 200
Execute your code for each policy with NumDays=1000 (recall that ϵ = 0.05 in our implementation),
and plot both Payment and Record against the number of days.
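A plotting sketch using matplotlib (the figure layout is our own choice):

import matplotlib.pyplot as plt

for policy in ['Random', 'Greedy']:
    Payment, Record = AvgPay(1000, policy)
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
    ax1.plot(Payment)
    ax1.set_ylabel('average payment')
    ax2.plot(Record, '.')
    ax2.set_xlabel('day')
    ax2.set_ylabel('company (A = 1, B = 0)')
    fig.suptitle(policy + ' policy')
    plt.show()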
4. Explain your observations.
Question 2 [20 Points] (Playing around with Gymnasium) In this assignment, we play with the Gymnasium library. If you have not yet installed Gymnasium, read the detailed documentation, including
installation instructions, at github.com/Farama-Foundation/Gymnasium.
In the assignment's notebook, code is included which plays the deterministic Frozen Lake game with
(a) a random policy and (b) the optimal policy. Use this code as a reference to answer the following items:
1. Make an environment for the Cart Pole setting. You could check the documentation given by the
Gymnasium library at gymnasium.farama.org/environments/classic_control/cart_pole/.
2. Play this game using a random policy which chooses an action at each time step uniformly at
random (a sketch of such a loop is given after this list).
3. Make an environment for the Cliff Walking game. You could check the documentation given by
the Gymnasium library at gymnasium.farama.org/environments/toy_text/cliff_walking/.
4. Play the Cliff Walking game using a random policy which chooses an action at each time step
uniformly at random.
5. Play the Cliff Walking game once again, this time using the optimal policy you found in Question
4 of Section 1.
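For reference, a minimal sketch of a random-policy episode loop in Gymnasium, shown for Cart Pole;
replacing the environment id with 'CliffWalking-v0' gives the corresponding loop for item 4 (the
environment ids follow the Gymnasium documentation linked above):

import gymnasium as gym

env = gym.make('CartPole-v1')
obs, info = env.reset()
terminated, truncated = False, False
total_reward = 0.0
while not (terminated or truncated):
    action = env.action_space.sample()  # uniform random action
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
env.close()
print('episode return:', total_reward)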

