FIT 5226辅导、辅导Python编程语言
- 首页 >> Algorithm 算法 (2022) Assignment 2
Reinforcement Learning
Due midnight, September 11, 2022
This assignment is for a total of 25 marks and accounts for 15% of your final mark.
A note on assignments and tutorials: During assignment periods tutorials are structured to give you
the maximum support for solving the assignment tasks. We will discuss the assignment in full detail
and guide you in solving similar tasks to give you a solid grounding for your own work. You can also
discuss your questions and ideas for the assignments with your tutors. Of course, there are limitations
- tutors can’t give you solutions but they can vet your ideas and approaches and they can help you to
overcome difficulties with coding.
The tutorial in Week 6 will lay the groundwork for defining an MDP and for solving it using an appropri-
ate library.
There are three bonus marks for the implementation part. Bonus questions are optional and can help
you make up for a shortfall in other tasks in this or other assignments. You can reach 25 marks (i.e.
100% of the assignment) without the bonus question and bonus marks will be added to the total of all
your assignment results.
You can submit solutions in Python or in Mathematica and your code must be submitted in the form of
a Jupyter Notebook or a Mathematica Notebook. If you are using Python you should use PyMDPtool-
box . We are only using value iteration to solve the MDP in this assignment. PyMDPtoolbox supports
this directly so that you will not have to implement this yourself. Note that q-learning appears to be
broken in the current version of PyMDPtoolbox. You will not have to use q-learning in this assignment.
If you prefer to do so, you are allowed to use any other MDP solver of your choice (at your own risk).
1. James Bot 007 [25 marks]
In the following you will define an MDP for a dynamic grid world and solve it using a predefined
MDP solver.
MI6 has decided that James Bond’s habit of fast cars and luxury is becoming too expensive: it’s time
to replace him with a robot. James Bond’s trusty old colleague Q, the inventor of the wonder
weapons arsenal, has been charged with the task of devising this robot, code-named James Bot.
Since it is expected that James Bot will have to face a laser chamber protecting a secret vault in its
first assignment, Q has decided to tackle this part of the problem first.
A laser chamber is a room that is protected by a laser-based security system. The security system
fires random laser beams into the room and if one of the beams hits the intruder, the alarm system
is triggered. The agent has to traverse this room from a given entry point to the position of the vault
without triggering the security system (which would have very unpleasant consequences, given the
vault does, of course, belong to a super villain).
If you have never been in a laser chamber yourself or at least seen one, you can take inspiration
from this link to a video on YouTube. Of course, James Bot cannot even dream of achieving this
level of elegance.
Like any good engineer, Q decides to to develop the AI for the bot by first tackling the problem in a
drastically simplified form. To this end, he defines a grid world of size with an entry point in the
bottom-right corner and the vault in the top-left corner. James Bot has to start at the entry point
and reach the vault without being hit by a laser beam. For each time step, the bot is allowed to
make a single 1-step move (which can be any of north, south, west, east, or rest). After the bot has
made its move, a single laser beam is fired into the room. If this beam hits the bot it is considered
dead. This ends the episode. It’s a terminal outcome from which the robot cannot recover. If it’s not
hit it can make its next move. See below for an illustration of point in time as the bot moves
through the chamber.
Laser beams go in a straight line and can be fired along a column, a row, or any diagonal (not just
the main diagonals). We will work in a 3x3 grid world. To clarify, here are all diagonals that follow
the direction of the major main diagonal:
beam type 1 beam type 2 beam type 3
beam type 4 beam type 5 beam type 6
2 Assignment 2.nb
beam type 7 beam type 8 beam type 9 beam type 10 beam type 11
beam type 12 beam type 13 beam type 14 beam type 15 beam type 16
For the purpose of this exercise the bot will need to reach the target in at most 10 steps. Exactly 10
beams will be fired during this process, one after each step. Q decides to select the directions of the
ten beams randomly before the training and make them available to the program at the start of the
training.
Your task is to help Q implement a learning mechanism for James Bot that can master this laser
chamber.
Tasks
Design [9 marks]
1.1 [1 mark] Define the action set for the robot.
1.2 [2 mark] Q suggests to simply use the position of the bot as the state, which he has seen done in
many other grid worlds. You suggest instead to use a tuple of position and time step. Give the
reasons why Q’s idea cannot work but yours does.
1.3 [3 marks] Give the reward definition in plain language taking into account that our solver will
only be able to handle infinite episodes. How will you handle this given that the robot is only
allowed a fixed number of steps?
1.4 [3 marks] Write down the full state-transition matrix and the reward function for the laser
chamber problem given in the following. Note that If you first solve the programming parts below,
you can do this automatically. If you don’t write the code you can still solve this sub-task manually
instead.
To make it possible for you to do this manually we restrict the problem for this sub-task (and only
this subtask) to a 2x2 grid. Start and vault are still in the bottom--right and top-left corners, respec-
tively. We give the beams fired in the order of time steps graphically below.
Out[? ]=
fired after step 1 fired after step 2 fired after step 3 fired after step 4 fired after step 5
Assignment 2.nb 3
Implementation [12 marks]
Q has heard of MDPs and value iteration, so he asks you to implement it this way. Note that you
need to map the (position, timestep) state tuples to plain state numbers for this. Your program
must work for a 3x3 grid world. beams will be given as a list of beam type numbers as defined
above.
1.5 [2 mark] Define a function reward(state, move, beams) that returns the reward for a given move
in a given state (see Subtask 1.2 for the definition of states) and a given list of beams.
1.6 [2 mark] Definite a function nextState(state, move, beams) that returns the next state for a given
move in a given state and a given list of beams.
1.7 [3 mark] Define a function transitionMatrix() that generates the complete transition matrix for
this MDP as a nested list of lists indexed by states.
1.8 [3 mark] Define a function rewardMatrix() that generates the complete transition matrix for this
MDP as a nested list of lists indexed by states and actions.
1.9 [2 mark] Define function approach(beams) that takes a list of beams as the argument and uses
value iteration in the PyMDPtoolbox solver to learn an optimal way to navigate the grid for the given
set of beams, which will be passed as a list of beam type numbers as explained above. You function
should return the optimal list of actions the robot takes.
[for 3 bonus marks] Extend your program to produce an animation that clearly shows the grid, the
robot’s position and the ray fired for each time step.
Critique of Q’s Design Thinking [4 marks]
1.10 [2 mark] After you have dutifully implemented Q’s idea to use value-iteration for this problem
you realise that this is a fundamentally flawed idea in the first place. It works in Q’s example task
but it will never be able to be used in practice. The robot is doomed to fail on its first mission. Give
the reasons why this is a fundamentally flawed idea.
1.11 [2 mark] After considering your thoughts on the use of value-iteration, Q goes back to the
drawing board and attempts to fix the problem by inventing a new learning algorithm. He calls this
“Q-learning” in honor of himself. This happens to be the same q-learning you know from our
course. (Of course, we have slightly deviated from the truth, Q-Learning was invented by Chris
Watkins in 1989). Part of Q’s design thinking for q-learning is right, but he forgot something else…
Give the reasons why the new James Bot AI equipped with Q-learning will still fail miserably in a
real situation when trying to conquer a new laser chamber.
... and so James Bond will remain indispensable for another sequel.
4 Assignment 2.nb
Reinforcement Learning
Due midnight, September 11, 2022
This assignment is for a total of 25 marks and accounts for 15% of your final mark.
A note on assignments and tutorials: During assignment periods tutorials are structured to give you
the maximum support for solving the assignment tasks. We will discuss the assignment in full detail
and guide you in solving similar tasks to give you a solid grounding for your own work. You can also
discuss your questions and ideas for the assignments with your tutors. Of course, there are limitations
- tutors can’t give you solutions but they can vet your ideas and approaches and they can help you to
overcome difficulties with coding.
The tutorial in Week 6 will lay the groundwork for defining an MDP and for solving it using an appropri-
ate library.
There are three bonus marks for the implementation part. Bonus questions are optional and can help
you make up for a shortfall in other tasks in this or other assignments. You can reach 25 marks (i.e.
100% of the assignment) without the bonus question and bonus marks will be added to the total of all
your assignment results.
You can submit solutions in Python or in Mathematica and your code must be submitted in the form of
a Jupyter Notebook or a Mathematica Notebook. If you are using Python you should use PyMDPtool-
box . We are only using value iteration to solve the MDP in this assignment. PyMDPtoolbox supports
this directly so that you will not have to implement this yourself. Note that q-learning appears to be
broken in the current version of PyMDPtoolbox. You will not have to use q-learning in this assignment.
If you prefer to do so, you are allowed to use any other MDP solver of your choice (at your own risk).
1. James Bot 007 [25 marks]
In the following you will define an MDP for a dynamic grid world and solve it using a predefined
MDP solver.
MI6 has decided that James Bond’s habit of fast cars and luxury is becoming too expensive: it’s time
to replace him with a robot. James Bond’s trusty old colleague Q, the inventor of the wonder
weapons arsenal, has been charged with the task of devising this robot, code-named James Bot.
Since it is expected that James Bot will have to face a laser chamber protecting a secret vault in its
first assignment, Q has decided to tackle this part of the problem first.
A laser chamber is a room that is protected by a laser-based security system. The security system
fires random laser beams into the room and if one of the beams hits the intruder, the alarm system
is triggered. The agent has to traverse this room from a given entry point to the position of the vault
without triggering the security system (which would have very unpleasant consequences, given the
vault does, of course, belong to a super villain).
If you have never been in a laser chamber yourself or at least seen one, you can take inspiration
from this link to a video on YouTube. Of course, James Bot cannot even dream of achieving this
level of elegance.
Like any good engineer, Q decides to to develop the AI for the bot by first tackling the problem in a
drastically simplified form. To this end, he defines a grid world of size with an entry point in the
bottom-right corner and the vault in the top-left corner. James Bot has to start at the entry point
and reach the vault without being hit by a laser beam. For each time step, the bot is allowed to
make a single 1-step move (which can be any of north, south, west, east, or rest). After the bot has
made its move, a single laser beam is fired into the room. If this beam hits the bot it is considered
dead. This ends the episode. It’s a terminal outcome from which the robot cannot recover. If it’s not
hit it can make its next move. See below for an illustration of point in time as the bot moves
through the chamber.
Laser beams go in a straight line and can be fired along a column, a row, or any diagonal (not just
the main diagonals). We will work in a 3x3 grid world. To clarify, here are all diagonals that follow
the direction of the major main diagonal:
beam type 1 beam type 2 beam type 3
beam type 4 beam type 5 beam type 6
2 Assignment 2.nb
beam type 7 beam type 8 beam type 9 beam type 10 beam type 11
beam type 12 beam type 13 beam type 14 beam type 15 beam type 16
For the purpose of this exercise the bot will need to reach the target in at most 10 steps. Exactly 10
beams will be fired during this process, one after each step. Q decides to select the directions of the
ten beams randomly before the training and make them available to the program at the start of the
training.
Your task is to help Q implement a learning mechanism for James Bot that can master this laser
chamber.
Tasks
Design [9 marks]
1.1 [1 mark] Define the action set for the robot.
1.2 [2 mark] Q suggests to simply use the position of the bot as the state, which he has seen done in
many other grid worlds. You suggest instead to use a tuple of position and time step. Give the
reasons why Q’s idea cannot work but yours does.
1.3 [3 marks] Give the reward definition in plain language taking into account that our solver will
only be able to handle infinite episodes. How will you handle this given that the robot is only
allowed a fixed number of steps?
1.4 [3 marks] Write down the full state-transition matrix and the reward function for the laser
chamber problem given in the following. Note that If you first solve the programming parts below,
you can do this automatically. If you don’t write the code you can still solve this sub-task manually
instead.
To make it possible for you to do this manually we restrict the problem for this sub-task (and only
this subtask) to a 2x2 grid. Start and vault are still in the bottom--right and top-left corners, respec-
tively. We give the beams fired in the order of time steps graphically below.
Out[? ]=
fired after step 1 fired after step 2 fired after step 3 fired after step 4 fired after step 5
Assignment 2.nb 3
Implementation [12 marks]
Q has heard of MDPs and value iteration, so he asks you to implement it this way. Note that you
need to map the (position, timestep) state tuples to plain state numbers for this. Your program
must work for a 3x3 grid world. beams will be given as a list of beam type numbers as defined
above.
1.5 [2 mark] Define a function reward(state, move, beams) that returns the reward for a given move
in a given state (see Subtask 1.2 for the definition of states) and a given list of beams.
1.6 [2 mark] Definite a function nextState(state, move, beams) that returns the next state for a given
move in a given state and a given list of beams.
1.7 [3 mark] Define a function transitionMatrix() that generates the complete transition matrix for
this MDP as a nested list of lists indexed by states.
1.8 [3 mark] Define a function rewardMatrix() that generates the complete transition matrix for this
MDP as a nested list of lists indexed by states and actions.
1.9 [2 mark] Define function approach(beams) that takes a list of beams as the argument and uses
value iteration in the PyMDPtoolbox solver to learn an optimal way to navigate the grid for the given
set of beams, which will be passed as a list of beam type numbers as explained above. You function
should return the optimal list of actions the robot takes.
[for 3 bonus marks] Extend your program to produce an animation that clearly shows the grid, the
robot’s position and the ray fired for each time step.
Critique of Q’s Design Thinking [4 marks]
1.10 [2 mark] After you have dutifully implemented Q’s idea to use value-iteration for this problem
you realise that this is a fundamentally flawed idea in the first place. It works in Q’s example task
but it will never be able to be used in practice. The robot is doomed to fail on its first mission. Give
the reasons why this is a fundamentally flawed idea.
1.11 [2 mark] After considering your thoughts on the use of value-iteration, Q goes back to the
drawing board and attempts to fix the problem by inventing a new learning algorithm. He calls this
“Q-learning” in honor of himself. This happens to be the same q-learning you know from our
course. (Of course, we have slightly deviated from the truth, Q-Learning was invented by Chris
Watkins in 1989). Part of Q’s design thinking for q-learning is right, but he forgot something else…
Give the reasons why the new James Bot AI equipped with Q-learning will still fail miserably in a
real situation when trying to conquer a new laser chamber.
... and so James Bond will remain indispensable for another sequel.
4 Assignment 2.nb