Cliff walking example of on-policy and off-policy TD control
A cliff walking grid-world example is used to compare SARSA and Q-learning and to highlight the difference between on-policy (SARSA) and off-policy (Q-learning) methods. This is a standard undiscounted, episodic task with a start state and a goal state, in which the agent may move in four directions (north, south, east, and west). A reward of -1 is given on all transitions except those into the region marked The Cliff; stepping into this region penalizes the agent with a reward of -100 and sends it instantly back to the start position.
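Before the full example, the environment's dynamics can be sketched with a minimal transition function. In the sketch below, the 4 x 12 grid size, the state encoding, and the function name are illustrative assumptions rather than part of the book's snippet:

# Sketch of the cliff-walking dynamics (illustrative, not the book's code)
GRID_ROWS, GRID_COLS = 4, 12
START = (3, 0)     # bottom-left corner
GOAL = (3, 11)     # bottom-right corner
CLIFF = {(3, col) for col in range(1, 11)}   # cells between start and goal

ACTIONS = {'north': (-1, 0), 'south': (1, 0),
           'west': (0, -1), 'east': (0, 1)}

def step(state, action):
    """Apply one move and return (next_state, reward)."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    # Clip the move so the agent cannot leave the grid
    next_state = (min(max(row + d_row, 0), GRID_ROWS - 1),
                  min(max(col + d_col, 0), GRID_COLS - 1))
    if next_state in CLIFF:
        # Stepping into The Cliff: reward of -100 and back to the start
        return START, -100
    # Every other transition yields a reward of -1
    return next_state, -1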
The following code snippets are inspired by Shangtong Zhang's Python code for RL and are published in this book with permission from Shangtong Zhang, a student of Richard S. Sutton, the author of the famous Reinforcement Learning: An Introduction (details are provided in the Further reading section):
# Cliff-Walking - TD learning - SARSA & Q-learning
>>> from __future__ import print_function
...
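The crux of the comparison is a single line in each algorithm's update rule. The following is a minimal sketch of that difference, assuming the Q-table is a NumPy array of shape (rows, cols, num_actions) indexed by (row, col) state tuples; the function names and the default step size are illustrative assumptions, not the book's code:

# Sketch of the two TD update rules (illustrative, not the book's code)
import numpy as np

def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.5):
    # On-policy: bootstrap from the action the epsilon-greedy behaviour
    # policy actually selects in the next state (gamma = 1, undiscounted)
    target = reward + q[next_state][next_action]
    q[state][action] += alpha * (target - q[state][action])

def q_learning_update(q, state, action, reward, next_state, alpha=0.5):
    # Off-policy: bootstrap from the greedy (max-valued) action in the
    # next state, regardless of what the behaviour policy will do next
    target = reward + np.max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])

Because SARSA's target includes the exploratory actions of its own epsilon-greedy policy, it learns the longer but safer path away from the cliff, whereas Q-learning's greedy target converges to the optimal path along the cliff edge at the cost of occasionally falling off during training.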