Q-Learning example
To illustrate the Q-Learning algorithm, let us consider a simple deterministic environment, as shown in the following figure. The environment has six states. The rewards for the allowed transitions are shown; the reward is non-zero in only two cases: moving into the Goal (G) state earns a +100 reward, while moving into the Hole (H) state earns a -100 reward. These two states are terminal; reaching either one from the Start state ends an episode:
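To make the setup concrete, the following is a minimal sketch of such an environment in Python. Since the figure is not reproduced here, the 2 x 3 grid shape, the positions of Start, G, and H, the four action names, and the `step` helper are all assumptions for illustration, not the book's exact layout:

```python
# A hypothetical 2x3 deterministic grid world with six states.
# States are (row, column) pairs; G and H positions are assumed.
N_ROWS, N_COLS = 2, 3
N_STATES = N_ROWS * N_COLS
ACTIONS = ["left", "down", "right", "up"]

START = (0, 0)  # assumed Start position
GOAL = (0, 2)   # assumed G position; moving into G earns +100
HOLE = (1, 2)   # assumed H position; moving into H earns -100

def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    row, col = state
    if action == "left":
        col = max(col - 1, 0)
    elif action == "right":
        col = min(col + 1, N_COLS - 1)
    elif action == "up":
        row = max(row - 1, 0)
    elif action == "down":
        row = min(row + 1, N_ROWS - 1)
    next_state = (row, col)
    if next_state == GOAL:
        return next_state, 100.0, True   # terminal state, episode ends
    if next_state == HOLE:
        return next_state, -100.0, True  # terminal state, episode ends
    return next_state, 0.0, False        # all other transitions: zero reward
```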
To formalize the identity of each state, we use a (row, column) identifier, as shown in the following figure. Since the agent has not yet learned anything about its environment, the Q-Table, also shown in the following figure, is initialized to zero. In this example, the discount factor is $\gamma = 0.9$. Recall that in the estimate of the current Q value, the discount factor determines the weight of a future Q value as a function of the number of steps ahead, $\gamma^k$. In Equation 9.2.3, we only consider the immediate future Q value, that is, $k = 1$.
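Continuing from the environment sketch above, the following shows how the zero-initialized Q-Table and the update of Equation 9.2.3, $Q(s, a) = r + \gamma \max_{a'} Q(s', a')$, could look in code. The `to_index` and `q_update` helpers and the random-action episode loop are illustrative assumptions:

```python
import random
import numpy as np

GAMMA = 0.9  # discount factor used in this example

# Q-Table: one row per state, one column per action, all zeros initially.
q_table = np.zeros((N_STATES, len(ACTIONS)))

def to_index(state):
    """Flatten a (row, column) state identifier into a table row index."""
    row, col = state
    return row * N_COLS + col

def q_update(state, action, reward, next_state, done):
    """Apply Equation 9.2.3: Q(s, a) = r + gamma * max_a' Q(s', a')."""
    s, a = to_index(state), ACTIONS.index(action)
    # A terminal transition has no future Q value to bootstrap from.
    future = 0.0 if done else np.max(q_table[to_index(next_state)])
    q_table[s, a] = reward + GAMMA * future

# One episode from Start, exploring with random actions.
state, done = START, False
while not done:
    action = random.choice(ACTIONS)
    next_state, reward, done = step(state, action)
    q_update(state, action, reward, next_state, done)
    state = next_state
```

Because the environment is deterministic, the update can overwrite the table entry directly rather than blend it in with a learning rate; repeated episodes propagate the +100 and -100 terminal rewards backward through the table, discounted by $\gamma$ at each step.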