3. Q-learning example
To illustrate the Q-learning algorithm, consider the simple deterministic environment shown in Figure 9.3.1. The environment has six states.
The rewards for the allowed transitions are also shown and are non-zero in only two cases: a transition into the Goal (G) state earns a +100 reward, while moving into the Hole (H) state incurs a -100 reward. Both are terminal states, so reaching either one ends an episode that began in the Start state:
Figure 9.3.1: Rewards in a simple deterministic world
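As a rough sketch of how such a world might be represented in code, the dictionary below maps each state and action to a next state and a reward. The state names, the grid layout, and the allowed moves are illustrative assumptions; only the +100 and -100 rewards for entering the Goal and Hole states come from the text.

```python
# Sketch of a six-state deterministic world.
# State names and the transition layout are assumptions for illustration;
# only the +100 (Goal) and -100 (Hole) rewards come from the text.
transitions = {
    # state: {action: (next_state, reward)}
    "Start": {"right": ("S1", 0),    "down": ("S2", 0)},
    "S1":    {"left": ("Start", 0),  "right": ("Goal", 100), "down": ("Hole", -100)},
    "S2":    {"up": ("Start", 0),    "right": ("S3", 0)},
    "S3":    {"left": ("S2", 0),     "up": ("Hole", -100),   "right": ("Goal", 100)},
}
terminal_states = {"Goal", "Hole"}  # reaching either state ends the episode
```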
To formalize the identity of each state, we use a (row, column) identifier as shown in Figure 9.3.2. Since the agent has not yet learned anything about its environment, the Q-table, also shown in Figure 9.3.2, is initialized to all zeros. In this example, a discount factor γ is used. Recall that in the estimate of the current Q value, the discount factor determines the weight of future Q values as a function of the number of steps, γ^k. In Equation 9.2.3, we only consider...
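A minimal sketch of the zero-initialized Q-table and of the deterministic Q-learning update discussed in the surrounding text, Q(s, a) = r + γ max_a' Q(s', a'), might look as follows. The number of actions and the value of γ below are assumptions chosen only for illustration.

```python
import numpy as np

# Sketch of a zero-initialized Q-table and the deterministic Q-learning
# update Q(s, a) = r + gamma * max_a' Q(s', a').  The state/action counts
# and the gamma value are illustrative assumptions, not taken from the text.
n_states, n_actions = 6, 4                 # six states, four moves
q_table = np.zeros((n_states, n_actions))  # agent starts with no knowledge
gamma = 0.9                                # assumed discount factor

def update_q(state, action, reward, next_state):
    """Apply one deterministic Q-learning update for a single transition."""
    q_table[state, action] = reward + gamma * np.max(q_table[next_state])
```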