5. Temporal-difference learning
Q-learning is a special case of a more generalized TD learning, TD($\lambda$). More specifically, it is a special case of one-step TD learning, TD(0):
$$Q(s, a) = Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right] \quad \text{(Equation 9.5.1)}$$
where $\alpha$ is the learning rate. Note that when $\alpha = 1$, Equation 9.5.1 is similar to the Bellman equation. For simplicity, we also refer to Equation 9.5.1 as Q-learning, or generalized Q-learning.
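To make the update concrete, the following is a minimal sketch of the tabular TD(0) Q-learning update in Equation 9.5.1. The table sizes and hyperparameter values are illustrative assumptions, not values taken from the text.

```python
import numpy as np

# Assumed table sizes and hyperparameters for illustration only.
n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9   # learning rate and discount factor

def q_learning_update(q_table, s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    # Off-policy: the bootstrapped target takes the max over next actions,
    # regardless of which action the behavior policy actually takes in s_next.
    td_target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])
    return q_table
```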
Previously, we referred to Q-learning as an off-policy RL algorithm since it learns the Q value function without directly using the policy that it is trying to optimize. An example of an on-policy one-step TD-learning algorithm is SARSA, which is similar to Equation 9.5.1:
$$Q(s, a) = Q(s, a) + \alpha\left[r + \gamma Q(s', a') - Q(s, a)\right]$$
The main difference is the use of the policy that is being optimized to determine $a'$. The terms $s$, $a$, $r$, $s'$, and $a'$ (thus the name SARSA) must be known to update the Q value function at every iteration. Both Q-learning and SARSA use existing estimates in the Q value iteration, a process known as bootstrapping...