Temporal-difference learning
Q-Learning is a special case of a more general method called Temporal-Difference Learning or TD-Learning. More specifically, it is a special case of one-step TD-Learning, TD(0):
Q(s, a) = Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) (Equation 9.5.1)
In the equation, \alpha is the learning rate. We should note that when \alpha = 1, Equation 9.5.1 is similar to the Bellman equation. For simplicity, we'll refer to Equation 9.5.1 as Q-Learning or generalized Q-Learning.
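To make the update concrete, the following is a minimal sketch of the tabular Q-Learning update in Equation 9.5.1. The dictionary-backed Q table, the number of actions, and the function name q_learning_update are assumptions for illustration, not part of the original text:

```python
import numpy as np
from collections import defaultdict

# Hypothetical tabular Q function: Q[state] is a vector of action values.
n_actions = 4
Q = defaultdict(lambda: np.zeros(n_actions))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step TD(0) / Q-Learning update (Equation 9.5.1).

    The TD target bootstraps on the greedy (max) action value at the
    next state, which is why Q-Learning is off-policy.
    """
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return Q
```

In practice, s, a, r, and s_next would come from interacting with an environment while following an exploratory policy such as epsilon-greedy.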
Previously, we referred to Q-Learning as an off-policy RL algorithm since it learns the Q value function without directly using the policy that it is trying to optimize. An example of an on-policy one-step TD-Learning algorithm is SARSA, which is similar to Equation 9.5.1:
Q(s, a) = Q(s, a) + \alpha \left( r + \gamma Q(s', a') - Q(s, a) \right) (Equation 9.5.2)
The main difference is the use of the policy that is being optimized to determine a'. The terms s, a, r, s', and a' (thus the name SARSA) must be known to update the Q value function at every iteration. Both Q-Learning and SARSA use existing estimates of the Q value in their updates, a process known as bootstrapping.
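For comparison, here is a similar sketch of the on-policy SARSA update in Equation 9.5.2, assuming the same hypothetical tabular Q structure as above; the key difference is that the action a' actually chosen at the next state by the policy being optimized is used in the TD target, rather than the greedy max used by Q-Learning:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One-step SARSA update (Equation 9.5.2).

    a_next is the action selected at s_next by the policy being
    optimized (e.g. epsilon-greedy), which makes the update on-policy.
    """
    td_target = r + gamma * Q[s_next][a_next]  # bootstrap on the chosen action
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return Q
```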