Temporal-difference learning
The first class of methods for solving MDPs that we covered in this chapter was DP, which
- Requires complete knowledge of the environment dynamics to find the optimal solution,
- Allows us to progress toward the solution with one-step updates of the value functions.
We then covered MC methods, which
- Only require the ability to sample from the environment, and therefore learn from experience rather than from a known model of the environment dynamics - a huge advantage over DP,
- But need to wait for a complete episode trajectory before updating the policy.
Temporal-difference (TD) methods are, in some sense, the best of both worlds: They learn from experience, and they can update the policy after each step by bootstrapping. This comparison of TD to DP and MC is illustrated in Table 5.2.
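To make the per-step, bootstrapped update concrete, here is a minimal TD(0) prediction sketch, not code from this chapter: it assumes a Gymnasium-style environment whose `step()` returns `(next_state, reward, terminated, truncated, info)`, discrete hashable states, and a `policy` callable that maps a state to an action; all names and defaults are illustrative.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Estimate the state-value function V for a fixed policy with one-step TD updates."""
    V = defaultdict(float)  # value estimates, initialized to 0 for unseen states
    for _ in range(num_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Bootstrapped TD target: immediate reward plus the discounted
            # current estimate of the next state's value (0 if terminal)
            td_target = reward + gamma * V[next_state] * (not terminated)
            # Update the current state's value immediately after this single step,
            # without waiting for the episode to finish as MC methods would
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```

Note how each update uses only one sampled transition and the existing estimate `V[next_state]`, which is exactly the combination of learning from experience (as in MC) and one-step bootstrapping (as in DP) described above.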
As a result, TD methods are central to RL, and you will encounter them...