The Relationship between DP, Monte-Carlo, and TD Learning
From what we've learned in this chapter, and as we've noted several times, temporal difference learning shares characteristics with both Monte Carlo methods and dynamic programming. Like the former, it learns directly from experience, without relying on a model of the environment's transition dynamics or on knowledge of the reward function involved in the task. Like the latter, it bootstraps: it updates the value function estimate partly on the basis of other estimates, so it does not need to wait until the end of an episode. This point matters in practice, where episodes can be very long (or even infinite), making MC methods slow or outright impractical. This close relationship plays a central role in reinforcement learning theory.
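To make the contrast concrete, here is a minimal sketch in Python (not taken from the chapter's code; the names td0_update, mc_update, alpha, and gamma are illustrative assumptions). It shows how a TD(0) update can be applied after every single step by bootstrapping from the current estimate of the next state's value, whereas a Monte Carlo update must wait until the episode terminates so that the full return can be computed.

from collections import defaultdict

alpha = 0.1   # step size (illustrative value)
gamma = 0.99  # discount factor (illustrative value)

def td0_update(V, s, r, s_next, done):
    """TD(0): bootstrap from the current estimate V[s_next].
    Can be applied online, after every single step."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])

def mc_update(V, episode):
    """Monte Carlo: needs the complete episode to compute returns.
    `episode` is an ordered list of (state, reward) pairs."""
    G = 0.0
    for s, r in reversed(episode):
        G = r + gamma * G           # return from this state onward
        V[s] += alpha * (G - V[s])  # every-visit MC update

# Toy example: a three-step episode ending with reward 1.
V = defaultdict(float)
episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]

# TD(0) updates V step by step, as the transitions are observed:
td0_update(V, "s0", 0.0, "s1", done=False)
td0_update(V, "s1", 0.0, "s2", done=False)
td0_update(V, "s2", 1.0, None, done=True)

# MC, by contrast, only updates once the episode has terminated:
mc_update(V, episode)

The key difference is visible in the function signatures: the TD update consumes one transition at a time, while the MC update consumes the whole episode, which is exactly why TD remains usable when episodes are very long or never end.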
We have also learned about n-step methods and eligibility traces, two distinct but closely related topics...