TD learning algorithms are based on reducing the differences between estimates made by the agent at different times. Q-learning, which we will discuss in the following section, is a TD algorithm, but it uses only the difference between estimates at immediately adjacent time steps. TD in general is broader and may also compare estimates for states that are several time steps apart.
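To make the idea of "reducing the difference between estimates" concrete, here is a minimal sketch of a one-step TD update and of an n-step target that looks further ahead. The function names, the dictionary-based value table, and the values chosen for the step size `alpha` and the discount factor `gamma` are assumptions made for illustration, not definitions taken from the text.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one one-step TD update to the value table V in place.

    Returns the TD error, i.e. the gap between the new estimate
    (reward + gamma * V[next_state]) and the old estimate V[state].
    """
    td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]
    V[state] += alpha * td_error  # move the old estimate toward the target
    return td_error


def n_step_target(rewards, bootstrap_value, gamma=0.9):
    """Target built from n observed rewards plus a bootstrapped value
    for the state reached after those n steps."""
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    return target + (gamma ** len(rewards)) * bootstrap_value


# Hypothetical two-state value table used only for this example.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", reward=0.5, next_state="B")
two_step = n_step_target([1.0, 1.0], bootstrap_value=2.0, gamma=0.5)
```

With a single reward, `n_step_target` reduces to the one-step target used inside `td0_update`; longer reward lists correspond to comparing estimates that are further apart in time.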
TD combines ideas from the MC method and from DP, which can be summarized as follows:
- MC methods solve reinforcement learning problems by averaging the returns obtained from sampled experience
- DP represents a set of algorithms that can be used to calculate an optimal policy given a perfect model of the environment in the form of a Markov Decision Process (MDP)
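The two summaries above can be contrasted on a toy problem. The sketch below is illustrative only: the state names, the episode data, and the model format (a list of `(probability, next_state, reward)` tuples per state) are assumptions, and the DP part performs only policy evaluation for a fixed policy, which is one building block of DP rather than a full computation of an optimal policy.

```python
from collections import defaultdict


def mc_first_visit(episodes, gamma=1.0):
    """MC: estimate V(s) as the average return observed after the
    first visit to s in each sampled episode. No model is needed."""
    returns = defaultdict(list)
    for episode in episodes:  # episode: list of (state, reward) pairs
        G = 0.0
        first_return = {}
        # Walk backwards so G accumulates the discounted return from each step;
        # the last write for a state corresponds to its earliest visit.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_return[state] = G
        for state, g in first_return.items():
            returns[state].append(g)
    return {s: sum(g) / len(g) for s, g in returns.items()}


def dp_policy_evaluation(model, gamma=1.0, sweeps=100):
    """DP: compute V(s) by repeated expected backups over a known model.
    model[s] is a list of (probability, next_state, reward) tuples."""
    V = {s: 0.0 for s in model}
    for _ in range(sweeps):
        for s, transitions in model.items():
            V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in transitions)
    return V


# Hypothetical chain A -> B -> T (terminal); leaving B pays 1 or 0 with equal probability.
episodes = [[("A", 0.0), ("B", 1.0)], [("A", 0.0), ("B", 0.0)]]
model = {"A": [(1.0, "B", 0.0)],
         "B": [(0.5, "T", 1.0), (0.5, "T", 0.0)],
         "T": []}

v_mc = mc_first_visit(episodes)
v_dp = dp_policy_evaluation(model)
```

MC reaches its estimate from sampled episodes alone, while DP needs the transition probabilities and rewards in `model` but no sampled experience.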
The following can be said of TD methods:
- They inherit from MC methods the idea of learning directly from experience accumulated...