The TD(0) algorithm
One of the problems with dynamic programming algorithms is that they require full knowledge of the environment in terms of states and transition probabilities. Unfortunately, in many cases this information is not available before direct experience. In particular, the states can be discovered by letting the agent explore the environment, but estimating the transition probabilities would require counting the transitions into each state, which is often impossible. Moreover, an environment with absorbing states can prevent the agent from visiting many states once it has learned a good initial policy. For example, in a game, which can be described as an episodic MDP, the agent discovers the environment while learning how to move forward without ending up in a negative absorbing state.
A general solution to these problems is provided by a different evaluation strategy, called Temporal Difference (TD) RL. In this case, we start with an empty value matrix and we let the agent interact with the environment, updating the value of each visited state after every transition using only the observed reward and the current estimate of the next state's value.
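In its simplest form, TD(0), the update rule is V(s_t) ← V(s_t) + α[r_{t+1} + γV(s_{t+1}) − V(s_t)], where α is a learning rate and γ is the discount factor. The following is a minimal sketch of this idea (not a definitive implementation); it assumes a hypothetical episodic environment exposing a Gym-style reset()/step() interface and a fixed policy function to evaluate, neither of which is defined in this section.

import numpy as np

def td0_evaluation(env, policy, n_states, episodes=500,
                   alpha=0.1, gamma=0.99):
    # Start with an empty (all-zero) value array, one entry per state
    V = np.zeros(n_states)
    for _ in range(episodes):
        # env and policy are assumed: env.reset() returns an initial
        # state index; policy(state) returns an action
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            # env.step() is assumed to return (next_state, reward, done)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(s) toward the bootstrapped target
            # r + gamma * V(s'); at a terminal state the target is just r
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V

Note that, unlike dynamic programming, this procedure never references the transition probabilities: the bootstrapped target is built entirely from sampled transitions, which is why TD methods can be applied when the model of the environment is unknown.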