In the previous chapter, we introduced the temporal difference strategy and discussed a simple example called TD(0). In the case of TD(0), the discounted reward is approximated using a one-step backup. Hence, if the agent performs the action $a_t$ in the state $s_t$ and the transition to the state $s_{t+1}$ is observed, the approximation becomes the following:
$$R_t \approx r_{t+1} + \gamma V(s_{t+1})$$
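To make the one-step backup concrete, here is a minimal sketch (not the book's implementation) of a tabular TD(0) value update in Python; the function name td0_update, the learning rate alpha, and the toy five-state value array are assumptions introduced only for illustration.

```python
import numpy as np

def td0_update(V, s_t, r_t1, s_t1, alpha=0.1, gamma=0.99):
    """One-step TD(0) backup: move V[s_t] toward the bootstrapped
    target r_{t+1} + gamma * V[s_{t+1}]."""
    target = r_t1 + gamma * V[s_t1]      # one-step approximation of R_t
    V[s_t] += alpha * (target - V[s_t])  # update immediately after the transition
    return V

# Toy usage: a tabular value function over 5 hypothetical states
V = np.zeros(5)
V = td0_update(V, s_t=0, r_t1=1.0, s_t1=1)
```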
If the task is episodic (as in many real-life scenarios) and has $T(e_i)$ steps, the complete backup for the episode $e_i$ is as follows:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T(e_i)-t-1} r_{T(e_i)} = \sum_{k=0}^{T(e_i)-t-1} \gamma^k r_{t+k+1}$$
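As a rough illustration of the complete backup, the sketch below accumulates the discounted return from the full reward sequence of an episode; the helper name discounted_return and the sample rewards are hypothetical, and the result only becomes available once the absorbing state has been reached.

```python
def discounted_return(rewards, gamma=0.99):
    """Exact discounted return R_t computed from every reward
    observed until the absorbing state (complete backup)."""
    R = 0.0
    # Accumulate backwards, using R_t = r_{t+1} + gamma * R_{t+1}
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Hypothetical rewards collected during a single episode e_i
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_return(episode_rewards))  # 0.99**2 = 0.9801
```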
The sum in the previous expression terminates when the MDP reaches an absorbing state; therefore, $R_t$ is the actual (not an approximated) value of the discounted reward. The difference between TD(0) and this choice is clear: in the first case, we can update the value function after each transition, whereas with a complete backup, we need to wait until the end of the episode. We can say that this method...