TD(λ) algorithm
In the previous chapter, we introduced the temporal difference strategy, and we discussed a simple example called TD(0). In the case of TD(0), the discounted reward is approximated by using a one-step backup.
Hence, if the agent performs an action a_t in the state s_t, and the transition to the state s_{t+1} is observed, the approximation becomes the following:
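In standard notation (assuming r_{t+1} is the reward observed after the transition, γ is the discount factor, and V is the current value estimate), this one-step approximation can be written as:

$$
R_t \approx r_{t+1} + \gamma V(s_{t+1})
$$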
If the task is episodic (as in many real-life scenarios) and has T(e_i) steps, the complete backup for the episode e_i is as follows:
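Under the same assumptions, summing the discounted rewards collected until the terminal step T(e_i) yields:

$$
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots + \gamma^{T(e_i)-t-1} r_{T(e_i)}
$$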
The previous expression ends when the MDP reaches an absorbing state; therefore, R_t is the actual value of the discounted reward. The difference between TD(0) and this choice is clear: in the first case, we can update the value function after each transition, whereas with a complete backup, we need to wait for the end of the episode. We can say that this method (which is called Monte Carlo, because it's based on the idea of averaging the overall...
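As a minimal sketch of this difference (assuming a tabular value function V stored as a NumPy array, a discount factor gamma, a constant learning rate alpha instead of exact averaging, and an episode given as a list of (state, reward, next_state) transitions; the names are illustrative and not taken from the chapter), the two update rules can be compared as follows:

```python
import numpy as np

def td0_update(V, transition, alpha=0.1, gamma=0.99):
    # TD(0): update right after a single observed transition (one-step backup)
    s, r, s_next = transition
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])
    return V

def monte_carlo_update(V, episode, alpha=0.1, gamma=0.99):
    # Complete backup: wait for the end of the episode, then use the
    # actual discounted reward R_t observed from each visited state
    G = 0.0
    for s, r, _ in reversed(episode):
        G = r + gamma * G          # accumulate the discounted reward backwards
        V[s] += alpha * (G - V[s])
    return V

# Illustrative usage with a tiny tabular value function (5 states, state 3 absorbing)
V = np.zeros(5)
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)]   # (state, reward, next_state)

for transition in episode:
    V = td0_update(V, transition)       # TD(0): online updates during the episode

V = monte_carlo_update(V, episode)      # Monte Carlo: single update at episode end
```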