Summary
This chapter dealt with temporal difference learning. We started by studying one-step methods in both their on-policy and off-policy implementations, which led us to the SARSA and Q-learning algorithms, respectively. We tested these algorithms on the FrozenLake-v0 problem under both deterministic and stochastic transition dynamics. We then moved on to N-step temporal difference methods, the first step toward the unification of TD and MC methods, and saw how on-policy and off-policy methods extend to this case. Finally, we studied TD methods with eligibility traces, which constitute the most significant step toward a unified theory encompassing both TD and MC algorithms. We also extended SARSA to use eligibility traces and explored this through two exercises in which the algorithm was implemented and applied to the FrozenLake-v0 environment under both deterministic and stochastic transition dynamics. With this, we have been...
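To recap the update rules at the core of the chapter, the following is a minimal sketch (not the chapter's full implementation) contrasting the one-step SARSA and Q-learning updates and the trace-based SARSA variant. The function names, step size `alpha`, discount `gamma`, trace decay `lam`, and the tabular `Q` and `E` arrays are assumptions made for illustration only.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy one-step update: the bootstrap uses the action actually
    selected by the behavior policy in the next state."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy one-step update: the bootstrap uses the greedy value of the
    next state, regardless of the action the behavior policy takes next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_lambda_update(Q, E, s, a, r, s_next, a_next,
                        alpha=0.1, gamma=0.99, lam=0.9):
    """SARSA with accumulating eligibility traces: every recently visited
    state-action pair receives a share of the current TD error."""
    E[s, a] += 1.0                         # bump the trace for the current pair
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q += alpha * td_error * E              # credit all eligible pairs at once
    E *= gamma * lam                       # decay all traces
```

Setting `lam=0` makes the trace-based update collapse to one-step SARSA, while larger values push the behavior toward MC-style full-return updates, which is the unifying idea the chapter builds toward.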