In this chapter, we presented the natural evolution of TD(0), based on an average of backups with different lengths. The resulting algorithm, called TD(λ), is extremely powerful and converges faster than TD(0) under only a few (non-restrictive) conditions. We also showed how to implement the Actor-Critic method with TD(0) in order to learn both a stochastic policy and a value function.
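As a reminder of how averaging over backups of different lengths works in practice, the following is a minimal sketch of one episode of tabular TD(λ) in its backward view, with accumulating eligibility traces. The function name `td_lambda_episode`, the three-state chain, and the hyperparameter values are illustrative assumptions, not the implementation used in the chapter.

```python
import numpy as np


def td_lambda_episode(values, transitions, alpha=0.1, gamma=0.9, lam=0.8):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    transitions is a list of (state, reward, next_state, done) tuples.
    Returns an updated copy of the value table.
    """
    values = values.copy()
    traces = np.zeros_like(values)
    for state, reward, next_state, done in transitions:
        # One-step TD target; terminal states contribute no bootstrap value
        target = reward + (0.0 if done else gamma * values[next_state])
        td_error = target - values[state]
        # Decay all traces, then mark the state that was just visited
        traces *= gamma * lam
        traces[state] += 1.0
        # Every recently visited state receives a share of the TD error
        values += alpha * td_error * traces
    return values


# Hypothetical chain 0 -> 1 -> 2 (terminal), reward 1.0 on the last step
episode = [(0, 0.0, 1, False), (1, 1.0, 2, True)]
v0 = np.zeros(3)

v_td0 = td_lambda_episode(v0, episode, lam=0.0)  # equivalent to TD(0)
v_tdl = td_lambda_episode(v0, episode, lam=0.8)
```

With `lam=0.0`, only state 1 (the one immediately preceding the reward) is updated after this episode, whereas with `lam=0.8` state 0 also receives credit through its decayed trace. This is one intuitive way to see why TD(λ) can propagate information faster than TD(0).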
In later sections, we discussed two methods based on the estimation of the Q function: SARSA and Q-learning. They are very similar, but the latter takes a greedy approach (its update bootstraps from the maximum Q value available in the next state), and its performance, in particular its training speed, makes it superior to SARSA. The Q-learning algorithm is one of the most important building blocks of the latest developments in the field. In fact, it was the first RL approach employed with a deep convolutional network to solve complex environments (like Atari games). For this reason, we also presented a simple example based on an MLP that processes visual input and outputs...