In this chapter, we presented the natural evolution of TD(0), based on a weighted average of backups with different lengths. The resulting algorithm, called TD(λ), is extremely powerful, and it ensures faster convergence than TD(0), subject to only a few (non-restrictive) conditions. We also showed how to implement the Actor-Critic method with TD(0), in order to learn both a stochastic policy and a value function.
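To make the weighted average explicit, the TD(λ) target can be written as the λ-return (shown here in a common notation, which may differ slightly from the symbols used earlier in the chapter):

R_t^{(\lambda)} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}

where R_t^{(n)} denotes the n-step backup. Setting λ = 0 recovers the single-step TD(0) backup, while λ → 1 approaches the full Monte Carlo return.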
In the later sections, we discussed two methods based on the estimation of the Q function: SARSA and Q-learning. They are very similar, but the latter takes a greedy approach, and its performance (in particular, its training speed) is superior to that of SARSA. The Q-learning algorithm is also one of the most important building blocks for the latest developments. In fact, it was the first RL approach employed with a Deep Convolutional Network to solve complex environments (like...
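The following minimal tabular sketch (with hypothetical function names, not the chapter's own code) highlights the only structural difference between the two updates: SARSA bootstraps on the action actually chosen in the next state, whereas Q-learning bootstraps on the greedy (maximum) value.

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: uses the action actually selected in s_next
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Greedy target: uses the maximum Q value in s_next,
    # regardless of the action the behavior policy will take there
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# Toy usage: a table with 5 states and 2 actions
Q = np.zeros((5, 2))
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)

Because the Q-learning target ignores the behavior policy's next action, the algorithm is off-policy, which is precisely what made it suitable for later deep variants trained from replayed experience.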