Temporal difference (TD) learning is also a model-free learning algorithm, just like MC learning. You will recall that in MC learning the Q-function is updated only at the end of an entire episode (whether in first-visit or every-visit mode). The main advantage of TD learning is that it updates the Q-function after every step in an episode.
In this recipe, we will look into a popular TD method called Q-learning. Q-learning is an off-policy learning algorithm. It updates the Q-function based on the following equation:

Q(s, a) ← Q(s, a) + α * (r + γ * max_a' Q(s', a') − Q(s, a))
Here, s' is the resulting state after taking action a in state s; r is the associated reward; α is the learning rate; and γ is the discount factor. The term max_a' Q(s', a') means that the target policy is greedy: the highest Q-value among the actions available in state s' is used to compute the learning target. In Q-learning, actions are typically generated by a separate, exploratory behavior policy such as epsilon-greedy, which exploits the best-known action most of the time while occasionally exploring at random; this separation of behavior and target policies is what makes Q-learning off-policy.
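To make the per-step update concrete, here is a minimal sketch of one tabular Q-learning episode. It assumes a Gym-style environment interface (env.reset() and env.step(action) returning (next_state, reward, done, info)); the function and variable names (q_learning_episode, n_action, epsilon, and so on) are illustrative choices for this sketch rather than the recipe's final implementation.

    import random
    from collections import defaultdict

    def epsilon_greedy_action(Q, state, n_action, epsilon):
        """Behavior policy: explore with probability epsilon, otherwise exploit."""
        if random.random() < epsilon:
            return random.randrange(n_action)
        q_values = [Q[(state, a)] for a in range(n_action)]
        return q_values.index(max(q_values))

    def q_learning_episode(env, Q, n_action, alpha=0.4, gamma=1.0, epsilon=0.1):
        """Run one episode, updating Q after every step (TD), not at the end (MC)."""
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy_action(Q, state, n_action, epsilon)
            next_state, reward, done, _ = env.step(action)
            # Greedy target: the best Q-value over actions in the next state (off-policy).
            best_next = max(Q[(next_state, a)] for a in range(n_action))
            td_target = reward + gamma * best_next
            # Q(s, a) <- Q(s, a) + alpha * (td_target - Q(s, a))
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
        return Q

    # Example usage (assuming some discrete environment `env` with n_action actions):
    # Q = defaultdict(float)
    # for _ in range(1000):
    #     Q = q_learning_episode(env, Q, n_action)

Note how the update inside the loop uses the greedy maximum over next-state Q-values as the target, while the action actually executed comes from the epsilon-greedy behavior policy; this is the off-policy aspect described above.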