Temporal difference (TD) methods are a class of model-free RL methods. On the one hand, they can learn from the agent's experience, like MC. On the other hand, they can estimate state values based on the values of other states, like DP. As usual, we'll explore the policy evaluation and improvement tasks.
Temporal difference methods
Policy evaluation
TD methods rely on the agent's experience for policy evaluation. But unlike MC, they don't have to wait until the end of an episode. Instead, they can update the value function after each step of the episode. In its most basic form, a TD algorithm uses the following formula to perform a state-value update:
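V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]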
Where α is called the step size (learning rate), and it's a value in the (0, 1] range that controls how far the current estimate V(s_t) moves toward the new TD target at each update.
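To make the update concrete, here is a minimal TD(0) policy-evaluation sketch in Python. The environment interface (a Gym-style reset()/step() that returns (next_state, reward, done, info)) and the policy callable are assumptions for illustration, not something prescribed by the text:

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate the state values V(s) of a fixed policy with TD(0).

    Assumes a Gym-style environment where env.reset() returns a state,
    env.step(action) returns (next_state, reward, done, info), and states
    are hashable so they can serve as dictionary keys.
    """
    V = defaultdict(float)  # state-value estimates, initialized to 0

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # TD target: immediate reward plus discounted value of the next state
            # (the next state's value is dropped when the episode terminates)
            td_target = reward + gamma * V[next_state] * (not done)
            # Move V(s) a fraction alpha of the way toward the TD target
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```

Note how V[state] is updated immediately after each step, before the episode finishes, unlike MC, which has to wait for the full return before updating anything.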