At this point, we understand that it is very useful for an agent to learn the state-value function $V(s)$, which informs the agent about the long-term value of being in state $s$ so that it can decide whether that state is a good one to be in. The Monte Carlo (MC) and Temporal Difference (TD) learning methods enable an agent to learn exactly that!
The goal of MC and TD learning is to learn the value function $V_\pi$ from the agent's experience as the agent follows its policy $\pi$.
The following table summarizes the state-value update equations for the MC and TD learning methods:
| Learning method | State-value function update |
| --- | --- |
| Monte Carlo | $V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$ |
| Temporal Difference | $V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$ |
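To make these update rules concrete, here is a minimal Python sketch of both updates for a tabular value estimate. The names (`mc_update`, `td_update`, `ALPHA`, `GAMMA`) and the `defaultdict` value table are illustrative assumptions, not from the text:

```python
from collections import defaultdict

ALPHA = 0.1   # step size (learning rate)
GAMMA = 0.99  # discount factor

V = defaultdict(float)  # tabular state-value estimates V(s), initialized to 0


def mc_update(V, state, G):
    # Monte Carlo: move V(S_t) towards the actual return G_t.
    V[state] += ALPHA * (G - V[state])


def td_update(V, state, reward, next_state):
    # TD(0): move V(S_t) towards the bootstrapped one-step target
    # R_{t+1} + gamma * V(S_{t+1}), available after a single step.
    target = reward + GAMMA * V[next_state]
    V[state] += ALPHA * (target - V[state])
```

Note that `td_update` can be applied after every environment step, whereas `mc_update` needs the completed return `G`, which ties it to finished episodes.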
MC learning updates the value $V(S_t)$ towards the actual return $G_t$, which is the total discounted reward from time step $t$: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$. This means that the agent must wait until the end of the episode before it can compute $G_t$ and update the value estimate. It is important to note that we...
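As a small illustration of this return computation, the following hypothetical helper computes $G_t$ for every step of a finished episode by working backwards with the recursion $G_t = R_{t+1} + \gamma G_{t+1}$; the function name and reward layout are assumptions for this sketch:

```python
def episode_returns(rewards, gamma=0.99):
    # rewards[t] holds R_{t+1}, the reward received after leaving S_t.
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))


# Three-step episode with rewards 1, 0, 2 and gamma = 0.5:
# G_2 = 2, G_1 = 0 + 0.5 * 2 = 1, G_0 = 1 + 0.5 * 1 = 1.5
assert episode_returns([1, 0, 2], gamma=0.5) == [1.5, 1.0, 2.0]
```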