TD(0) – SARSA and Q-Learning
TD methods are model-free, meaning they do not need a model of the environment to learn a state-value representation. For a given policy, π, they accumulate the experience generated by following it and update their estimate of the value function for every state encountered during that experience. In doing so, TD methods update the value of a state visited at time t using the value of the state (or states) encountered at the following time steps, t+1, t+2, ..., t+n. An abstract example is as follows: an agent is initialized in the environment and starts interacting with it by following a given policy, without any knowledge of which results are produced by which actions. After a certain number of steps, the agent eventually reaches a state associated with a reward. This reward signal is used to increment the values of the previously visited states (or state-action pairs) using the TD learning rule. In fact, those states have allowed the agent...