You will recall that Q-learning is an off-policy TD learning algorithm. In this recipe, we will solve an MDP with an on-policy TD learning algorithm called State-Action-Reward-State-Action (SARSA).
Similar to Q-learning, SARSA focuses on state-action values. It updates the Q-function based on the following equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma Q(s', a') - Q(s, a) \right)$$
Here, s' is the resulting state after taking action a in state s; r is the associated reward; α is the learning rate; and γ is the discount factor. You will recall that in Q-learning, the Q-value is updated using the greedy target, $\max_{a'} Q(s', a')$, regardless of which action the behavior policy actually takes next. In SARSA, we instead pick the next action, a', by following the same epsilon-greedy policy, and use $Q(s', a')$ to update the Q-value; the action a' is then actually taken in the next step. Hence, SARSA is an on-policy algorithm.
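To make the update concrete, the following is a minimal tabular SARSA sketch, not the full recipe implementation. It assumes a classic Gym-style environment whose step() returns (next_state, reward, done, info), and the hyperparameter defaults are illustrative placeholders:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, epsilon):
    """With probability epsilon pick a random action; otherwise pick greedily."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])

def sarsa(env, n_episodes, alpha=0.4, gamma=1.0, epsilon=0.1):
    """Tabular SARSA: the action used in the update target, a', comes from
    the same epsilon-greedy policy that generates behavior (on-policy)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(n_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            # Choose a' with the same epsilon-greedy policy (not a greedy max)
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
            # SARSA update: the target uses Q(s', a'); zero it at episode end
            td_target = reward + gamma * Q[next_state][next_action] * (not done)
            Q[state][action] += alpha * (td_target - Q[state][action])
            # a' is actually executed in the next step, making SARSA on-policy
            state, action = next_state, next_action
    return Q
```

Note that replacing the next_action line with a greedy max over Q[next_state] would turn this into Q-learning; the only structural difference is where a' comes from.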