In the previous recipe, we solved the Mountain Car problem using the off-policy Q-learning algorithm. Now, we will do so with the on-policy State-Action-Reward-State-Action (SARSA) algorithm (the FA version, of course).
In general, the SARSA algorithm updates the Q-function based on the following equation:
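$$Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma Q(s', a') - Q(s, a) \big)$$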
Here, s' is the resulting state after taking action, a, in state s; r is the associated reward; α is the learning rate; and γ is the discount factor. To update the Q-value, we simply pick the next action, a', by also following the epsilon-greedy policy, and this action, a', is then actually taken in the next step. Accordingly, SARSA with FA has the following error term:
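$$\delta = r + \gamma Q(s', a') - Q(s, a)$$

Here, the Q-values are computed by the function approximator rather than read from a lookup table.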
Our learning goal is to reduce the error term to zero, which means that the estimated Q(s, a) should satisfy the following equation:
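$$Q(s, a) = r + \gamma Q(s', a')$$

To make the update concrete, here is a minimal sketch of one SARSA episode with linear function approximation. It is an illustration rather than the estimator we will build in this recipe: featurize(state, action) is a hypothetical feature-extraction function, Q(s, a) is approximated as the dot product of its output with a weight vector, and the environment is assumed to follow the classic gym API (reset() returning an observation, step() returning a 4-tuple).

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Pick an action epsilon-greedily from a list of Q-value estimates."""
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def sarsa_fa_episode(env, featurize, weights, alpha=0.01, gamma=0.99, epsilon=0.1):
    """Run one episode of SARSA with a linear function approximator.

    featurize(state, action) is assumed to return a fixed-length feature
    vector, so Q(s, a) is approximated as featurize(state, action) @ weights.
    """
    n_actions = env.action_space.n
    state = env.reset()
    # Choose the first action with the epsilon-greedy behavior policy
    q = [featurize(state, a) @ weights for a in range(n_actions)]
    action = epsilon_greedy(q, epsilon)
    done = False
    while not done:
        next_state, reward, done, _ = env.step(action)
        features = featurize(state, action)
        q_sa = features @ weights
        if done:
            td_target = reward
        else:
            # Pick a' with the same epsilon-greedy policy; it appears in the
            # TD target and is also executed at the next step (on-policy)
            next_q = [featurize(next_state, a) @ weights for a in range(n_actions)]
            next_action = epsilon_greedy(next_q, epsilon)
            td_target = reward + gamma * next_q[next_action]
        # SARSA error term: delta = r + gamma * Q(s', a') - Q(s, a)
        td_error = td_target - q_sa
        # For a linear model, the gradient of Q(s, a) with respect to the
        # weights is the feature vector, so the update is alpha * delta * features
        weights = weights + alpha * td_error * features
        if not done:
            state, action = next_state, next_action
    return weights
```

After each step, the weight vector is nudged in the direction that shrinks the error term, which is exactly the minimization goal described above.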