On-Policy TD Control – SARSA
The algorithm for on-policy TD control – SARSA is given as follows (a short Python sketch follows the list):
- Initialize the Q function Q(s, a) with random values
- For each episode:
  - Initialize the state s
  - Extract a policy from Q(s, a) and select an action a to perform in the state s
  - For each step in the episode:
    - Perform the action a, move to the new state s', and observe the reward r
    - In the state s', select the action a' using the epsilon-greedy policy
    - Update the Q value: Q(s, a) = Q(s, a) + α(r + γ Q(s', a') - Q(s, a))
    - Update s = s' and a = a' (that is, the next state-action pair (s', a') becomes the current state-action pair (s, a))
    - If s is not the terminal state, repeat these steps; otherwise, end the episode
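
To make these steps concrete, here is a minimal tabular SARSA sketch in Python. The 1-D chain environment, the `step` and `epsilon_greedy` helpers, and the hyperparameter values (alpha, gamma, epsilon, episode count) are illustrative assumptions, not part of the algorithm above; only the update rule and the on-policy control flow follow the listed steps.

```python
import random
from collections import defaultdict

ALPHA = 0.1       # learning rate (illustrative value)
GAMMA = 0.99      # discount factor (illustrative value)
EPSILON = 0.1     # exploration rate (illustrative value)
N_EPISODES = 500

N_STATES = 6      # hypothetical 1-D chain: states 0..5, state 5 is terminal
ACTIONS = [0, 1]  # 0 = move left, 1 = move right

def step(s, a):
    """Hypothetical environment: move along the chain; reward 1 at the goal."""
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

def epsilon_greedy(Q, s):
    """Explore with probability EPSILON, otherwise act greedily w.r.t. Q."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

# Initialize Q(s, a) with small random values, as in the first step above
Q = defaultdict(lambda: random.uniform(-0.01, 0.01))

for _ in range(N_EPISODES):
    s = 0                          # initialize the state s
    a = epsilon_greedy(Q, s)       # select the first action a from the policy
    while s != N_STATES - 1:       # loop until s is the terminal state
        s_next, r = step(s, a)               # perform a, observe s' and r
        a_next = epsilon_greedy(Q, s_next)   # choose a' in s' (on-policy)
        # SARSA update: Q(s, a) = Q(s, a) + alpha * (r + gamma * Q(s', a') - Q(s, a))
        Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next      # s = s', a = a'

print("Greedy policy:", [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])
```

Because a' is drawn from the same epsilon-greedy policy that is being improved, the update evaluates the policy the agent actually follows; swapping the choice of a' for a max over actions would turn this into off-policy Q-learning.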