The architecture of asynchronous one-step SARSA is almost identical to that of asynchronous one-step Q-learning; the difference lies in how the target network computes the target state-action value for the current state. Instead of using the maximum Q-value of the next state s' given by the target network, SARSA chooses the action a' for the next state s' with an ε-greedy policy, and the Q-value of that next state-action pair, Q(s', a'; θ⁻), is used to calculate the target state-action value of the current state.
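The sketch below contrasts the two target computations under stated assumptions: `target_network(state)` is a hypothetical callable returning the target network's Q-values for all actions, and `gamma`, `epsilon`, and `n_actions` are illustrative parameters not taken from the source.

```python
import numpy as np

def q_learning_target(reward, next_state, gamma, target_network):
    # One-step Q-learning: bootstrap with the maximum Q-value in s'.
    return reward + gamma * np.max(target_network(next_state))

def sarsa_target(reward, next_state, gamma, target_network, epsilon, n_actions):
    # One-step SARSA: bootstrap with Q(s', a'; θ⁻), where a' is chosen
    # ε-greedily from the target network's Q-values for s'.
    q_values = target_network(next_state)
    if np.random.rand() < epsilon:
        next_action = np.random.randint(n_actions)   # explore: random action
    else:
        next_action = int(np.argmax(q_values))       # exploit: greedy action
    return reward + gamma * q_values[next_action]
```

The only change relative to Q-learning is the bootstrap term: SARSA evaluates the action the ε-greedy policy would actually take in s', rather than the maximizing action.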
The pseudo-code for asynchronous one-step SARSA is shown below. Here, the following are the global parameters:
- θ: the parameters (weights and biases) of the policy network
- θ⁻: the parameters (weights and biases) of the target network
- T: the overall time step counter
// Globally...