Asynchronous one-step SARSA
The architecture of asynchronous one-step SARSA is nearly identical to that of asynchronous one-step Q-learning; the only difference is how the target network computes the target state-action value for the current state. Instead of using the maximum Q-value over the next state s' from the target network, SARSA uses an ε-greedy policy to choose the action a' for the next state s', and the Q-value of that next state-action pair, Q(s', a'; θ⁻), is used to calculate the target state-action value of the current state.
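The difference comes down to a single line when computing the target. Below is a minimal sketch of that computation, assuming `next_q_values` holds Q(s', ·; θ⁻) produced by the target network for s'; the function name and arguments are illustrative, not the book's code. One-step Q-learning would take `np.max(next_q_values)`, whereas SARSA indexes with the ε-greedy action a' chosen for s':

```python
import numpy as np

def one_step_target(reward, next_q_values, next_action, done, gamma=0.99):
    """Target value for the current (s, a) pair under one-step SARSA.

    next_q_values: Q(s', ·; θ⁻) from the target network for the next state s'.
    next_action:   a', chosen for s' by the ε-greedy behaviour policy.
    One-step Q-learning would instead use np.max(next_q_values).
    """
    if done:
        return reward                              # terminal s': target is just r
    return reward + gamma * next_q_values[next_action]
```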
The pseudo-code for asynchronous one-step SARSA is shown below. Here, the following are the global parameters:
- θ: the parameters (weights and biases) of the policy network
- θ⁻: the parameters (weights and biases) of the target network
- T: the overall time step counter
// Globally shared parameters θ, θ⁻, and T
// θ is initialized arbitrarily
// T is initialized 0

Pseudo-code for each learner running in parallel in each of the threads:

Initialize thread level time step counter t = 0
...
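To make the structure of each learner thread concrete, the following is a hedged sketch of the loop in Python. It simplifies the policy and target networks to shared Q-tables (NumPy arrays) so it stays self-contained; the names `sarsa_worker`, `shared`, `update_every`, and `sync_every`, and the gym-style `env.reset()`/`env.step()` interface, are assumptions for illustration rather than the book's code, and thread synchronization (locks) is omitted for brevity:

```python
import numpy as np

def sarsa_worker(env, Q, Q_target, shared, n_actions,
                 gamma=0.99, epsilon=0.1, alpha=1e-3,
                 update_every=5, sync_every=1000, T_max=100000):
    """One learner thread of asynchronous one-step SARSA (tabular sketch).

    Q and Q_target are NumPy arrays of shape [n_states, n_actions] shared
    across all learner threads; shared['T'] is the global step counter T.
    """
    def eps_greedy(state):
        # Behaviour policy: random action with probability epsilon,
        # otherwise the greedy action under the current Q estimates.
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[state]))

    t = 0                                   # thread-level time step counter
    dQ = np.zeros_like(Q)                   # updates accumulated between applies
    s = env.reset()
    a = eps_greedy(s)                       # choose a for s eps-greedily
    while shared['T'] < T_max:
        s_next, r, done, _ = env.step(a)
        a_next = eps_greedy(s_next)         # choose a' for s' eps-greedily
        # SARSA target: r + gamma * Q_target[s', a'], not max over Q_target[s', .]
        y = r if done else r + gamma * Q_target[s_next, a_next]
        dQ[s, a] += alpha * (y - Q[s, a])   # accumulate the TD update
        s, a = s_next, a_next
        t += 1
        shared['T'] += 1
        if t % update_every == 0 or done:
            Q += dQ                         # asynchronously apply accumulated updates
            dQ[:] = 0
        if shared['T'] % sync_every == 0:
            Q_target[:] = Q                 # periodically refresh the target values
        if done:
            s = env.reset()
            a = eps_greedy(s)
```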