Soft Actor-Critic
The algorithm for Soft Actor-Critic (SAC) is given as follows:
- Initialize the main value network parameter $\psi$, the Q network parameters $\theta_1$ and $\theta_2$, and the actor network parameter $\phi$
- Initialize the target value network parameter $\bar{\psi}$ by just copying the main value network parameter $\psi$
- Initialize the replay buffer
- For $N$ episodes, repeat the following steps
- For each step in the episode, that is, for $t = 0, \ldots, T-1$:
- Select action $a$ based on the policy $\pi_\phi$, that is, $a \sim \pi_\phi(a|s)$
- Perform the selected action $a$, move to the next state $s'$, get the reward $r$, and store the transition $(s, a, r, s')$ in the replay buffer
- Randomly sample a minibatch of K transitions from the replay buffer
- Compute the target state value: $y_v = \min_{j=1,2} Q_{\theta_j}(s, \tilde{a}) - \log \pi_\phi(\tilde{a}|s)$, where $\tilde{a} \sim \pi_\phi(\cdot|s)$ is a fresh action sampled from the current policy
- Compute the loss of the value network, $J_V(\psi) = \frac{1}{K}\sum_{i}\bigl(V_\psi(s_i) - y_{v,i}\bigr)^2$, and update the parameter using gradient descent: $\psi \leftarrow \psi - \lambda \nabla_\psi J_V(\psi)$
- Compute the target Q value: $y = r + \gamma V_{\bar{\psi}}(s')$
- Compute the loss of the Q networks, $J_Q(\theta_j) = \frac{1}{K}\sum_{i}\bigl(Q_{\theta_j}(s_i, a_i) - y_i\bigr)^2$ for $j = 1, 2$, and update the parameters using gradient descent: $\theta_j \leftarrow \theta_j - \lambda \nabla_{\theta_j} J_Q(\theta_j)$, as sketched in the code after this list
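
To make the update equations above concrete, here is a minimal PyTorch sketch of the critic-side steps: computing the target state value, the value network loss, the target Q value, and the Q network losses. The `MLP` class, the `sac_critic_update` function, the `policy(s)` interface returning a sampled action with its log-probability, and the default `gamma` are illustrative assumptions, not code from this text.

```python
# Minimal sketch of SAC's value-network and Q-network updates (assumed helper
# names; the actor network and its update are not shown here).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Small fully connected network, usable for V(s) or Q(s, a)."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def sac_critic_update(batch, value_net, target_value_net, q1, q2, policy,
                      value_opt, q1_opt, q2_opt, gamma=0.99):
    """One gradient step for the value network and both Q networks.

    `batch` holds (s, a, r, s2, done) tensors sampled from the replay buffer;
    `policy(s)` is assumed to return (action, log_prob) for a freshly
    sampled action from the current policy.
    """
    s, a, r, s2, done = batch

    # Target state value: minimum of the two Q values for a fresh action,
    # minus the log-probability of that action (the soft value target y_v).
    with torch.no_grad():
        a_new, log_prob = policy(s)
        sa_new = torch.cat([s, a_new], dim=-1)
        q_min = torch.min(q1(sa_new), q2(sa_new))
        y_v = q_min - log_prob

    # Value network loss J_V(psi) and one gradient descent step on psi.
    value_loss = F.mse_loss(value_net(s), y_v)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # Target Q value: y = r + gamma * V_target(s') for non-terminal s'.
    with torch.no_grad():
        y_q = r + gamma * (1.0 - done) * target_value_net(s2)

    # Q network losses J_Q(theta_1), J_Q(theta_2) and gradient descent steps.
    sa = torch.cat([s, a], dim=-1)
    q1_loss = F.mse_loss(q1(sa), y_q)
    q1_opt.zero_grad(); q1_loss.backward(); q1_opt.step()

    q2_loss = F.mse_loss(q2(sa), y_q)
    q2_opt.zero_grad(); q2_loss.backward(); q2_opt.step()

    return value_loss.item(), q1_loss.item(), q2_loss.item()
```

Taking the minimum over the two Q networks when forming the value target counters overestimation of the soft Q value, which is why two Q network parameters $\theta_1$ and $\theta_2$ are initialized in the first step.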