SAC
In this final section, we will test our environments with a relatively new method called SAC, which was proposed by a group of Berkeley researchers and introduced in the paper Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, by Haarnoja et al., published in 2018 [Haa+18].
At the moment, it is considered one of the best methods for continuous control problems and is very widely used. The core idea of the method is closer to DDPG than to A2C-style policy gradients. We will compare its performance directly with PPO, which has long been considered the standard for continuous control problems.
The central idea of the SAC method is entropy regularization, which adds a bonus reward at every time step that is proportional to the entropy of the policy at that time step. In mathematical notation, the policy we're looking for is the following:

π* = argmax_π Σ_t 𝔼_{(s_t, a_t)∼ρ_π} [ r(s_t, a_t) + α H(π(·|s_t)) ]

Here, H(P) = 𝔼_{x∼P}[−log P(x)] is the entropy of the distribution P, and the temperature coefficient α determines how much weight the entropy bonus gets relative to the reward.
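To make the objective concrete, here is a minimal sketch (not the training code used later) of how such an entropy bonus could be computed for a diagonal Gaussian policy in PyTorch; the mean, standard deviation, reward value, and the α = 0.2 temperature are made-up numbers for illustration only:

# Sketch only: augmenting a per-step reward with an entropy bonus,
# mirroring r(s_t, a_t) + α H(π(·|s_t)) from the objective above.
# All concrete values below are assumptions for illustration.
import torch
import torch.distributions as distr

ENTROPY_ALPHA = 0.2                  # temperature α (assumed value)

mu = torch.tensor([0.1, -0.3])       # policy mean for a 2D action
std = torch.tensor([0.5, 0.8])       # policy standard deviation
policy = distr.Normal(mu, std)       # diagonal Gaussian policy π(·|s_t)

reward = torch.tensor(1.0)           # environment reward r(s_t, a_t)
entropy = policy.entropy().sum()     # H(π(·|s_t)) for the diagonal Gaussian
augmented_reward = reward + ENTROPY_ALPHA * entropy
print(augmented_reward)

In practice, SAC implementations often estimate this term from −log π(a_t|s_t) of a sampled action rather than computing the entropy in closed form, but the role of the bonus is the same: it pushes the policy toward more stochastic behavior and better exploration.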