ES on HalfCheetah
In the next example, we'll go beyond the simplest ES implementation and look at how this method can be parallelized efficiently using the shared seed strategy proposed in the paper [1]. To demonstrate this approach, we'll use HalfCheetah, an environment from the roboschool library that we already experimented with in Chapter 15, Trust Regions – TRPO, PPO, and ACKTR. HalfCheetah is a continuous action problem in which a weird two-legged creature gains reward by running forward without injuring itself.
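Before going into the details, here is a minimal sketch of the core trick behind shared seeds. Every noise vector can be regenerated from an integer seed, so a worker only needs to send back a (seed, reward) pair, and the master rebuilds the noise itself to compute the ES update. This is not the book's code. It makes several assumptions: the policy weights are a flat NumPy vector, the HalfCheetah rollout is replaced by a hypothetical toy fitness function `evaluate`, rewards are normalized before weighting (a common variance-reduction choice), and the "workers" run sequentially in a loop. All names (`worker_step`, `master_update`, `train`, and the constants) are hypothetical.

```python
import numpy as np

PARAMS_COUNT = 10     # hypothetical size of the flattened policy weights
NOISE_STD = 0.05      # sigma: scale of the Gaussian perturbation
LEARNING_RATE = 0.01  # hypothetical step size


def sample_noise(seed: int, size: int) -> np.ndarray:
    """Deterministically regenerate a noise vector from an integer seed."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(size)


def evaluate(params: np.ndarray) -> float:
    """Hypothetical stand-in for a HalfCheetah episode: the 'total reward'
    is higher the closer params are to a fixed target vector of ones."""
    return -float(np.sum((params - 1.0) ** 2))


def worker_step(params: np.ndarray, seed: int) -> tuple[int, float]:
    """Worker side: perturb the parameters with seed-derived noise,
    evaluate them, and send back only (seed, reward)."""
    noise = sample_noise(seed, params.size)
    return seed, evaluate(params + NOISE_STD * noise)


def master_update(params: np.ndarray, results: list[tuple[int, float]]) -> np.ndarray:
    """Master side: rebuild every noise vector from its seed and apply
    the ES gradient estimate, weighting each noise by its normalized reward."""
    rewards = np.array([reward for _, reward in results])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = np.zeros_like(params)
    for (seed, _), weight in zip(results, rewards):
        grad += weight * sample_noise(seed, params.size)
    grad /= len(results) * NOISE_STD
    return params + LEARNING_RATE * grad


def train(iterations: int = 300, population: int = 50, seed: int = 0) -> np.ndarray:
    master_rng = np.random.default_rng(seed)
    params = np.zeros(PARAMS_COUNT)
    for _ in range(iterations):
        # Fresh seeds every iteration; in a distributed setup each seed would
        # go to a different (possibly remote) worker. Here they run in a loop.
        seeds = master_rng.integers(0, 2**31 - 1, size=population)
        results = [worker_step(params, int(s)) for s in seeds]
        params = master_update(params, results)
    return params
```

Calling `train()` moves the parameters from zeros toward the toy target. The point of the design is that each evaluation returns just two numbers, no matter how large `PARAMS_COUNT` is, so the communication cost of gathering a batch stays small even for policies with many weights.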
First, let's discuss the idea of shared seeds. The performance of the ES algorithm is mostly determined by the speed at which we can gather our training batch, which consists of sampling the noise and checking the total reward of the policy with the noise-perturbed parameters. As our training batch items are independent, we can easily parallelize this step across a large number of workers sitting on remote machines (that's a bit similar to the example from Chapter 11, Asynchronous Advantage...