Comparison of policy-based methods in Lunar Lander
Below is a comparison of the evaluation-reward curves of different policy-based algorithms over a single training session in the Lunar Lander environment:
To also give a sense of how long each training session took and what the performance was at the end of training, below is the TensorBoard tooltip for the plot above:
Before going into further discussion, a disclaimer: the comparisons here should not be taken as a benchmark of the different algorithms, for several reasons:
- We did not perform any hyper-parameter tuning,
- The plots come from a single training run for each algorithm. Training an RL agent is a highly stochastic process, and a fair comparison should include multiple runs per algorithm (e.g., averaged over several random seeds).
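The stochasticity point can be made concrete with a small sketch: even with a fixed algorithm, different random seeds produce noticeably different reward curves, so a fairer protocol reports the mean and spread over several runs. The `train_and_evaluate` function below is a hypothetical stand-in (noisy simulated curves, not the actual agents above) used only to show the multi-seed reporting pattern:

```python
import random
import statistics

def train_and_evaluate(seed: int, n_updates: int = 50) -> list[float]:
    """Stand-in for one training run: returns an evaluation-reward curve.
    The Gaussian noise models the run-to-run stochasticity of RL training."""
    rng = random.Random(seed)
    curve = []
    reward = -200.0  # rough LunarLander-style starting performance
    for _ in range(n_updates):
        reward += 10.0 + rng.gauss(0.0, 15.0)  # noisy improvement per update
        curve.append(reward)
    return curve

# A fairer comparison: run several seeds, then report mean +/- std
# of the final evaluation reward instead of a single curve.
seeds = range(5)
curves = [train_and_evaluate(s) for s in seeds]
final_rewards = [c[-1] for c in curves]
mean_final = statistics.mean(final_rewards)
std_final = statistics.stdev(final_rewards)
print(f"final reward over {len(final_rewards)} seeds: "
      f"{mean_final:.1f} +/- {std_final:.1f}")
```

A single seed would report only one of the `final_rewards` values, which can land well above or below the mean; the spread (`std_final`) is exactly the information a one-run plot hides.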