Policy gradient methods on CartPole
Nowadays, almost nobody uses the vanilla policy gradient method, since the much more stable actor-critic method exists. However, I still want to show the policy gradient implementation, as it establishes important concepts and the metrics used to check a policy gradient method's performance.
Implementation
So, we will start with the much simpler CartPole environment, and in the next section, we will check the method's performance on our favorite Pong environment.
The complete code for the following example is available in Chapter11/04_cartpole_pg.py.
GAMMA = 0.99
LEARNING_RATE = 0.001
ENTROPY_BETA = 0.01
BATCH_SIZE = 8
REWARD_STEPS = 10
Besides the already familiar hyperparameters, we have two new ones: the ENTROPY_BETA value is the scale of the entropy bonus, and the REWARD_STEPS value specifies how many steps ahead the Bellman equation is unrolled to estimate the discounted total reward of every transition.
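To make the REWARD_STEPS unrolling concrete, here is a minimal sketch of the computation (discounted_reward is a hypothetical helper shown only for illustration; the chapter's code delegates this unrolling to its experience source):

def discounted_reward(rewards, gamma=GAMMA):
    # Compute the sum of gamma**i * r_i over an N-step window,
    # where N == REWARD_STEPS and r_i are the per-step rewards.
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Example: a 10-step window of unit rewards, as CartPole produces
print(discounted_reward([1.0] * REWARD_STEPS))  # ~9.56 with GAMMA = 0.99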
class PGN...
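The listing is truncated here. As a rough guide, a policy network for CartPole can be a small feed-forward model that maps an observation vector to one raw score (logit) per action. The following sketch is an assumption about its shape, not the book's exact definition; in particular, the hidden width of 128 is a common choice rather than something taken from the text:

import torch.nn as nn

class PGN(nn.Module):
    # Minimal policy network sketch: observation in, one logit per action out.
    # The hidden size of 128 is an assumption, not confirmed by the source.
    def __init__(self, input_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        # Return unnormalized scores; softmax/log_softmax is applied later,
        # when action probabilities or the entropy bonus are computed.
        return self.net(x)

Returning logits rather than probabilities is the usual choice for numerical stability, as log_softmax can then be applied inside the loss computation.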