Policy gradient methods on CartPole
Nowadays, almost nobody uses the vanilla policy gradient method, as the much more stable actor-critic method exists. However, I still want to show the policy gradient implementation, as it establishes important concepts and metrics for checking the policy gradient method's performance.
Implementation
We will start with the much simpler CartPole environment, and in the next section, we will check the method's performance in our favorite Pong environment. The complete code for the following example is available in Chapter11/04_cartpole_pg.py.
Besides the already familiar hyperparameters, we have two new ones:
GAMMA = 0.99
LEARNING_RATE = 0.001
ENTROPY_BETA = 0.01
BATCH_SIZE = 8
REWARD_STEPS = 10
The ENTROPY_BETA value is the scale of the entropy bonus, and the REWARD_STEPS value specifies how many steps ahead the Bellman equation is unrolled to estimate the discounted total reward of every transition.
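To make these two hyperparameters concrete, here is a minimal sketch (not the exact code from Chapter11/04_cartpole_pg.py; the function names are mine) of how they are typically used: ENTROPY_BETA scales the entropy bonus that is subtracted from the policy gradient loss, and REWARD_STEPS defines how many rewards are accumulated with discounting to form the scale of every transition.

import torch
import torch.nn.functional as F

GAMMA = 0.99
ENTROPY_BETA = 0.01
REWARD_STEPS = 10

def discount_rewards(rewards, gamma=GAMMA):
    # Accumulate up to REWARD_STEPS rewards into a single discounted sum:
    # R = r_0 + gamma * r_1 + ... + gamma^(N-1) * r_{N-1}
    total = 0.0
    for r in reversed(rewards[:REWARD_STEPS]):
        total = r + gamma * total
    return total

def policy_gradient_loss(logits, actions, scales):
    # logits: network output for a batch of states, shape (batch, n_actions)
    # actions: taken actions, shape (batch,), dtype torch.long
    # scales: discounted returns for those transitions, shape (batch,)
    log_probs = F.log_softmax(logits, dim=1)
    # policy gradient part: -R * log pi(a|s), averaged over the batch
    log_prob_actions = scales * log_probs[range(len(actions)), actions]
    loss_policy = -log_prob_actions.mean()

    # entropy bonus: H(pi) = -sum_a pi(a|s) * log pi(a|s); subtracting it
    # from the loss pushes the policy away from premature certainty
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * log_probs).sum(dim=1).mean()
    loss_entropy = -ENTROPY_BETA * entropy

    return loss_policy + loss_entropy

The important point is the sign of the entropy term: because we minimize the loss, adding -ENTROPY_BETA * entropy rewards higher-entropy (less certain) policies, which helps exploration early in training.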