Monte Carlo policy gradient (REINFORCE) method
The simplest policy gradient method is REINFORCE [5], a Monte Carlo policy gradient method:
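$$\nabla J(\theta) = \mathbb{E}_\pi\left[ R_t \nabla_\theta \ln \pi(a_t \mid s_t, \theta) \right]$$ (Equation 10.2.1)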
where $R_t$ is the return as defined in Equation 9.1.2. $R_t$ is an unbiased sample of $Q^\pi(s_t, a_t)$ in the policy gradient theorem.
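To make the role of the return concrete, the following is a minimal sketch of how it could be computed for every step of one finished episode. It assumes the return is the usual discounted sum of future rewards; Equation 9.1.2 is not reproduced in this section, so the exact indexing convention is an assumption, and the function name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Compute R_t = sum_k gamma**k * rewards[t + k] for each step t.

    Assumption: rewards[t] is the reward received after the action taken
    at step t, and the episode has already terminated.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it:
    # R_t = rewards[t] + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, `discounted_returns([1, 1, 1], 0.5)` evaluates to `[1.75, 1.5, 1.0]`: the last step sees only its own reward, while earlier steps add progressively discounted future rewards.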
Algorithm 10.2.1 summarizes the REINFORCE algorithm [2]. REINFORCE is a Monte Carlo algorithm that does not require knowledge of the dynamics of the environment; that is, it is model-free. Only experience samples, $\langle s_t, a_t, r_{t+1}, s_{t+1} \rangle$, are needed to optimally tune the parameters of the policy network, $\theta$. The discount factor, $\gamma$, takes into consideration that rewards decrease in value as the number of steps increases. The gradient is discounted by $\gamma^t$, so gradients taken at later steps have smaller contributions. The learning rate, $\alpha$, is a scaling factor of the gradient update.
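The roles of the discount factor, the discounted gradient, and the learning rate can be illustrated with a small sketch. This is not the book's Algorithm 10.2.1 listing: it swaps the policy network for a hypothetical linear-softmax policy so that the gradient of the log-probability can be written by hand in plain Python. The names `softmax` and `reinforce_update`, and the default values of `gamma` and `alpha`, are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, states, actions, rewards, gamma=0.99, alpha=0.01):
    """Apply REINFORCE-style updates for one finished episode.

    theta: list of per-action weight rows (a linear-softmax policy used
           here as a stand-in for the policy network).
    states, actions, rewards: the experience collected in the episode.
    Returns a new weight matrix; the input is left unchanged.
    """
    theta = [row[:] for row in theta]
    T = len(rewards)

    # Monte Carlo returns, computed backwards over the episode.
    returns = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running

    for t in range(T):
        s, a = states[t], actions[t]
        logits = [sum(w * x for w, x in zip(row, s)) for row in theta]
        probs = softmax(logits)
        # Step size: learning rate times the gamma**t discount times the return.
        scale = alpha * (gamma ** t) * returns[t]
        # For a linear-softmax policy, d ln pi(a|s) / d theta[k][j]
        # equals (1[k == a] - probs[k]) * s[j].
        for k, row in enumerate(theta):
            indicator = 1.0 if k == a else 0.0
            for j, x in enumerate(s):
                row[j] += scale * (indicator - probs[k]) * x  # gradient ascent
    return theta
```

Because the update is gradient ascent scaled by the return, an action followed by a positive return becomes more probable in the state where it was taken, and the `gamma ** t` factor shrinks the contribution of later steps.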
The parameters are updated by performing gradient ascent using the discounted gradient and learning rate. As a Monte Carlo algorithm, REINFORCE requires...