Monte Carlo policy gradient (REINFORCE) method
The simplest policy gradient method is REINFORCE [5], a Monte Carlo policy gradient method:
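$\nabla J(\theta) = \mathbb{E}_{\pi}\left[ R_t \, \nabla_\theta \ln \pi(a_t \mid s_t, \theta) \right]$ (Equation 10.2.1)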
where $R_t$ is the return as defined in Equation 9.1.2. $R_t$ is an unbiased sample of the action value $Q^{\pi}(s_t, a_t)$ in the policy gradient theorem.
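To make the return concrete, the following is a minimal sketch of computing it for every step of a finished episode. It assumes the usual discounted-sum form, in which `rewards[t]` is the reward collected after acting at step t. The function name `discounted_returns` and the sample numbers are illustrative and not part of the original text.

```python
def discounted_returns(rewards, gamma):
    """Return [R_0, ..., R_{T-1}] with R_t = sum_k gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each return reuses the next one:
    # R_t = rewards[t] + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Because each return depends on all later rewards, it can only be computed once the episode has ended.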
Algorithm 10.2.1 summarizes the REINFORCE algorithm [2]. REINFORCE is model-free: it does not require knowledge of the dynamics of the environment. Only experience samples, $\langle s_i, a_i, r_{i+1}, s_{i+1} \rangle$, are needed to tune the parameters, $\theta$, of the policy network, $\pi(a \mid s, \theta)$. The discount factor, $\gamma$, accounts for the fact that rewards decrease in value as the number of steps increases. The gradient at step $t$ is discounted by $\gamma^t$, so gradients taken at later steps make smaller contributions. The learning rate, $\alpha$, is a scaling factor of the gradient update.
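The loop described above can be sketched in plain Python. This is an illustration under stated assumptions, not the book's Algorithm 10.2.1 listing: the policy is a hypothetical linear-softmax model rather than a neural network, the names `softmax_policy`, `reinforce_update`, and `train_toy` are invented for this example, and the environment is a toy one-step task in which action 1 pays a reward of 1 and action 0 pays 0.

```python
import math
import random


def softmax_policy(theta, x):
    """Action probabilities pi(a | s, theta) for a linear-softmax policy."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in theta]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, episode, gamma, alpha):
    """Apply one REINFORCE pass over a finished episode, updating theta in place.

    episode: list of (x_t, a_t, r_next) tuples, where x_t is the state
    feature vector and r_next is the reward that follows action a_t.
    """
    # Monte Carlo returns, computed backwards: R_t = r_next + gamma * R_{t+1}.
    returns, running = [0.0] * len(episode), 0.0
    for t in reversed(range(len(episode))):
        running = episode[t][2] + gamma * running
        returns[t] = running

    for t, (x, a, _) in enumerate(episode):
        probs = softmax_policy(theta, x)
        scale = alpha * (gamma ** t) * returns[t]
        for b, row in enumerate(theta):
            # d ln pi(a|s) / d theta[b] = (1[a == b] - pi(b|s)) * x
            coeff = (1.0 if b == a else 0.0) - probs[b]
            for i, xi in enumerate(x):
                row[i] += scale * coeff * xi  # gradient ascent step
    return theta


def train_toy(episodes=2000, gamma=0.99, alpha=0.1, seed=0):
    """Toy one-step task (an assumption): action 1 pays 1, action 0 pays 0."""
    rng = random.Random(seed)
    theta = [[0.0], [0.0]]  # 2 actions, 1 constant feature
    x = [1.0]
    for _ in range(episodes):
        a = rng.choices([0, 1], weights=softmax_policy(theta, x))[0]
        reward = 1.0 if a == 1 else 0.0
        reinforce_update(theta, [(x, a, reward)], gamma, alpha)
    return softmax_policy(theta, x)


if __name__ == "__main__":
    print(train_toy())  # probability of action 1 should approach 1
```

Only sampled states, actions, and rewards enter the update, which is what makes the method model-free. The sketch updates the parameters step by step within an episode; accumulating the gradient over the whole episode before applying it is an equally common variant.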
The parameters are updated by gradient ascent: the discounted gradient, $\gamma^t R_t \nabla_\theta \ln \pi(a_t \mid s_t, \theta)$, is scaled by the learning rate, $\alpha$, and added to $\theta$. As a Monte Carlo algorithm, REINFORCE requires...