2. Monte Carlo policy gradient (REINFORCE) method
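The simplest policy gradient method is REINFORCE [4], which is a Monte Carlo policy gradient method:
$$\nabla J(\theta) = \mathbb{E}_\pi \left[ R_t \, \nabla_\theta \ln \pi(a_t \mid s_t, \theta) \right]$$
where $R_t$ is the return as defined in Equation 9.1.2. $R_t$ is an unbiased sample of $Q^\pi(s_t, a_t)$ in the policy gradient theorem.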
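As an illustration of the return used as the Monte Carlo sample, here is a minimal sketch of computing the return at every step of one finished episode. It assumes the return is the discounted sum of the rewards that follow step t; Equation 9.1.2 is not reproduced in this section, so the exact indexing is an assumption, and the function name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute the return at every step of one finished episode.

    rewards[t] is assumed to be the reward received after the action
    taken at step t. The return at step t is then
    rewards[t] + gamma * rewards[t+1] + gamma**2 * rewards[t+2] + ...
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

The backward pass keeps the cost linear in the episode length. Because the full episode must be available before any return can be computed, the method is Monte Carlo rather than bootstrapped.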
Algorithm 10.2.1 summarizes the REINFORCE algorithm [2]. REINFORCE is a Monte Carlo algorithm. It does not require knowledge of the dynamics of the environment (in other words, it is model-free). Only experience samples, $\langle s_t, a_t, r_{t+1}, s_{t+1} \rangle$, are needed to optimally tune the parameters of the policy network, $\pi(a_t \mid s_t, \theta)$. The discount factor, $\gamma$, takes into consideration the fact that rewards decrease in value as the number of steps increases. The gradient is discounted by $\gamma^t$, so gradients taken at later steps have smaller contributions. The learning rate, $\alpha$, is a scaling factor of the gradient update.
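The roles of the discount factor, the discounted gradient, and the learning rate can be sketched in a few lines. This is not Algorithm 10.2.1 itself: it swaps the policy network for a tabular softmax policy (one preference value per state-action pair) so that the gradient of the log-probability can be written by hand. The names `softmax`, `reinforce_update`, and the `(state, action, reward)` episode layout are assumptions made for illustration.

```python
import math


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, episode, gamma=0.99, alpha=0.1):
    """Run one REINFORCE pass over a finished episode, updating theta in place.

    theta:   theta[s][a] is the preference for action a in state s
             (a tabular stand-in for the policy network parameters).
    episode: list of (state, action, reward) tuples in time order, where
             reward is the reward received after taking the action.
    """
    # Monte Carlo returns, computed backwards from the end of the episode.
    returns = [0.0] * len(episode)
    running = 0.0
    for t in reversed(range(len(episode))):
        running = episode[t][2] + gamma * running
        returns[t] = running

    for t, (s, a, _) in enumerate(episode):
        probs = softmax(theta[s])
        for b in range(len(probs)):
            # For a softmax policy, d ln pi(a|s) / d theta[s][b]
            # equals 1[a == b] - pi(b|s).
            grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
            # Gradient ascent: learning rate * gamma^t * return * gradient.
            theta[s][b] += alpha * (gamma ** t) * returns[t] * grad_log_pi
    return theta


# One state, two actions; action 1 was taken once and earned a reward of 1.
theta = [[0.0, 0.0]]
reinforce_update(theta, [(0, 1, 1.0)], gamma=0.99, alpha=0.1)
print(softmax(theta[0]))  # the probability of action 1 is now above 0.5
```

With a neural policy, the hand-written `grad_log_pi` would be replaced by automatic differentiation of the log-probability, but the scaling by the learning rate, the discount, and the return stays the same.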
The parameters are updated by performing gradient ascent using the discounted gradient and learning rate. As a Monte Carlo algorithm...