2. Monte Carlo policy gradient (REINFORCE) method
The simplest policy gradient method is REINFORCE [4], which is a Monte Carlo policy gradient method:
$$\nabla J(\theta) = \mathbb{E}_{\pi}\left[ R_t \, \nabla_{\theta} \ln \pi(a_t \mid s_t, \theta) \right]$$
where $R_t$ is the return as defined in Equation 9.1.2. $R_t$ is an unbiased sample of $Q^{\pi}(s_t, a_t)$ in the policy gradient theorem.
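To make the return concrete, here is a minimal sketch of how it could be computed from the rewards of one finished episode. It assumes the usual discounted-sum form of the return and that `rewards[t]` holds the reward received after the action at step `t`. The function name `discounted_returns` and the default `gamma` are illustrative, not taken from the text.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [R_0, ..., R_{T-1}] with R_t = sum_k gamma**k * rewards[t + k].

    Assumption: rewards[t] is the reward that follows the action at step t.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each return reuses the next one:
    # R_t = rewards[t] + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```

Because the return depends on every later reward, it can only be computed once the episode has ended, which is what makes this a Monte Carlo estimate.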
Algorithm 10.2.1 summarizes the REINFORCE algorithm [2]. REINFORCE is a Monte Carlo algorithm. It does not require knowledge of the dynamics of the environment (in other words, it is model-free). Only experience samples, $\langle s_i, a_i, r_{i+1}, s_{i+1} \rangle$, are needed to tune the parameters, $\theta$, of the policy network. The discount factor, $\gamma$, accounts for the fact that rewards decrease in value as the number of steps increases. The gradient is discounted by $\gamma^t$, so gradients taken at later steps contribute less. The learning rate, $\alpha$, is a scaling factor of the gradient update.
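The roles of the discount factor and the learning rate can be illustrated with a small, self-contained sketch. It is not the listing of Algorithm 10.2.1. It assumes a linear softmax policy (one weight vector per action) in place of a policy network so that the gradient of the log-probability can be written by hand, and the names `reinforce_update`, `theta`, and `episode` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, episode, gamma=0.99, alpha=0.01):
    """Apply one REINFORCE pass over a finished episode, updating theta in place.

    Assumptions (illustrative, not from the text):
      theta   -- list of per-action weight vectors of a linear softmax policy
      episode -- list of (state, action, reward) tuples, where reward is the
                 reward received after taking action in state
    """
    # Monte Carlo returns, computed backwards: R_t = r + gamma * R_{t+1}
    returns, running = [], 0.0
    for _, _, reward in reversed(episode):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()

    for t, ((state, action, _), R_t) in enumerate(zip(episode, returns)):
        logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
        probs = softmax(logits)
        # Learning rate times the gamma**t discount times the return
        scale = alpha * (gamma ** t) * R_t
        for b, row in enumerate(theta):
            indicator = 1.0 if b == action else 0.0
            for j, x in enumerate(state):
                # d/d theta[b][j] of ln pi(action | state) = (indicator - probs[b]) * x
                row[j] += scale * (indicator - probs[b]) * x  # gradient ascent
    return theta


if __name__ == "__main__":
    theta = [[0.0], [0.0]]            # two actions, one state feature
    episode = [([1.0], 0, 1.0)]       # a single step: action 0 earned reward 1
    print(reinforce_update(theta, episode, gamma=1.0, alpha=0.1))
```

In this sketch an action followed by a positive return has its probability pushed up, and the `gamma ** t` factor shrinks the contribution of steps taken late in the episode.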
The parameters are updated by performing gradient ascent using the discounted gradient and learning rate. As a Monte Carlo algorithm...