Simple REINFORCE has the notable property of being unbiased, but it exhibits high variance. Adding a baseline reduces the variance while keeping the estimator unbiased (asymptotically, the algorithm converges to a local optimum). A major drawback of REINFORCE with baseline is that it converges very slowly, requiring a large number of interactions with the environment.
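As a reminder of the estimator in question, the baseline-subtracted policy gradient can be sketched as follows (the notation here is generic and may differ slightly from the formulas used elsewhere in the book):

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)$$

Here, $G_t$ is the full Monte Carlo return collected from the environment, and $b(s_t)$ is the baseline, often a learned state-value estimate. Because $G_t$ is only available at the end of a trajectory, many complete episodes are needed before each update, which is the source of the slow convergence mentioned above.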
One approach to speed up training is bootstrapping, a technique that we've already seen many times throughout the book. It estimates the return from the subsequent state values. Policy gradient algorithms that use this technique are called actor-critic (AC) methods. In the AC algorithm, the actor is the policy, and the critic is the value function (typically, a state-value function) that "critiques" the behavior of the actor, to help...
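The following is a minimal sketch of how the bootstrapped target and the resulting "critique" might be computed for a batch of transitions. It is not the book's implementation; the function name and arguments are illustrative, and the state values are assumed to come from some critic (for example, a small neural network).

```python
def one_step_targets(rewards, state_values, next_state_values, dones, gamma=0.99):
    """Compute bootstrapped targets and advantages for a batch of transitions.

    All arguments are lists of equal length, one entry per transition
    (s_t, a_t, r_t, s_{t+1}); `dones` flags terminal transitions.
    """
    targets, advantages = [], []
    for r, v_s, v_next, done in zip(rewards, state_values, next_state_values, dones):
        # Bootstrapping: instead of the full Monte Carlo return G_t,
        # use r_t + gamma * V(s_{t+1}) as the estimated return.
        target = r + gamma * v_next * (not done)
        targets.append(target)
        # The critic "critiques" the actor: the difference between the
        # bootstrapped target and V(s_t) tells the actor whether the
        # action turned out better or worse than expected.
        advantages.append(target - v_s)
    return targets, advantages
```

In this sketch, the targets would be used to train the critic, while the advantages would weight the actor's log-probability gradients, removing the need to wait for complete episodes.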