Summary
We started the chapter by understanding what the actor-critic method is. We learned that in the actor-critic method, the actor computes the optimal policy, and the critic evaluates the policy computed by the actor by estimating the value function. Next, we learned how the actor-critic method differs from the policy gradient method with a baseline.
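As a quick reference, a minimal sketch of the two networks might look as follows. PyTorch is used purely for illustration here; the class names, layer sizes, and activation choices are assumptions rather than the chapter's own code:

```python
import torch.nn as nn

class Actor(nn.Module):
    """Actor: maps a state to a probability distribution over actions (the policy)."""
    def __init__(self, state_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Critic: maps a state to a scalar estimate of its value V(s)."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state):
        return self.net(state)
```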
We learned that in the policy gradient method with a baseline, we first generate complete episodes (trajectories) and then update the parameters of the network, whereas in the actor-critic method, we update the parameters of the network at every step of the episode. Moving forward, we learned what the advantage actor-critic (A2C) algorithm is and how it uses the advantage function in the gradient update.
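The sketch below illustrates this per-step update, with the advantage approximated by the TD error r + γV(s') − V(s). The function name and tensor conventions are illustrative assumptions (states are float tensors, the action is an integer index, and the actor and critic are the networks sketched above), not the chapter's code:

```python
import torch

def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, done, gamma=0.99):
    """One online update, performed at every step of the episode
    (unlike the policy gradient method with a baseline, which waits for a full trajectory)."""
    value = critic(state)
    next_value = critic(next_state).detach()

    # TD target and advantage: A(s, a) ~ r + gamma * V(s') - V(s)
    td_target = reward + gamma * next_value * (1.0 - done)
    advantage = td_target - value

    # Critic minimizes the squared TD error
    critic_loss = advantage.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor follows the policy gradient scaled by the advantage
    log_prob = torch.log(actor(state)[action])
    actor_loss = -(log_prob * advantage.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

Because the critic supplies an estimate of V(s') immediately, this update can be applied after every transition instead of waiting for the episode to end.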
At the end of the chapter, we learned about another interesting actor-critic algorithm, called the asynchronous advantage actor-critic (A3C) method. We learned that A3C consists of several worker agents and a global agent: each worker interacts with its own copy of the environment, computes gradients locally, and applies them to the global network asynchronously.
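The following sketch shows the structure of one such worker. It is only an illustration: it assumes a Gymnasium-style environment factory, uses full-episode returns as a simple advantage estimate for brevity, and the names (a3c_worker, make_env) are hypothetical rather than the chapter's code:

```python
import copy
import threading
import torch

def a3c_worker(global_actor, global_critic, actor_opt, critic_opt,
               make_env, n_episodes, gamma=0.99):
    """One A3C worker: keeps local copies of the global networks, gathers experience
    from its own environment copy, computes gradients locally, and pushes them to the
    shared global networks asynchronously (no locking, in the Hogwild style)."""
    env = make_env()                                   # each worker owns an environment copy
    local_actor = copy.deepcopy(global_actor)
    local_critic = copy.deepcopy(global_critic)

    for _ in range(n_episodes):
        # Sync the local copies with the latest global parameters
        local_actor.load_state_dict(global_actor.state_dict())
        local_critic.load_state_dict(global_critic.state_dict())
        local_actor.zero_grad()
        local_critic.zero_grad()

        state, _ = env.reset()
        log_probs, values, rewards, done = [], [], [], False
        while not done:
            s = torch.as_tensor(state, dtype=torch.float32)
            dist = torch.distributions.Categorical(local_actor(s))
            action = dist.sample()
            state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            log_probs.append(dist.log_prob(action))
            values.append(local_critic(s).squeeze())
            rewards.append(float(reward))

        # Discounted returns and advantages A(s_t, a_t) = G_t - V(s_t)
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        advantages = returns - torch.stack(values)

        loss = (-(torch.stack(log_probs) * advantages.detach()).sum()
                + advantages.pow(2).sum())
        loss.backward()                                # gradients live on the local copies

        # Copy local gradients onto the global networks, then update them
        for local_net, global_net, opt in ((local_actor, global_actor, actor_opt),
                                           (local_critic, global_critic, critic_opt)):
            for lp, gp in zip(local_net.parameters(), global_net.parameters()):
                gp.grad = lp.grad
            opt.step()
            opt.zero_grad()

# Hypothetical usage: several worker threads updating one shared pair of global networks
# threads = [threading.Thread(target=a3c_worker,
#                             args=(global_actor, global_critic, actor_opt,
#                                   critic_opt, make_env, 100))
#            for _ in range(4)]
# for t in threads:
#     t.start()
```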