In this chapter, we learned that PG methods are not without their own faults and looked at ways to correct them. This led us to explore implementation techniques that improve sampling efficiency and optimize a clipped objective function. We did this by looking at the PPO method, which uses a clipped surrogate objective to keep each policy update within a trust region when calculating the gradient. After that, we looked at adding a new network layer configuration to better capture the context within a state.
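As a refresher, the clipped surrogate objective at the heart of PPO can be sketched in a few lines. The following is a minimal NumPy illustration, not the chapter's full implementation; the function name and the default clip range of 0.2 are assumptions for this sketch:

```python
import numpy as np

def ppo_clipped_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (to be maximized).

    new_logp / old_logp: log-probabilities of the taken actions under the
    new and old policies; advantages: advantage estimates for those actions.
    clip_eps: width of the trust region around a probability ratio of 1
    (0.2 is a common default, assumed here).
    """
    ratio = np.exp(new_logp - old_logp)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the element-wise minimum removes any incentive to push the
    # ratio outside [1 - clip_eps, 1 + clip_eps], keeping updates in the
    # trust region.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage; when the ratio drifts outside the clip range, the gradient through that sample is cut off.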
Then, we used this new layer type, an LSTM layer, on top of PPO to see the improvements it generated. Next, we looked at improving sampling by using parallel environments with synchronous or asynchronous workers. We implemented synchronous workers by building an A2C example, followed by an example of using asynchronous workers with A3C. We finished this...
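The synchronous-worker idea from the A2C example can be summarized in a small sketch: every worker environment steps in lockstep, observations are stacked into one batch, and the policy does a single batched forward pass per step. Everything below (the `ToyEnv` class, the `synchronous_rollout` helper, and the placeholder policy) is hypothetical scaffolding for illustration, not the chapter's actual code:

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in for a real environment (e.g. a Gym env)."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.state = self.rng.normal(size=4)

    def reset(self):
        self.state = self.rng.normal(size=4)
        return self.state

    def step(self, action):
        # Toy dynamics: nudge the state and reward staying near the origin.
        self.state = self.state + 0.1 * action
        reward = -float(np.sum(self.state ** 2))
        return self.state, reward

def synchronous_rollout(envs, policy, n_steps):
    """A2C-style synchronous sampling: all workers step in lockstep, and
    the combined batch would then be used for one gradient update."""
    batch_obs, batch_rewards = [], []
    obs = np.stack([env.reset() for env in envs])
    for _ in range(n_steps):
        actions = policy(obs)                            # one batched forward pass
        results = [env.step(a) for env, a in zip(envs, actions)]
        obs = np.stack([r[0] for r in results])
        batch_obs.append(obs)
        batch_rewards.append([r[1] for r in results])
    return np.array(batch_obs), np.array(batch_rewards)

# Usage: 4 parallel workers, a placeholder policy, 5 lockstep steps.
envs = [ToyEnv(seed=i) for i in range(4)]
policy = lambda obs: np.zeros((len(obs), 4))             # placeholder policy
obs_batch, rew_batch = synchronous_rollout(envs, policy, n_steps=5)
```

The asynchronous variant in A3C drops the lockstep: each worker rolls out and pushes gradients to the shared model on its own schedule, at the cost of slightly stale parameters.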