Summary
We started off the chapter by understanding what TRPO is and how it acts as an improvement to the policy gradient algorithm. We learned that when the new policy diverges greatly from the old policy, it can cause the model to collapse, with performance degrading sharply.
So in TRPO, we make a policy update while imposing the constraint that the old and new policies should stay within the trust region, measured by the KL divergence between them. We also learned that TRPO guarantees monotonic policy improvement; that is, it guarantees that the policy improves on every iteration.
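As a quick recap, a standard statement of the constrained optimization problem TRPO solves is shown below; here $\delta$ denotes the size of the trust region, and the notation may differ slightly from the chapter's:

$$\max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, A^{\pi_{\theta_{\text{old}}}}(s,a)\right] \quad \text{subject to} \quad \mathbb{E}_{s}\!\left[D_{\text{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\big)\right] \le \delta$$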
Later, we learned about the PPO algorithm, which acts as an improvement to the TRPO algorithm. We learned about two types of the PPO algorithm: PPO-clipped and PPO-penalty. In the PPO-clipped method, to keep policy updates within the trust region, PPO adds a clipping function that clips the probability ratio between the new and old policies so that they do not move too far apart. In the PPO-penalty method, we modify the objective function by adding the KL divergence between the old and new policies as a penalty term, and the penalty coefficient is adapted over the course of training.
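The following is a minimal NumPy sketch of the two PPO objectives discussed in the chapter. It assumes the probability ratio, advantages, and KL values have already been computed from sampled trajectories; the function names and default hyperparameter values are illustrative, not taken from the chapter's implementation:

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO-clipped surrogate: the ratio pi_new/pi_old is clipped to
    [1 - epsilon, 1 + epsilon] so the new policy stays close to the old one."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Take the elementwise minimum so clipping only removes the incentive
    # to move the ratio outside the interval, then average over samples.
    return np.mean(np.minimum(unclipped, clipped))

def ppo_penalty_objective(ratio, advantage, kl, beta=1.0):
    """PPO-penalty surrogate: instead of clipping, subtract beta times the
    KL divergence between the old and new policies as a penalty term."""
    return np.mean(ratio * advantage) - beta * np.mean(kl)
```

In practice, the clipped objective is maximized by gradient ascent on the policy parameters, and in the penalty variant the coefficient beta is typically increased when the measured KL divergence is too large and decreased when it is too small.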