A work by Schulman and others shows that this is possible. It uses an idea similar to TRPO's while reducing the complexity of the method. The method is called Proximal Policy Optimization (PPO), and its strength is that it relies on first-order optimization only, without sacrificing reliability compared to TRPO. PPO is also more general and sample-efficient than TRPO, and it allows multiple mini-batch updates on the same batch of collected data.
Proximal Policy Optimization
A quick overview
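The main idea behind PPO is to clip the surrogate objective function when the probability ratio between the new and the old policy, $r_t(\theta)$, moves too far from 1, instead of constraining the update with a KL-divergence bound as TRPO does. This prevents the policy from making updates that are too large. The main objective is as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$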
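To make the clipping idea concrete, the following is a minimal sketch in plain Python rather than a reference implementation. The function name `clipped_surrogate`, its arguments, and the default `eps=0.2` are assumptions made for illustration. The sketch also assumes that the policy is represented by per-sample log-probabilities and that advantage estimates are already available.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    new_logp, old_logp: log-probabilities of the taken actions under the
    new and old policy (hypothetical inputs for this sketch).
    advantages: advantage estimates for the same samples.
    eps: clipping range; 0.2 is an assumed default.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the new and the old policy
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to the interval [1 - eps, 1 + eps]
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Pessimistic choice: keep the smaller of the two terms
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * 1.0
print(clipped_surrogate([math.log(1.5)], [0.0], [1.0]))
# Ratio 1.5 with a negative advantage: the unclipped term (about -1.5) is kept
print(clipped_surrogate([math.log(1.5)], [0.0], [-1.0]))
```

The two calls show the asymmetry of the objective: clipping removes the incentive to push the ratio further once it would improve the objective, but it does not hide the penalty when the change makes the objective worse.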
Here, $r_t(\theta)$ is defined as...