Many variants exist of the vanilla model-based and model-free algorithms introduced in the pseudocode in the A useful combination section. Pretty much all of them propose different ways to deal with the imperfections of the model of the environment.
This is a key problem to address in order to reach the same performance as model-free methods. Models learned from complex environments will always have some inaccuracies. So, the main challenge is to estimate or control the uncertainty of the model to stabilize and accelerate the learning process.
ME-TRPO proposes the use of an ensemble of models to maintain the model uncertainty and regularize the learning process. The models are deep neural networks with different weight initialization and training data. Together, they provide a more robust general model of the environment that is less prone...