We just saw how planning can be computationally expensive both during training and at runtime, and how, in more complex environments, planning algorithms struggle to achieve good performance. The other strategy that we briefly hinted at is to learn a policy. A policy is certainly much faster at inference time, as it doesn't have to plan at each step.
A simple yet effective way to learn a policy is to combine model-based with model-free learning. With the latest innovations in model-free algorithms, this combination has gained popularity and is the most common approach to date. The algorithm we'll develop in the next section, ME-TRPO, is one such method. Let's dive further into these algorithms.
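To make the idea concrete before we get to ME-TRPO, here is a minimal sketch of the general recipe, not ME-TRPO itself: collect real transitions, fit a dynamics model to them (the model-based part), then optimize a policy using only imagined rollouts from that model (the model-free part). The toy 1-D task, the linear model, and the random-search policy update below are all illustrative assumptions chosen for brevity.

```python
# Minimal sketch: a learned dynamics model supplies cheap imagined
# rollouts on which a model-free policy search is run.
# The environment, model class, and update rule are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)

def real_step(s, a):
    """Toy ground-truth dynamics: a noisy 1-D point mass."""
    return s + 0.1 * a + rng.normal(scale=0.01)

# 1. Collect real transitions with random actions.
S, A, S_next = [], [], []
s = 0.0
for _ in range(500):
    a = rng.uniform(-1, 1)
    s2 = real_step(s, a)
    S.append(s); A.append(a); S_next.append(s2)
    s = s2

# 2. Model-based part: fit a linear dynamics model
#    s' ~ w . [s, a, 1] by least squares.
X = np.column_stack([S, A, np.ones(len(S))])
w, *_ = np.linalg.lstsq(X, np.array(S_next), rcond=None)

def model_step(s, a):
    return w @ np.array([s, a, 1.0])

# 3. Model-free part: hill-climb a linear policy a = theta . [s, 1]
#    on *imagined* rollouts, rewarding proximity to the goal s* = 1.
def imagined_return(theta, horizon=20):
    s, ret = 0.0, 0.0
    for _ in range(horizon):
        a = np.clip(theta @ np.array([s, 1.0]), -1, 1)
        s = model_step(s, a)           # step the learned model, not the world
        ret -= (s - 1.0) ** 2          # reward: negative squared distance
    return ret

theta = np.zeros(2)
for _ in range(200):
    cand = theta + rng.normal(scale=0.1, size=2)
    if imagined_return(cand) > imagined_return(theta):
        theta = cand                   # keep the better candidate policy

print("learned policy parameters:", theta)
```

The key point is in step 3: the policy is trained without further interaction with the real environment, so the expensive real samples are spent on fitting the model, while the policy optimizer consumes cheap synthetic experience. Methods such as ME-TRPO follow this same structure, with a far stronger model class and policy optimizer.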