RL methods such as Monte Carlo, SARSA, Q-learning, or Actor-Critic are model-free. The main goal of the agent is to learn an (imperfect) estimate of either the true value function (MC, SARSA, Q-learning) or the optimal policy (AC). As learning progresses, the agent needs a way to explore the environment in order to collect experiences for its training. Usually, this happens through trial and error. For example, an ε-greedy policy takes a random action with probability ε, purely for the sake of exploring the environment.
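To illustrate the trial-and-error exploration just described, here is a minimal sketch of ε-greedy action selection. The function name `epsilon_greedy_action` and the example Q-values are ours for illustration, not part of any particular library:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1):
    """With probability epsilon take a random action (explore),
    otherwise take the action with the highest estimated value (exploit)."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore
    return int(np.argmax(q_values))              # exploit

# Example: estimated action values for four actions in the current state
q = np.array([0.2, 0.5, 0.1, 0.4])
action = epsilon_greedy_action(q, epsilon=0.1)
```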
In this section, we'll introduce model-based RL methods, where the agent doesn't rely on trial and error to choose its next action. Instead, it plans the action with the help of a model of the environment. The model tries to simulate how the environment will react to a given action. Then, the agent can use these simulated outcomes to decide how to act.
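To make the idea concrete, here is a minimal sketch of planning with such a model: a one-step lookahead that simulates every candidate action and picks the one with the best predicted return. The interface is an assumption for illustration only; `model(state, action)` is presumed to return a predicted next state and reward, and `value_fn(state)` a value estimate:

```python
import numpy as np

def plan_action(state, model, value_fn, actions, gamma=0.99):
    """One-step lookahead: simulate each action with the model and
    return the action with the highest predicted return."""
    best_action, best_return = None, -np.inf
    for a in actions:
        next_state, reward = model(state, a)          # model predicts the environment's reaction
        predicted_return = reward + gamma * value_fn(next_state)
        if predicted_return > best_return:
            best_action, best_return = a, predicted_return
    return best_action

# Toy usage with a hypothetical deterministic model over integer states
toy_model = lambda s, a: (s + a, 1.0 if s + a == 3 else 0.0)
toy_value = lambda s: 0.0
print(plan_action(state=0, model=toy_model, value_fn=toy_value, actions=[1, 2, 3]))  # -> 3
```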