In Chapter 3, Markov Decision Process, we used states, actions, rewards, a transition model, and a discount factor to solve the Markov decision process, that is, the MDP problem. If all of these elements of an MDP problem are available, we can use a planning algorithm to compute a solution to the objective. This type of learning is called model-based learning: an AI agent interacts with the environment and, based on those interactions, tries to approximate the environment's model, that is, the state transition model. Given the model, the agent can then find the optimal policy through value iteration or policy iteration.
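As a minimal sketch of the model-based setting, the snippet below runs value iteration on a hypothetical two-state, two-action MDP (the transition table `P`, discount `gamma`, and reward values are illustrative assumptions, not an example from the chapter):

```python
import numpy as np

# Hypothetical toy MDP: P[s][a] is a list of (probability, next_state, reward)
# tuples, i.e. the transition model a model-based agent would have estimated.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9           # discount factor
V = np.zeros(len(P))  # value estimate per state

for _ in range(500):
    # Bellman optimality backup: V(s) <- max_a sum p * (r + gamma * V(s'))
    V_new = np.array([
        max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
        for s in P
    ])
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-8:  # stop once the values have (approximately) converged
        break

# Extract the greedy policy from the converged values
policy = {
    s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in P
}
print(policy)  # -> {0: 1, 1: 1}: both states prefer the rewarding action
```

Policy iteration would reach the same greedy policy; value iteration is shown here only because it is the shorter of the two to sketch.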
But it's not necessary for our AI agent to learn an explicit model of the environment. It can derive an optimal policy directly from its interactions with the environment without building...
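A standard model-free example of this idea is Q-learning. In the hedged sketch below, the transition table `P` (the same hypothetical toy MDP as above) is used only to simulate the environment; the agent never reads it and learns purely from sampled experience:

```python
import random

# Hypothetical toy MDP, used ONLY as an environment simulator here;
# the learning agent never sees this table.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def step(s, a):
    """Environment simulator: sample (next_state, reward) for action a in state s."""
    u, cum = random.random(), 0.0
    for p, s2, rew in P[s][a]:
        cum += p
        if u <= cum:
            return s2, rew
    return P[s][a][-1][1], P[s][a][-1][2]

random.seed(0)
gamma, alpha, eps = 0.9, 0.1, 0.1   # discount, learning rate, exploration rate
Q = {(s, a): 0.0 for s in P for a in P[s]}

s = 0
for _ in range(50_000):
    # Epsilon-greedy action selection from the current Q estimates
    if random.random() < eps:
        a = random.choice(list(P[s]))
    else:
        a = max(P[s], key=lambda a_: Q[(s, a_)])
    s2, r = step(s, a)
    # Temporal-difference update toward the bootstrapped target
    target = r + gamma * max(Q[(s2, a_)] for a_ in P[s2])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = s2

# Greedy policy learned without ever using the transition model directly
policy = {s_: max(P[s_], key=lambda a_: Q[(s_, a_)]) for s_ in P}
print(policy)
```

With enough interaction, the learned greedy policy matches what value iteration computes from the full model, which is the appeal of the model-free approach: no state transition model is ever estimated.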