In the previous chapter, we evaluated and solved a Markov Decision Process (MDP) using dynamic programming (DP). Model-based methods such as DP have some drawbacks. They require the environment to be fully known, including the transition and reward matrices. They also have limited scalability, especially for environments with a large number of states.
In this chapter, we will continue our learning journey with a model-free approach, the Monte Carlo (MC) methods, which require no prior knowledge of the environment and are much more scalable than DP. We will start by estimating the value of Pi with the Monte Carlo method. Moving on, we will talk about how to use the MC method to predict state values and state-action values in a first-visit and every-visit manner. We will demonstrate training an agent to play Blackjack...
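As a quick taste of the Monte Carlo idea before the recipes themselves, here is a minimal sketch of estimating Pi by random sampling. It is an illustrative assumption of how such an estimator might look, using only NumPy: points are drawn uniformly from the unit square, and the fraction landing inside the quarter circle approximates Pi/4.

```python
import numpy as np

def estimate_pi(n_samples=100000):
    """Estimate Pi by Monte Carlo sampling in the unit square."""
    # Draw random (x, y) points uniformly from [0, 1) x [0, 1)
    points = np.random.uniform(0, 1, size=(n_samples, 2))
    # A point lies inside the quarter circle if x^2 + y^2 <= 1
    inside = (points ** 2).sum(axis=1) <= 1
    # The quarter circle has area Pi/4, so Pi ~= 4 * (fraction of points inside)
    return 4 * inside.mean()

print(estimate_pi())  # prints a value close to 3.14159
```

The more samples we draw, the closer the estimate tends to get to the true value, which is exactly the property that makes Monte Carlo methods useful for estimating values from sampled experience.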