In Chapter 2, Markov Decision Process and Dynamic Programming, we applied DP to perform policy evaluation, which computes the value (or state-value) function of a policy. DP works well, but it has a fundamental limitation: it requires a fully known environment, including the transition matrix and the reward matrix. In most real-life situations, however, the transition matrix is not known beforehand. A reinforcement learning algorithm that needs a known MDP is categorized as a model-based algorithm, while one that requires no prior knowledge of transitions and rewards is called a model-free algorithm. Monte Carlo-based reinforcement learning is a model-free approach.
In this recipe, we will evaluate the value function using the Monte Carlo method. We will use the FrozenLake environment again as an example, assuming we...
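Monte Carlo evaluation estimates a state's value by averaging the returns observed after visiting that state over many sampled episodes, with no knowledge of the transition or reward matrices. As a rough sketch of the idea, the following first-visit Monte Carlo evaluation runs on a tiny hand-rolled "slippery walk" environment; note that this toy environment, its state count, and its slip probability are all illustrative assumptions, standing in for FrozenLake:

```python
import random
from collections import defaultdict

# A tiny hand-rolled environment (hypothetical, for illustration only):
# a 1-D walk over states 0..4 where state 4 is the goal (reward 1)
# and every episode starts at state 0.
N_STATES = 5
GOAL = 4

def step(state, action):
    """Move left (action 0) or right (action 1); the walk is slippery,
    so with probability 0.2 the agent moves the opposite way."""
    move = 1 if action == 1 else -1
    if random.random() < 0.2:
        move = -move
    next_state = min(max(state + move, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    done = next_state == GOAL
    return next_state, reward, done

def run_episode(policy):
    """Roll out one episode under the given policy and return the
    list of (state, reward) pairs along the way."""
    state, done, trajectory = 0, False, []
    while not done:
        next_state, reward, done = step(state, policy(state))
        trajectory.append((state, reward))
        state = next_state
    return trajectory

def mc_evaluate(policy, gamma=0.9, n_episodes=5000):
    """First-visit Monte Carlo policy evaluation: average the return
    observed after the first visit to each state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(n_episodes):
        trajectory = run_episode(policy)
        # Index of the first visit to each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(trajectory):
            first_visit.setdefault(s, t)
        # Walk backwards, accumulating the discounted return G_t.
        returns = [0.0] * len(trajectory)
        G = 0.0
        for t in reversed(range(len(trajectory))):
            G = gamma * G + trajectory[t][1]
            returns[t] = G
        for s, t in first_visit.items():
            returns_sum[s] += returns[t]
            returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

random.seed(0)
# Evaluate a uniformly random policy.
V = mc_evaluate(lambda s: random.choice([0, 1]))
print(V)  # estimated value of each visited state under the random policy
```

States closer to the goal receive higher estimates, since their first-visit returns are discounted over fewer remaining steps. The same averaging scheme carries over to FrozenLake unchanged; only the environment's step function differs.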