Comparing the DP, MC, and TD methods
So far, we have learned several interesting and important reinforcement learning algorithms for finding the optimal policy: DP (value iteration and policy iteration), MC methods, and TD learning methods. These are the key algorithms of classical reinforcement learning, and understanding the differences between them is essential. So, in this section, we will recap how the DP, MC, and TD learning methods differ.
Dynamic programming (DP), that is, the value and policy iteration methods, is a model-based approach, meaning that we compute the optimal policy using the model dynamics of the environment (the transition probabilities and the reward function). When the model dynamics are not available, we cannot apply DP.
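To make the model-based nature of DP concrete, here is a minimal sketch of value iteration on a tiny, invented two-state MDP. The transition table `P`, the discount factor, and all numbers are illustrative assumptions, not part of the text; the point is that the algorithm reads the transition probabilities and rewards directly, which is exactly what "model-based" means.

```python
import numpy as np

# A tiny two-state, two-action MDP invented for illustration.
# P[s][a] is a list of (probability, next_state, reward) transitions —
# this table IS the model dynamics that DP requires.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9  # discount factor (assumed)

def value_iteration(P, gamma, theta=1e-8):
    """Sweep Bellman optimality backups until the value function converges."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            # Q-value of each action, computed FROM the model dynamics
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in P[s]]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract the greedy policy from the converged value function
    policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a]))
        for s in P
    }
    return V, policy

V, policy = value_iteration(P, gamma)
```

For this toy MDP, both states prefer action 1 (the rewarding self-loop in state 1 dominates), and the values follow the geometric series of discounted rewards.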
We also learned about the Monte Carlo (MC) method. MC is a model-free method, meaning that we compute the optimal policy without using the model dynamics of the environment. But one problem we face with the MC...
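To contrast with DP, here is a minimal sketch of first-visit MC prediction on a toy episodic environment invented for illustration (a four-state corridor where stepping right past the last state ends the episode with reward 1, under a uniformly random policy). Note that, unlike the DP sketch, the code never reads a transition table; it only samples complete episodes and averages the observed returns, which is what "model-free" means.

```python
import random
from collections import defaultdict

def sample_episode():
    """Sample one episode from the toy corridor environment under a
    uniformly random policy. Returns a list of (state, reward) pairs."""
    state, episode = 0, []
    while True:
        action = random.choice([-1, 1])       # random policy
        next_state = max(0, state + action)   # wall at the left end
        reward = 1.0 if next_state > 3 else 0.0
        episode.append((state, reward))
        if next_state > 3:                    # stepping past state 3 terminates
            return episode
        state = next_state

def first_visit_mc(num_episodes=2000, gamma=0.9, seed=0):
    """Estimate V(s) as the average return following the first visit to s."""
    random.seed(seed)
    returns = defaultdict(list)
    for _ in range(num_episodes):
        episode = sample_episode()
        G = 0.0
        first_visit = {}
        # Walk the episode backwards, accumulating the discounted return G;
        # repeated writes leave the return from the EARLIEST visit to each state.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            first_visit[state] = G
        for state, g in first_visit.items():
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

V = first_visit_mc()
```

States closer to the terminal transition should receive higher estimated values, since their sampled returns are discounted over fewer steps.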