Summary
In this chapter, we covered three important approaches to solving MDPs: dynamic programming (DP), Monte Carlo (MC) methods, and temporal-difference (TD) learning. We saw that while DP provides exact solutions to MDPs, it requires knowing the precise dynamics of the environment. Monte Carlo and TD learning methods, on the other hand, interact with the environment and learn from experience. TD learning, in particular, can learn from even a single-step transition in the environment. Within the chapter, we also presented on-policy methods, which estimate the value functions of the very policy that collects the data, and off-policy methods, which estimate the value functions of a target policy that differs from the behavior policy generating the data. Finally, we discussed the importance of the simulator in RL experiments and what to pay attention to when working with one.
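As a quick reminder of why TD learning can update its estimates from a single transition, here is a minimal sketch of the tabular TD(0) value update; the function name, step size, and discount factor used below are illustrative choices for this sketch rather than anything fixed by the chapter:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to the value table V using a single
    transition (s, r, s_next), without waiting for the episode to end:
        V(s) <- V(s) + alpha * (r + gamma * V(s_next) - V(s))
    """
    td_target = r + gamma * V[s_next]   # bootstrap from the next state's estimate
    td_error = td_target - V[s]         # how far off the current estimate is
    V[s] += alpha * td_error            # move the estimate toward the target
    return td_error


# Illustrative usage with a toy value table
V = {"s0": 0.0, "s1": 0.0}
td0_update(V, s="s0", r=1.0, s_next="s1")
```

Contrast this with a Monte Carlo update, which must wait for a complete return before it can adjust V(s).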
Next, we take our journey to the next level and dive into deep reinforcement learning, which will enable us to solve complex real-world problems. In particular, in the next chapter, we cover deep Q-learning in detail.
See...