Chapter 5: Solving the Reinforcement Learning Problem
In the previous chapter, we provided the mathematical foundations for modeling a reinforcement learning problem. In this chapter, we lay the foundation for solving it; many of the subsequent chapters focus on specific solution approaches that build on this foundation. To this end, we first cover the dynamic programming (DP) approach, through which we introduce some key ideas and concepts. DP methods yield optimal solutions to Markov decision processes (MDPs), yet they require complete knowledge, as well as a compact representation, of the environment's state-transition and reward dynamics. This can be severely limiting and impractical in realistic scenarios, where the agent is trained either directly in the environment itself or in a simulation of it. The Monte Carlo and temporal difference (TD) approaches that we cover later, unlike DP, use transitions sampled from the environment and thereby relax these limitations...
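To make this distinction concrete, here is a minimal Python sketch of a tiny two-state MDP. The environment and the names used (`P`, `expected_reward`, `sample_transition`) are illustrative assumptions, not from any library: a DP method reads the full transition model directly and computes exact expectations, while Monte Carlo and TD methods can only draw individual samples, as an agent interacting with the environment would.

```python
import random

# What DP assumes: the full model is known and compactly stored.
# P[state][action] is a list of (probability, next_state, reward) tuples.
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)],
           "go":   [(1.0, "s0", 0.0)]},
}

def expected_reward(state, action):
    """DP-style access: an exact expectation over the known model."""
    return sum(p * r for p, _, r in P[state][action])

def sample_transition(state, action):
    """MC/TD-style access: draw one (next_state, reward) observation,
    as an agent acting in the environment (or a simulator) would."""
    outcomes = P[state][action]
    probs = [p for p, _, _ in outcomes]
    i = random.choices(range(len(outcomes)), weights=probs)[0]
    _, next_state, reward = outcomes[i]
    return next_state, reward

print(expected_reward("s0", "go"))    # exact value: 0.8
print(sample_transition("s0", "go"))  # one random sample, e.g. ('s1', 1.0)
```

The key point of the sketch is access, not scale: `expected_reward` needs the dictionary `P` in its entirety, whereas `sample_transition` could be replaced by stepping a real environment whose internals are hidden, which is exactly the setting where Monte Carlo and TD methods apply.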