Chapter 6: Deep Q-Learning at Scale
In the previous chapter, we covered dynamic programming (DP) methods for solving Markov decision processes, and noted that they suffer from two important limitations: DP i) assumes complete knowledge of the environment's reward and transition dynamics; and ii) uses tabular representations of states and actions, which does not scale, as the number of possible state-action combinations is enormous in many realistic applications. We addressed the former by introducing Monte Carlo (MC) and temporal-difference (TD) methods, which learn from interactions with the environment (often in simulation) without needing to know the environment's dynamics. The latter, on the other hand, is yet to be addressed, and this is where deep learning comes in. Deep reinforcement learning (deep RL or DRL) is about utilizing the representational power of neural networks to learn policies for a wide variety of situations.
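To make the tabular-versus-function-approximation distinction concrete, here is a minimal sketch in Python using PyTorch; the dimensions (a 4-dimensional observation and 2 actions, resembling a CartPole-like task) and the network shape are illustrative assumptions, not specifics from this chapter:

```python
# A minimal sketch (illustrative, not the chapter's code) contrasting a
# tabular Q-function with a neural-network approximator.
import numpy as np
import torch
from torch import nn

# Tabular Q-learning stores one value per (state, action) pair, so the
# state space must be small and discrete -- here, 500 states x 2 actions.
n_states, n_actions = 500, 2
q_table = np.zeros((n_states, n_actions))
q_table[42, 1] = 0.7  # updating one entry teaches us nothing about state 43

# A Q-network instead maps a continuous observation vector to one Q-value
# per action; nearby states share parameters, so experience generalizes.
obs_dim = 4  # assumed observation size for this sketch
q_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

obs = torch.randn(1, obs_dim)           # a single observation
q_values = q_net(obs)                   # Q(s, a) for every action at once
greedy_action = q_values.argmax(dim=1)  # act greedily w.r.t. the estimates
print(q_values, greedy_action)
```

The key design point is that the network's parameters are shared across all states: an update driven by one observation shifts the Q-value estimates of similar observations as well, which is exactly the generalization a table cannot provide.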
As great as it sounds, though, it is...