In the previous chapter, we solved MDPs with the Monte Carlo (MC) method, a model-free approach that requires no prior knowledge of the environment. However, in MC learning, the value function and Q-function are not updated until the end of an episode. This can be problematic, as some processes are very long or even fail to terminate. In this chapter, we will employ the temporal difference (TD) method to address this issue. In the TD method, we update the action values at every time step of an episode, which increases learning efficiency significantly.
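To make the contrast concrete, here is a minimal sketch (not the chapter's actual recipe) of an episode loop that applies a Q-learning-style TD update at every step rather than waiting for the episode to finish. It assumes the classic Gym API (env.reset() returning a state, env.step() returning a 4-tuple) and placeholder hyperparameters alpha, gamma, and epsilon:

```python
import numpy as np

def run_episode_td(env, Q, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Run one episode, updating Q at every time step (TD), not at episode end (MC)."""
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection from the current Q estimates
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done, _ = env.step(action)
        # TD update: bootstrap from the estimated value of the next state,
        # so no return needs to be computed at the end of the episode
        td_target = reward + gamma * np.max(Q[next_state])
        Q[state][action] += alpha * (td_target - Q[state][action])
        state = next_state
    return Q
```

Because the update bootstraps from the current estimate of the next state's value, learning proceeds step by step even in very long or non-terminating episodes, which is exactly the limitation of MC learning noted above.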
The chapter will start by setting up the Cliff Walking and Windy Gridworld environment playgrounds, which serve as the main examples for the TD control methods discussed in this chapter. Through our step-by-step guides, readers will gain practical experience of Q-learning for off-policy control...
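As a rough preview of the environment setup (the detailed steps appear later in the chapter), Cliff Walking ships with the Gym toolkit as the toy-text environment CliffWalking-v0, whereas Windy Gridworld is typically implemented as a small custom environment. The snippet below assumes a classic Gym version with the older reset/step signatures:

```python
import gym

# Cliff Walking: a 4 x 12 grid with 48 discrete states and 4 actions
env = gym.make('CliffWalking-v0')
print(env.observation_space)  # Discrete(48)
print(env.action_space)       # Discrete(4)

state = env.reset()
next_state, reward, done, info = env.step(env.action_space.sample())
```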