Until now, we have treated most of the training and challenge environments we've looked at as episodic; that is, the game or environment has a definite beginning and end. This fits most games well – a game, after all, ends. In the real world, however, and in some games, an episode could last days, weeks, months, or even years. For these types of environments, we no longer think in terms of episodes; instead, we work with the concept of an environment that requires continuous control. The algorithms we have covered so far can be applied to this class of problem, but they don't handle it very well. So, as with most things in RL, there is a special class of algorithms devoted to these types of environments, and we'll explore them in this chapter.
In this chapter, we'll look at improving the policy...