In this final recipe, offered as a bonus (and fun) section, we will develop a simple yet powerful algorithm to solve CartPole. It is based on cross-entropy and maps input states directly to output actions. In fact, it is more straightforward than any of the other policy gradient algorithms in this chapter.
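To make the idea of mapping a state directly to an action with a cross-entropy objective concrete, here is a minimal, dependency-free sketch. It is not the implementation used in this recipe. The linear softmax model, the helper names (`action_probs`, `cross_entropy`, `train_step`), the learning rate, and the four made-up state-action pairs are all assumptions chosen for illustration; only the shape of the problem (a 4-value state and 2 actions) mirrors CartPole.

```python
import math


def action_probs(weights, state):
    """Softmax over one linear score per action (hypothetical toy model)."""
    scores = [sum(w * s for w, s in zip(row, state)) for row in weights]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(weights, pairs):
    """Mean negative log-probability assigned to the recorded actions."""
    return -sum(math.log(action_probs(weights, s)[a]) for s, a in pairs) / len(pairs)


def train_step(weights, pairs, lr=0.5):
    """One gradient-descent step on the cross-entropy loss (updates in place)."""
    grads = [[0.0] * len(row) for row in weights]
    for state, action in pairs:
        probs = action_probs(weights, state)
        for k in range(len(weights)):
            # Gradient of the loss w.r.t. the score of action k: p_k - y_k
            err = probs[k] - (1.0 if k == action else 0.0)
            for j, s in enumerate(state):
                grads[k][j] += err * s / len(pairs)
    for k in range(len(weights)):
        for j in range(len(weights[k])):
            weights[k][j] -= lr * grads[k][j]


# Made-up (state, action) pairs: 4 state values and 2 actions, as in CartPole.
pairs = [
    ([0.0, 0.1, 0.05, 0.2], 1),
    ([0.0, -0.1, -0.05, -0.2], 0),
    ([0.1, 0.2, 0.03, 0.1], 1),
    ([-0.1, -0.2, -0.03, -0.1], 0),
]

weights = [[0.0] * 4 for _ in range(2)]  # one weight row per action
loss_before = cross_entropy(weights, pairs)
for _ in range(200):
    train_step(weights, pairs)
loss_after = cross_entropy(weights, pairs)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

Treating action selection as a classification problem keeps the training loop this small: the model is only asked to assign high probability to the actions it is shown, so the quality of the result depends entirely on which state-action pairs are fed in.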
We have applied several policy gradient algorithms to solve the CartPole environment. They rely on complicated neural network architectures and a loss function, which may be overkill for simple environments such as CartPole. Why not predict the actions for given states directly? The idea is straightforward: we model the mapping from state to action and train it only with the most successful experiences from the past. We are only interested in what the correct actions should be. The objective function, in this...