What’s wrong with 𝜖-greedy?
Throughout the book, we have used the 𝜖-greedy exploration strategy as a simple, but still acceptable, approach to exploring the environment. The underlying idea behind 𝜖-greedy is to take a random action with probability 𝜖; otherwise (with probability 1 − 𝜖), we act greedily according to the policy. By varying the hyperparameter 0 ≤ 𝜖 ≤ 1, we can change the exploration ratio. This approach was used in most of the value-based methods described in the book.

Quite a similar idea is used in policy-based methods, where the network returns a probability distribution over the actions to take. To prevent the network from becoming too certain about its actions (by returning a probability of 1 for a specific action and 0 for the others), we added the entropy loss, which is just the entropy of the probability distribution multiplied by some hyperparameter. In the early stages of the training...
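To make the two mechanisms concrete, here is a minimal sketch, assuming NumPy and PyTorch; the helper names epsilon_greedy_action and entropy_bonus and the scale hyperparameter beta are illustrative, not part of any particular library:

```python
import numpy as np
import torch
import torch.nn.functional as F

def epsilon_greedy_action(q_values: np.ndarray, epsilon: float) -> int:
    # With probability epsilon, pick a random action (explore);
    # otherwise act greedily on the Q-values (exploit).
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def entropy_bonus(logits: torch.Tensor, beta: float) -> torch.Tensor:
    # Entropy of the action distribution produced by a policy network,
    # averaged over the batch and scaled by the hyperparameter beta.
    probs = F.softmax(logits, dim=1)
    log_probs = F.log_softmax(logits, dim=1)
    entropy = -(probs * log_probs).sum(dim=1).mean()
    return beta * entropy
```

In the policy-based training loop, the scaled entropy term is typically subtracted from the policy loss, so that minimizing the total loss pushes the distribution away from overly confident (near-deterministic) action choices.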