What's wrong with ε-greedy?
Throughout the book, we have used the ε-greedy exploration strategy as a simple, but still acceptable, approach to exploring the environment. The underlying idea behind ε-greedy is to take a random action with probability ε; otherwise (with probability 1 − ε), we act according to the policy (greedily). By varying the hyperparameter 0 ≤ ε ≤ 1, we can change the exploration ratio. This approach was used in most of the value-based methods described in the book. Quite a similar idea was used in policy-based methods, where our network returns the probability distribution over actions to take. To prevent the network from becoming too certain about actions (by returning a probability of 1 for a specific action and 0 for the others), we added the entropy loss, which is just the entropy of the probability distribution multiplied by some hyperparameter. In the early stages of the training...
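To make the two ideas concrete, here is a minimal sketch of ε-greedy action selection and an entropy bonus term in plain NumPy; the function names, the 1e-8 clipping constant, and the beta coefficient are illustrative assumptions, not code taken from the earlier chapters.

```python
import numpy as np


def select_action_epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """Take a random action with probability epsilon, otherwise the greedy one."""
    if np.random.random() < epsilon:
        # Explore: pick a uniformly random action
        return int(np.random.randint(len(q_values)))
    # Exploit: act greedily with respect to the current value estimates
    return int(np.argmax(q_values))


def entropy_bonus(action_probs: np.ndarray, beta: float = 0.01) -> float:
    """Entropy of the action distribution, scaled by beta (illustrative value).

    The negative of this quantity is added to the policy loss, so the optimizer
    is rewarded for keeping the distribution spread out rather than collapsing
    to a probability of 1 for a single action.
    """
    probs = np.clip(action_probs, 1e-8, 1.0)  # avoid log(0)
    entropy = -np.sum(probs * np.log(probs), axis=-1)
    return beta * float(np.mean(entropy))
```

In value-based code, epsilon is typically annealed from 1.0 toward a small value over the course of training, while in policy-based code the entropy term is simply subtracted from the total loss; both knobs trade exploration against exploitation in the same spirit.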