In this recipe, we will solve the multi-armed bandit problem using the softmax exploration algorithm. We will see how it differs from the epsilon-greedy policy.
As we've seen with epsilon-greedy, when performing exploration we randomly select one of the non-best arms with a probability of ε/|A|. Each non-best arm is treated equivalently, regardless of its value in the Q function. Also, the best arm is chosen with a fixed probability, regardless of its value. In softmax exploration, an arm is chosen based on a probability from the softmax distribution of the Q function values. The probability is calculated as follows:

p(a) = exp(Q(a)/τ) / Σ_{a'∈A} exp(Q(a')/τ)
Here, the τ parameter is the temperature factor, which specifies the randomness of the exploration. The higher the value of τ, the closer the distribution gets to uniform exploration; the lower the value of τ, the more strongly arms with higher Q values are favored.
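To make this concrete, here is a minimal sketch of softmax action selection in Python. The function names, the use of NumPy, and the example Q values are illustrative choices, not part of the recipe itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_probs(q_values, tau):
    """Softmax distribution over Q values with temperature tau."""
    # Subtract the max before exponentiating for numerical stability
    scaled = (q_values - np.max(q_values)) / tau
    exp_q = np.exp(scaled)
    return exp_q / exp_q.sum()

def select_arm(q_values, tau):
    """Sample an arm index according to the softmax probabilities."""
    return rng.choice(len(q_values), p=softmax_probs(q_values, tau))

# Example: three arms with hypothetical Q value estimates
q = np.array([1.0, 2.0, 1.5])
print(softmax_probs(q, tau=0.1))   # low tau: nearly greedy toward arm 1
print(softmax_probs(q, tau=10.0))  # high tau: close to uniform exploration
print(select_arm(q, tau=1.0))      # one sampled arm index
```

Note how, unlike epsilon-greedy, every arm's selection probability depends on its Q value: arms with higher estimated values are sampled more often, while clearly inferior arms are rarely wasted on.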