Instead of exploring solely with a random policy, we can do better by combining exploration and exploitation. This is where the well-known epsilon-greedy policy comes in.
Epsilon-greedy for multi-armed bandits exploits the best action the majority of the time, while still exploring different actions from time to time. Given a parameter, ε, with a value from 0 to 1, the probabilities of performing exploration and exploitation are ε and 1 - ε, respectively:
- Epsilon: Each action is taken with a probability of ε / |A|. Here, |A| is the number of possible actions.
- Greedy: The action with the highest state-action value is favored; its probability of being chosen is increased by 1 - ε, so the best action is selected with a total probability of 1 - ε + ε / |A|, as shown in the sketch after this list.
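To make the two cases concrete, here is a minimal sketch of epsilon-greedy action selection for a multi-armed bandit. It assumes NumPy and a 1-D array `Q` holding the current estimated value of each arm; the names `epsilon_greedy_probs` and `epsilon_greedy_action` are chosen here purely for illustration.

```python
import numpy as np

def epsilon_greedy_probs(Q, epsilon):
    # Every arm gets a baseline probability of epsilon / |A| (exploration).
    n_actions = len(Q)
    probs = np.full(n_actions, epsilon / n_actions)
    # The arm with the highest estimated value gets an extra 1 - epsilon (exploitation).
    probs[np.argmax(Q)] += 1 - epsilon
    return probs

def epsilon_greedy_action(Q, epsilon):
    # Sample one arm according to the epsilon-greedy probabilities.
    probs = epsilon_greedy_probs(Q, epsilon)
    return np.random.choice(len(Q), p=probs)

# Example: three arms with current value estimates, epsilon = 0.1.
Q = np.array([0.2, 0.5, 0.1])
print(epsilon_greedy_probs(Q, 0.1))   # ~[0.033, 0.933, 0.033]
action = epsilon_greedy_action(Q, 0.1)
```

For example, with ε = 0.1 and three arms, each arm receives a baseline probability of about 0.033, and the arm with the highest estimated value is chosen with a probability of about 0.933.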