In a reinforcement learning task, our goal is to put our growing knowledge of a problem to use. We are not simply trying to gain the clearest possible picture of the problem; we are trying to benefit from what we currently know, without being distracted by interesting alternative paths that may offer us no advantage and may in fact harm us.
Let's briefly discuss what our ongoing investigation of a probability distribution looks like:
The preceding diagram shows what our observed success rates might look like, at any particular time in a bandit problem, for each of the arms available to us. If we had already conducted 1,000 trials, for example, we might have discovered these success rates for each arm. On any particular trial, we would use that knowledge to decide which...
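
To make the idea of per-arm success rates concrete, here is a minimal sketch of how such a picture might be built up over 1,000 trials. Everything in it is an assumption made for illustration: the four `true_rates` values are invented, the arms are pulled uniformly at random during the trials, and the final choice uses a purely greedy rule (pick the arm with the highest observed rate), which is only one of several ways that knowledge could be used.

```python
import random


def estimate_success_rates(true_rates, n_trials, seed=0):
    """Pull arms uniformly at random and record each arm's observed success rate.

    true_rates is a hypothetical list of hidden success probabilities, one per arm.
    Returns (estimates, pulls), where estimates[i] = wins on arm i / pulls of arm i.
    """
    rng = random.Random(seed)  # seeded so that repeated runs give the same result
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(n_trials):
        arm = rng.randrange(n_arms)          # choose an arm uniformly at random
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:   # Bernoulli reward: success or failure
            wins[arm] += 1
    # An arm that was never pulled gets an estimate of 0.0.
    estimates = [w / p if p else 0.0 for w, p in zip(wins, pulls)]
    return estimates, pulls


def greedy_arm(estimates):
    """Return the index of the arm with the highest observed success rate."""
    return max(range(len(estimates)), key=lambda i: estimates[i])


# Hypothetical hidden success probabilities for four arms.
true_rates = [0.2, 0.5, 0.75, 0.35]
estimates, pulls = estimate_success_rates(true_rates, n_trials=1000)
best = greedy_arm(estimates)

for i, (rate, count) in enumerate(zip(estimates, pulls)):
    print(f"arm {i}: observed success rate {rate:.3f} over {count} pulls")
print(f"greedy choice on the next trial: arm {best}")
```

The observed rates are only estimates of the hidden probabilities, so an arm that looks best after 1,000 trials is not guaranteed to be the truly best arm.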