The UCB algorithm
The term upper confidence bound denotes the fact that instead of considering the average of past rewards returned from each arm like Greedy, the algorithm computes an upper bound for its estimates of the expected reward for each arm.
This concept of a confidence bound is quite common in probability and statistics, where the distribution of a quantity that we care about (in this case, the reward from each arm) cannot be represented well using simply the average of past observations. Instead, a confidence bound is a numerical range that aims to estimate and narrow down where most of the values in the distribution in question will lie. For example, this idea is widely used in Bayesian analyses and Bayesian optimization.
In the following section, we will discuss how UCB establishes its use of a confidence bound.
Optimism in the Face of Uncertainty
Consider the middle of the process of a bandit with only two arms. We have already pulled the first arm 100 times...