The Greedy Algorithm
Recall the brief interaction we had with the Bandit
instance in the previous section, in which we pulled the first arm 10 times and the second arm 10 times. This is unlikely to be the best strategy for maximizing our cumulative reward, since we spend 10 rounds pulling the sub-optimal arm, whichever of the two it is. The naïve approach, therefore, is to pull each of the arms once and greedily commit to the one that returns a positive reward.
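As a sketch, the naïve pull-once-and-commit strategy might look like the following. The `Bandit` class, its arm probabilities, and the tie-breaking rule (fall back to the first arm if no pull returns a positive reward) are illustrative assumptions, not definitions from the text.

```python
import random

class Bandit:
    """A hypothetical Bernoulli bandit: each arm pays 1 with its
    own success probability and 0 otherwise (illustrative only)."""
    def __init__(self, probs):
        self.probs = probs

    def pull(self, arm):
        # Reward 1 with probability probs[arm], else 0.
        return 1 if random.random() < self.probs[arm] else 0

def naive_commit(bandit, n_arms, horizon):
    """Pull each arm once, then commit for the remaining rounds to
    the first arm that returned a positive reward (arm 0 if none did)."""
    first_rewards = [bandit.pull(a) for a in range(n_arms)]
    total = sum(first_rewards)
    chosen = next((a for a, r in enumerate(first_rewards) if r > 0), 0)
    for _ in range(horizon - n_arms):
        total += bandit.pull(chosen)
    return total

random.seed(0)
print(naive_commit(Bandit([0.3, 0.7]), n_arms=2, horizon=20))
```

Note how fragile this is: a single unlucky pull of the better arm is enough to lock the strategy onto the worse one for the entire horizon.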
A generalization of this strategy is the Greedy algorithm, in which we maintain a running average of the rewards observed from each available arm and, at each step, pull the arm with the highest average. While the intuition is simple, it rests on the probabilistic rationale that, after a large number of samples, the empirical mean (the average of the observed rewards) is a good approximation of the true expectation of the reward distribution. If the reward average of an arm is larger than that of any other arm, the probability...