Formulation of the MAB Problem
In its simplest form, the MAB problem consists of multiple slot machines (casino gambling machines), each of which returns a stochastic reward to the player each time it is played (specifically, when its arm is pulled). The player, who would like to maximize their total reward over a fixed number of rounds, does not know the probability distribution or the average reward of any slot machine. The problem, therefore, boils down to designing a learning strategy: the player needs to explore what rewards each slot machine returns and, from those observations, quickly identify the machine with the greatest expected reward.
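To make this setup concrete, the following is a minimal sketch rather than a definitive implementation. It assumes each machine pays a reward of 1 with some fixed hidden probability and 0 otherwise (a Bernoulli reward, chosen here only for illustration). The names BernoulliBandit and play_uniformly_at_random, and the probabilities 0.2, 0.5, and 0.8, are hypothetical. The player shown never learns anything; it simply pulls arms at random, which serves as a baseline that a real learning strategy should beat.

```python
import random


class BernoulliBandit:
    """Hypothetical bandit: each arm pays 1 with a hidden probability, else 0."""

    def __init__(self, win_probabilities, seed=None):
        # These probabilities are hidden from the player.
        self._win_probabilities = list(win_probabilities)
        self._rng = random.Random(seed)

    @property
    def n_arms(self):
        return len(self._win_probabilities)

    def pull(self, arm):
        """Play one machine and return a stochastic reward of 0 or 1."""
        return 1 if self._rng.random() < self._win_probabilities[arm] else 0


def play_uniformly_at_random(bandit, n_rounds, seed=None):
    """Baseline player: picks an arm at random each round and never learns."""
    rng = random.Random(seed)
    pulls = [0] * bandit.n_arms
    reward_sums = [0] * bandit.n_arms
    for _ in range(n_rounds):
        arm = rng.randrange(bandit.n_arms)
        pulls[arm] += 1
        reward_sums[arm] += bandit.pull(arm)
    return pulls, reward_sums


# Assumed win probabilities, chosen only for illustration.
bandit = BernoulliBandit([0.2, 0.5, 0.8], seed=0)
pulls, reward_sums = play_uniformly_at_random(bandit, n_rounds=1000, seed=1)

total_reward = sum(reward_sums)
# Empirical average reward per arm: the player's only view of the hidden
# probabilities. Arms that were never pulled get an estimate of 0.0.
estimates = [s / n if n else 0.0 for s, n in zip(reward_sums, pulls)]
print(total_reward, estimates)
```

Because the random player spreads its pulls evenly, a large share of its rounds goes to the weaker machines. A learning strategy would instead use the running estimates to shift its pulls toward the machine that appears best.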
In this section, we will briefly explore the background of the problem and establish the notation and terminology that we will use throughout this chapter.
Applications of the MAB Problem
The slot machines we mentioned earlier are just a simplification...