The MAB problem is one of the classical problems in RL. The name comes from the slot machine, a casino gambling game in which you pull an arm (lever) and receive a payout (reward) drawn from a probability distribution. A single slot machine is called a one-armed bandit; a set of several slot machines is called a multi-armed bandit, or k-armed bandit.
MABs are shown as follows:
Each slot machine pays out a reward drawn from its own probability distribution, and our goal is to find out which slot machine gives us the maximum cumulative reward over a sequence of time steps. At each time step t, the agent performs an action a_t, that is, pulls the arm of one of the slot machines, and receives a reward r_t. The agent's goal is to maximize the cumulative reward.
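The interaction loop just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names BernoulliBandit and run_random_agent are hypothetical, each arm is assumed to pay 1 with a fixed probability and 0 otherwise (the text does not fix a particular reward distribution), and the agent picks arms uniformly at random rather than following any learning strategy.

```python
import random


class BernoulliBandit:
    """Hypothetical k-armed bandit: arm i pays 1 with probability probs[i], else 0."""

    def __init__(self, probs, seed=None):
        self.probs = list(probs)
        self.rng = random.Random(seed)

    def pull(self, arm):
        # Sample the reward for the chosen arm from its own distribution.
        return 1 if self.rng.random() < self.probs[arm] else 0


def run_random_agent(bandit, steps, seed=None):
    """Pull a uniformly random arm at every time step; return the cumulative reward."""
    rng = random.Random(seed)
    total = 0
    for t in range(steps):
        a_t = rng.randrange(len(bandit.probs))  # action: which arm to pull
        r_t = bandit.pull(a_t)                  # reward received for that pull
        total += r_t                            # accumulate reward over time
    return total


if __name__ == "__main__":
    bandit = BernoulliBandit([0.2, 0.5, 0.8], seed=0)
    print("cumulative reward:", run_random_agent(bandit, steps=1000, seed=1))
```

A random agent like this serves only as a baseline: because it ignores the rewards it observes, it cannot favor the better arms, which is exactly the gap a bandit algorithm is meant to close.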
We define the value of an arm, Q(a), as the average reward received by pulling the...