Bandit algorithms
A Multi-Armed Bandit (MAB) is a classic reinforcement learning problem, in which a player is faced with a slot machine (bandit) that has k levers (arms), each with a different reward distribution. The agent's goal is to maximize its cumulative reward on a trial-by-trial basis. Since MABs are a simple but powerful framework for algorithms that make decisions over time under uncertainty, a large number of research articles have been dedicated to them.
Bandit learning refers to algorithms that aim to optimize a single unknown stationary objective function. An agent chooses an action from a set of actions . The environment reveals reward of the chosen action at time t. As information is accumulated over multiple rounds, the agent can build a good representation of the value (or reward) distribution for each arm, .
Therefore, a good policy might converge so that the choice of arm becomes optimal. According to one policy, UCB1 (published by Peter Auer, Nicol...