K-armed bandit
The K-armed bandit is a metaphor for a casino slot machine with K levers (or arms). The player pulls any one of the levers to win a predefined reward. The objective is to select the lever that provides the player with the highest reward.
Although the challenge could be defined as an optimization problem, it is a classification problem. The reward of each of the K levers cannot be assigned in advance; therefore, the model is generated through reinforcement learning [14:1].
The basic concept of reinforcement learning is illustrated in the following diagram:
The actor selects and plays the arm with the highest estimated reward, collects the reward, and recomputes the statistics (or performance) for the selected arm.
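The select, collect, and recompute loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: the names `GreedyBandit`, `select_arm`, `update`, and `play` are hypothetical, the statistic kept per arm is assumed to be a running mean of its rewards, and the win probabilities in the usage example are made up.

```python
import random


class GreedyBandit:
    """Hypothetical actor: keeps per-arm statistics and plays the best-looking arm."""

    def __init__(self, num_arms):
        self.counts = [0] * num_arms        # number of times each arm was played
        self.estimates = [0.0] * num_arms   # running mean reward per arm (assumed statistic)

    def select_arm(self):
        # Play every arm once before trusting the estimates
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        # Then select the arm with the highest estimated reward
        return max(range(len(self.estimates)), key=self.estimates.__getitem__)

    def update(self, arm, reward):
        # Recompute the statistics for the selected arm only,
        # using the incremental mean: new = old + (reward - old) / n
        self.counts[arm] += 1
        self.estimates[arm] += (reward - self.estimates[arm]) / self.counts[arm]


def play(bandit, pull, num_plays):
    """Runs the select / collect / recompute loop and returns the total reward."""
    total = 0.0
    for _ in range(num_plays):
        arm = bandit.select_arm()
        reward = pull(arm)          # collect the reward from the environment
        bandit.update(arm, reward)
        total += reward
    return total


if __name__ == "__main__":
    rng = random.Random(42)
    win_probabilities = [0.2, 0.5, 0.8]   # made-up values, hidden from the actor

    def pull(arm):
        return 1.0 if rng.random() < win_probabilities[arm] else 0.0

    bandit = GreedyBandit(len(win_probabilities))
    print(play(bandit, pull, 1000), bandit.estimates)
```

A purely greedy actor such as this one can settle on a suboptimal arm when its early estimates are misleading, which is why practical strategies usually reserve some plays for exploring the other arms.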
Note
Markov decision process
The K-armed bandit problem can be defined as a one-state Markov decision process (MDP) (see the Markov decision process section in...