K-armed bandit
The K-armed bandit is a metaphor for a casino slot machine with K levers (or arms). The player pulls one lever at a time and collects the predefined reward attached to it. The objective is to identify and play the lever that provides the highest reward:
[Figure: 2-armed bandit]
Although the challenge could be framed as an optimization problem, it is treated here as a classification problem. The reward of each of the K levers is not known in advance, so no lever can be labeled with a specific reward; the model is therefore built through reinforcement learning [14:1].
The basic concept of reinforcement learning is illustrated in the following diagram:
[Figure: Illustration of action and reward for a multi-armed bandit]
The actor selects and plays the arm with the highest estimated reward, collects the reward, and recomputes the statistics (or performance) for the selected arm.
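The select, collect, and recompute loop can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this chapter: the class `GreedyBanditActor`, the function `play`, the Bernoulli (0 or 1) reward model, and the `epsilon` exploration rate are all assumptions introduced for the example. With `epsilon=0.0` the actor is purely greedy, as described above.

```python
import random


class GreedyBanditActor:
    """Hypothetical actor: keeps a running mean reward per arm and plays the best one."""

    def __init__(self, num_arms, epsilon=0.0, seed=None):
        self.counts = [0] * num_arms       # number of times each arm was played
        self.estimates = [0.0] * num_arms  # estimated mean reward of each arm
        self.epsilon = epsilon             # assumption: exploration rate, not in the text
        self.rng = random.Random(seed)

    def select_arm(self):
        # Explore a random arm with probability epsilon;
        # otherwise exploit the arm with the highest estimated reward.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.estimates))
        return max(range(len(self.estimates)), key=self.estimates.__getitem__)

    def update(self, arm, reward):
        # Recompute the statistics of the played arm with an incremental mean:
        # new_estimate = old_estimate + (reward - old_estimate) / n
        self.counts[arm] += 1
        self.estimates[arm] += (reward - self.estimates[arm]) / self.counts[arm]


def play(actor, reward_probs, num_rounds, rng):
    """Simulates Bernoulli arms (assumed reward model) and returns the total reward."""
    total = 0.0
    for _ in range(num_rounds):
        arm = actor.select_arm()
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        actor.update(arm, reward)
        total += reward
    return total


if __name__ == "__main__":
    actor = GreedyBanditActor(num_arms=2, epsilon=0.1, seed=42)
    total = play(actor, reward_probs=[0.3, 0.7], num_rounds=1000, rng=random.Random(7))
    print("total reward:", total)
    print("estimated rewards:", actor.estimates)
```

A purely greedy actor can lock onto the first arm it tries if the other arms are never sampled, which is why the sketch exposes an optional exploration rate; whether and how to explore is a design choice outside the scope of the description above.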
Note
Markov decision process
The K-armed bandit problem can be defined as a one-state Markov decision process (MDP) (see the Markov decision process section in...