Multi-armed bandits, like other reinforcement learning agents, benefit from being able to learn from long-term feedback. One way to implement this is to introduce states into the multi-armed bandit model:
What advantage does being able to differentiate states from each other give to the bandit?
If we're deciding which advertisements to show to an app user, for example, it might be helpful to have some consistent outside knowledge about that user and to take that knowledge into account. With no state information, all we would know is how successful that advertisement was for overall users. With state information, the model would have access to factors such as the user's age, gender, or income, and be able to make more targeted predictions based on that knowledge.
We also wouldn't have to limit our model to just demographic...