Summary
This concludes the first of two chapters dedicated to reinforcement learning. In this chapter, we learned to balance exploration (learning) and exploitation (executing) by:
- Managing and reducing the confidence intervals across the arms
- Applying simple epsilon-greedy selection to explore underplayed arms
- Leveraging the concept of probability matching through Thompson sampling for context-free bandits
- Using upper confidence bounds (UCB) to model the confidence interval as a function of the number of plays (the three selection strategies are illustrated in the sketch after this list)
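
As a quick recap, the following sketch compares the three selection strategies on a toy Bernoulli bandit. It is a minimal illustration under stated assumptions, not the chapter's implementation: the names BernoulliBandit, select_epsilon_greedy, select_ucb1, select_thompson, and run, as well as the arm probabilities and horizon, are hypothetical choices introduced here.

```python
# Illustrative sketch only: compares epsilon-greedy, UCB1, and Thompson sampling
# on a Bernoulli K-armed bandit. Names and parameters are assumptions.
import math
import random

class BernoulliBandit:
    """Environment: each arm pays 1 with a fixed, hidden probability."""
    def __init__(self, probs):
        self.probs = probs

    def pull(self, arm):
        return 1 if random.random() < self.probs[arm] else 0

def select_epsilon_greedy(counts, rewards, epsilon=0.1):
    """Explore a random arm with probability epsilon; otherwise exploit the best empirical mean."""
    if random.random() < epsilon:
        return random.randrange(len(counts))
    means = [r / c if c > 0 else 0.0 for r, c in zip(rewards, counts)]
    return max(range(len(means)), key=means.__getitem__)

def select_ucb1(counts, rewards, t):
    """Pick the arm with the highest mean plus a confidence bonus that shrinks with the number of plays."""
    for arm, c in enumerate(counts):
        if c == 0:
            return arm  # play every arm once before applying the bound
    scores = [r / c + math.sqrt(2 * math.log(t) / c) for r, c in zip(rewards, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

def select_thompson(successes, failures):
    """Probability matching: sample each arm's Beta posterior and play the largest draw."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

def run(bandit, select, horizon=1000):
    """Play the bandit for a fixed horizon and return the cumulative reward."""
    k = len(bandit.probs)
    counts, rewards = [0] * k, [0] * k
    for t in range(1, horizon + 1):
        arm = select(counts, rewards, t)
        reward = bandit.pull(arm)
        counts[arm] += 1
        rewards[arm] += reward
    return sum(rewards)

if __name__ == "__main__":
    bandit = BernoulliBandit([0.2, 0.5, 0.7])   # hidden payout probabilities (arbitrary)
    print("epsilon-greedy:", run(bandit, lambda c, r, t: select_epsilon_greedy(c, r)))
    print("UCB1:          ", run(bandit, select_ucb1))
    # With 0/1 rewards, successes = rewards and failures = counts - rewards
    print("Thompson:      ", run(bandit, lambda c, r, t: select_thompson(r, [ci - ri for ci, ri in zip(c, r)])))
```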
The K-armed bandit is a viable approach for simple problems in which the interaction between the actor (player) and the environment (bandit) involves a single state and an immediate reward.
The next chapter introduces alternatives to multi-armed bandits for more complex, multi-state problems, using action values and the Markov decision process.