Introduction
In the previous chapter, we discussed temporal difference learning, a popular model-free reinforcement learning algorithm that learns to predict a quantity by bootstrapping from estimates of its future values. In this chapter, we will focus on another common topic, not only in reinforcement learning but also in artificial intelligence and probability theory: the Multi-Armed Bandit (MAB) problem.
Framed as a sequential decision-making problem (maximizing the total reward collected while playing a set of slot machines in a casino), the MAB problem applies to any situation that requires sequential learning under uncertainty, such as A/B testing or designing recommender systems. In this chapter, we will formalize the problem, learn about the most common algorithms used to solve it (namely Epsilon Greedy, Upper Confidence Bound, and Thompson Sampling), and finally implement them in Python; a minimal preview of the setting follows below.
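To make the setting concrete before we formalize it, the following minimal sketch simulates a three-armed Bernoulli bandit and selects arms with a simple Epsilon Greedy rule. The payout probabilities, exploration rate, and horizon used here are illustrative placeholders rather than values from the chapter; the full implementations of all three algorithms appear later in the chapter.

```python
import random

# Hypothetical 3-armed Bernoulli bandit: each arm pays 1 with an
# (unknown to the agent) probability, and 0 otherwise.
true_probs = [0.3, 0.5, 0.7]   # illustrative values, not from the text
n_arms = len(true_probs)

counts = [0] * n_arms          # number of pulls per arm
values = [0.0] * n_arms        # running average reward per arm
epsilon = 0.1                  # exploration rate for Epsilon Greedy

random.seed(0)
for t in range(1000):
    # Explore with probability epsilon, otherwise exploit the current best estimate.
    if random.random() < epsilon:
        arm = random.randrange(n_arms)
    else:
        arm = max(range(n_arms), key=lambda a: values[a])

    # Pull the chosen arm and observe a 0/1 reward.
    reward = 1 if random.random() < true_probs[arm] else 0

    # Incremental update of the sample-mean estimate for that arm.
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print("Estimated arm values:", [round(v, 2) for v in values])
```

Running the sketch shows the estimated values converging toward the true payout probabilities, with most pulls eventually going to the best arm; the later sections examine how Upper Confidence Bound and Thompson Sampling manage the same exploration-exploitation trade-off more systematically.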
Overall, this chapter will offer you a deep understanding of the Multi-Armed Bandit problem and hands-on experience implementing its most common solutions.