Multi-armed bandit algorithms are among the most popular algorithms in reinforcement learning. This chapter starts by creating a multi-armed bandit environment and experimenting with random policies. We will then focus on solving the multi-armed bandit problem with four strategies: epsilon-greedy, softmax exploration, upper confidence bound, and Thompson sampling, and see how each deals with the exploration-exploitation dilemma in its own way. We will also work on a billion-dollar problem, online advertising, and demonstrate how to solve it with a multi-armed bandit algorithm. Finally, we will solve the contextual advertising problem using contextual bandits to make more informed decisions in ad optimization.
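To give a flavor of what the first recipe builds, here is a minimal sketch of a multi-armed bandit environment paired with a purely random policy as a baseline. It assumes Bernoulli arms (each arm pays out 1 with a fixed, hidden probability); the class and variable names such as `BanditEnv` and `payout_probs` are illustrative, not the book's exact code.

```python
import numpy as np

class BanditEnv:
    """A toy multi-armed bandit: each arm pays 1 with its own hidden probability."""
    def __init__(self, payout_probs):
        self.payout_probs = payout_probs  # assumed Bernoulli reward probabilities
        self.n_arms = len(payout_probs)

    def step(self, arm):
        # Return reward 1 with the chosen arm's payout probability, else 0.
        return 1 if np.random.random() < self.payout_probs[arm] else 0

# Experiment with a random policy: pick an arm uniformly at random each round.
env = BanditEnv(payout_probs=[0.1, 0.3, 0.8])
n_rounds = 10000
rewards = np.zeros(env.n_arms)
counts = np.zeros(env.n_arms)

for _ in range(n_rounds):
    arm = np.random.randint(env.n_arms)
    rewards[arm] += env.step(arm)
    counts[arm] += 1

print("Average reward per arm:", rewards / np.maximum(counts, 1))
```

The random policy ignores everything it has learned about the arms, which is exactly why the four strategies covered in this chapter are needed: they trade off exploring uncertain arms against exploiting the arm that currently looks best.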
The following recipes will be covered in this chapter:
- Creating a multi-armed bandit environment
- Solving multi...