Going beyond bandits for personalization
When we covered multi-armed and contextual bandit problems in the early chapters of the book, we presented a case study that aimed to maximize the click-through rate (CTR) of online ads. This is just one example of how bandit models can be used to provide users with personalized content and experiences, a common challenge for almost all online (and offline) content providers, from e-retailers to social media platforms. In this section, we go beyond bandit models and describe a multi-step reinforcement learning approach to personalization. Let's start by discussing where bandit models fall short, and then see how multi-step RL can address those issues.
Shortcomings of bandit models
The goal in bandit problems is to maximize the immediate (single-step) return. In an online ad CTR maximization problem, this is usually a good way of thinking about the goal: an ad is displayed, the user clicks, and voila! If not, it's...
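To make the single-step objective concrete, here is a minimal sketch of an epsilon-greedy bandit choosing among a few ads to maximize CTR. The ad names, click probabilities, and the epsilon_greedy_bandit function are hypothetical illustrations for this section, not code from the case study; note that the agent only ever optimizes the immediate click reward, which is exactly the limitation discussed here.

```python
import random

# Hypothetical click-through rates for three ads (unknown to the agent).
TRUE_CTRS = {"ad_a": 0.03, "ad_b": 0.05, "ad_c": 0.04}

def epsilon_greedy_bandit(n_steps=10_000, epsilon=0.1):
    counts = {ad: 0 for ad in TRUE_CTRS}    # times each ad was shown
    values = {ad: 0.0 for ad in TRUE_CTRS}  # estimated CTR per ad
    for _ in range(n_steps):
        if random.random() < epsilon:
            ad = random.choice(list(TRUE_CTRS))  # explore a random ad
        else:
            ad = max(values, key=values.get)     # exploit the best estimate
        # Reward is the immediate outcome only: click (1) or no click (0).
        reward = 1 if random.random() < TRUE_CTRS[ad] else 0
        counts[ad] += 1
        # Incremental sample-mean update of the estimated CTR.
        values[ad] += (reward - values[ad]) / counts[ad]
    return values

print(epsilon_greedy_bandit())
```

Notice that nothing in this loop accounts for how showing an ad now affects the user's behavior later; each step is treated as an independent, one-shot decision.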