In this recipe, we will tackle the exploration-exploitation dilemma in the advertising bandits problem using another algorithm, Thompson sampling. We will see how it differs from the previous three algorithms.
Thompson sampling (TS) is also called Bayesian bandits, as it applies Bayesian reasoning in the following ways:
- It is a probabilistic algorithm.
- It maintains a prior distribution over each arm's reward rate and samples a value from each arm's distribution.
- It then selects the arm with the highest value and observes the reward.
- Finally, it updates the prior distribution based on the observed reward. This process is called Bayesian updating.
As we have seen, in our ad optimization case the reward for each arm is either 1 or 0. We can use the beta distribution for our prior because it is the conjugate prior of the Bernoulli distribution: if an arm's prior is Beta(α, β), then after observing a reward of 1 the posterior is simply Beta(α + 1, β), and after a reward of 0 it is Beta(α, β + 1), so the Bayesian update reduces to incrementing one of two counters.
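The following is a minimal sketch of these steps on a simulated environment. The click-through rates in `true_ctrs` are made-up values for illustration, and the loop is a simplified stand-in for the full recipe code:

```python
import numpy as np

# Hypothetical click-through rates for three ads (for simulation only)
true_ctrs = [0.01, 0.015, 0.03]
n_arms = len(true_ctrs)
n_rounds = 100_000

# Start each arm with a Beta(1, 1) prior, i.e. uniform over [0, 1]
alphas = np.ones(n_arms)
betas = np.ones(n_arms)

rng = np.random.default_rng(0)
for _ in range(n_rounds):
    # Sample one value from each arm's current beta distribution
    samples = rng.beta(alphas, betas)
    # Select the arm with the highest sampled value
    arm = int(np.argmax(samples))
    # Observe a simulated 0/1 reward (a click or no click)
    reward = int(rng.random() < true_ctrs[arm])
    # Bayesian updating: a reward of 1 increments alpha, a 0 increments beta
    alphas[arm] += reward
    betas[arm] += 1 - reward

print("Posterior mean reward rate per arm:", alphas / (alphas + betas))
```

After enough rounds, the posterior means concentrate near the true click-through rates, and the arm with the highest rate is pulled most often, which is how Thompson sampling balances exploration and exploitation.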