In this section we'll solve a multi-armed bandit problem using simulated ad-click data. We'll generate clicks for five different advertisements. Each time an ad is shown to a user, it is either clicked or not clicked. Our goal is to decide which ad to show next based on how each ad has performed up to that point in the simulation.
We start with a baseline loop that picks an advertisement at random each time. This model never learns from the results of its actions, so it serves as a reference point for later comparison. If the user clicks on the ad, we get a reward of 1; if not, we get a reward of 0.
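The baseline described above can be sketched as follows. This is only a minimal illustration, not necessarily the loop used later in this section: the function name `run_random_baseline`, the fixed seed, and the tiny hand-written `toy_clicks` table are all assumptions made for the example. Each row of the table is one round, and each column holds the click outcome (1 or 0) for one of the five ads.

```python
import random

def run_random_baseline(clicks, seed=0):
    """Show one randomly chosen ad per round and total up the rewards.

    `clicks` is a list of rows; clicks[t][a] is 1 if ad `a` would be
    clicked in round `t`, else 0. (Hypothetical helper for illustration.)
    """
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    n_ads = len(clicks[0])
    chosen_ads = []
    total_reward = 0
    for row in clicks:
        ad = rng.randrange(n_ads)  # no learning: every ad is equally likely
        reward = row[ad]           # 1 if the shown ad was clicked, else 0
        chosen_ads.append(ad)
        total_reward += reward
    return chosen_ads, total_reward

# Toy click table (assumed data): 4 rounds, 5 ads.
toy_clicks = [
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

chosen_ads, total_reward = run_random_baseline(toy_clicks)
print("ads shown:", chosen_ads)
print("total reward:", total_reward)
```

Because the choice ignores past rewards, the expected total reward is just the average click rate multiplied by the number of rounds, which is the figure any learning strategy should aim to beat.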
We import the necessary packages and generate a data frame of simulated data using random numbers. We will specify a distribution of 90% zeros (no click) and 10% ones (click) for this example...
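One way to generate such a data frame is sketched below, assuming NumPy and pandas are the packages in question. The number of rounds (`n_rounds = 10_000`), the seed, and the column names `ad_0` through `ad_4` are assumptions chosen for illustration; only the five ads and the 90%/10% split come from the description above.

```python
import numpy as np
import pandas as pd

n_rounds = 10_000   # assumed number of simulated impressions
n_ads = 5           # five advertisements, as described above

rng = np.random.default_rng(42)  # assumed seed, for repeatability

# Draw each cell independently: 0 with probability 0.9, 1 with probability 0.1.
data = rng.choice([0, 1], size=(n_rounds, n_ads), p=[0.9, 0.1])

# One row per round, one column per ad (column names are hypothetical).
df = pd.DataFrame(data, columns=[f"ad_{i}" for i in range(n_ads)])

print(df.head())
print(df.mean())  # per-ad click rate; each should be close to 0.10
```

Note that with the same click probability for every ad, no ad is truly better than another, so the per-ad click rates differ only by sampling noise.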