Action selection using upper confidence bounds
Upper confidence bounds (UCB) is a simple yet effective solution to the exploration-exploitation trade-off. The idea is that at each time step, we select the action that has the highest potential for reward. The potential of an action is calculated as the sum of its action value estimate and a measure of the uncertainty of that estimate. This sum is what we call the upper confidence bound. So, an action is selected because our estimate of its value is high, because it has not been explored enough (i.e., not as many times as the other actions) and there is high uncertainty about its value, or both.
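The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `select_action_ucb`, its parameters, and the exploration weight `c` are assumptions made for this example, and the uncertainty term uses the common square-root bonus `c * sqrt(ln(t) / n)`, where `n` is the number of times the action has been selected.

```python
import math


def select_action_ucb(q_values, counts, t, c=1.0):
    """Return the index of the action with the highest upper confidence bound.

    q_values: current action value estimates, one per action
    counts:   number of times each action has been selected so far
    t:        current time step (assumed to be >= 1)
    c:        exploration weight (hypothetical name; larger means more exploration)
    """
    best_action, best_ucb = None, -math.inf
    for action, (q, n) in enumerate(zip(q_values, counts)):
        if n == 0:
            # An untried action has unbounded uncertainty, so try it first.
            return action
        # Upper confidence bound = value estimate + uncertainty bonus.
        ucb = q + c * math.sqrt(math.log(t) / n)
        if ucb > best_ucb:
            best_action, best_ucb = action, ucb
    return best_action


# Action 0 has the higher estimate, but action 1 has been tried only twice,
# so its uncertainty bonus dominates and it is selected.
chosen = select_action_ucb(q_values=[0.5, 0.4], counts=[100, 2], t=102)
```

With `c=0` the bonus vanishes and the rule reduces to greedy selection, which is one way to see that `c` controls how much weight uncertainty gets relative to the value estimate.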
More formally, we select the action to take at time $t$ using:
$$A_t \triangleq \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$
Let's unpack this a little bit:
- Now we have used a notation that is slightly different from what we introduced earlier. $Q_t(a)$ and $N_t(a)$ have essentially the same meanings as before. This formula looks at the variable values, which may have...