In the previous recipe, we searched for the optimal policy using MC control with greedy search, where the action with the highest state-action value was selected. However, the best choice available in early episodes does not guarantee an optimal solution. If we only focus on what is temporarily the best option and ignore the overall problem, we will be stuck in a local optimum instead of reaching the global optimum. The workaround is the epsilon-greedy policy.
In MC control with the epsilon-greedy policy, we no longer exploit the best action all the time; instead, we choose an action randomly with certain probabilities. As the name implies, the policy has two components (a code sketch follows the list below):
- Epsilon: given a parameter, ε, with a value from 0 to 1, each action is taken with a probability calculated as follows:

  π(a | s) = ε / |A|

  Here, |A| is the number of possible actions.
- Greedy: the action with the highest state-action value is favored, and its probability of being taken is increased accordingly:

  π(a* | s) = 1 - ε + ε / |A|
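To make this concrete, here is a minimal sketch of how such a policy can be generated in PyTorch. The function name gen_epsilon_greedy_policy and the assumption that Q is a dictionary mapping each state to a tensor of action values are illustrative choices, not necessarily the recipe's exact code:

```python
import torch

def gen_epsilon_greedy_policy(n_action, epsilon):
    # Returns a policy function that picks an action epsilon-greedily:
    # every action gets a base probability of epsilon / n_action, and the
    # action with the highest Q-value receives the remaining 1 - epsilon.
    def policy_function(state, Q):
        probs = torch.ones(n_action) * epsilon / n_action
        best_action = torch.argmax(Q[state]).item()
        probs[best_action] += 1.0 - epsilon
        # Sample an action according to the epsilon-greedy probabilities
        return torch.multinomial(probs, 1).item()
    return policy_function


# Hypothetical usage: Q maps each state to a tensor of action values
Q = {0: torch.tensor([0.5, 1.2])}
epsilon_greedy_policy = gen_epsilon_greedy_policy(n_action=2, epsilon=0.1)
action = epsilon_greedy_policy(0, Q)
```

Every call to the returned policy_function draws a fresh action, so exploration continues throughout training; setting epsilon to 0 recovers the purely greedy behavior from the previous recipe.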