One of the dilemmas we face in RL is the balance between exploring all possible actions and exploiting the best-known action. In the multi-armed bandit problem, our search space was small enough to do this with brute force, essentially just by pulling each arm one by one. However, in more complex problems, the number of states can exceed the number of atoms in the known universe. Yes, you read that correctly. In those cases, we need a policy or method that balances exploration and exploitation for us. There are a few ways to do this, and the following are the most common approaches:
- Greedy Optimistic: The agent starts with optimistically high values in its Q-table. Since a greedy agent always chooses the action with the highest estimated value, every untried action looks attractive until it has been sampled and its estimate revised downward, which forces the agent to try each action at least once (see the sketch after this list).
- Greedy...
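To make the Greedy Optimistic approach concrete, here is a minimal sketch on a three-armed bandit. The arm reward means, the optimistic initial value of 5.0, and the incremental sample-average update are illustrative assumptions, not values taken from the text.

```python
# A minimal sketch of greedy action selection with optimistic initial values
# on a small multi-armed bandit (all specific numbers here are assumptions).
import random

random.seed(0)

true_means = [0.1, 0.5, 0.9]        # hidden reward mean for each arm (assumed)
q_values = [5.0] * len(true_means)  # optimistically high initial estimates
counts = [0] * len(true_means)      # pulls per arm, for incremental averaging

for step in range(1000):
    # Greedy choice: always pick the arm with the highest current estimate.
    arm = max(range(len(q_values)), key=lambda a: q_values[a])
    reward = random.gauss(true_means[arm], 1.0)
    counts[arm] += 1
    # Incremental sample-average update pulls the optimistic estimate down
    # toward the arm's true mean, so arms that haven't been tried yet keep
    # their high initial value and get explored at least once.
    q_values[arm] += (reward - q_values[arm]) / counts[arm]

print("estimates:", [round(q, 2) for q in q_values])
print("pulls per arm:", counts)
```

Running this, you should see the pull counts concentrate on the best arm once the optimistic estimates for the weaker arms have been driven down, which is exactly the exploration-then-exploitation behavior the optimistic initialization is meant to induce.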