Q-Learning
Imagine that we have an agent who will be moving through a maze environment, somewhere in which is a reward. The task we have is to find the best path for getting to the reward as quickly as possible. To help us think about this, let's start with a very simple maze environment:
In the maze pictured, the agent can move between any of the nodes, in both directions, by following the lines. The node the agent is in is its state; moving along a line to a different node is an action. There is a reward of 4 if the agent gets to the goal in state D. We want to come up with the optimum path through the maze from any starting node.
Let's think about this problem for a moment. If moving along a line puts us in state D, then that will always be the path we want to take as that will give us the 4 reward in the next time step. Then going back a step...