The whole Q-learning process
Let's summarize the steps of the whole Q-learning process. To be clear, the only purpose of this process is to update the Q-values over a number of iterations until they stop changing (we refer to that point as convergence).
The number of iterations depends on the complexity of the problem. For our problem, 1,000 will be enough, but for more complex problems you might want to consider higher numbers, such as 10,000. In short, the Q-learning process is the part where we train our AI, and it's called Q-learning because it's the process during which the Q-values are learned. After that, I'll explain what happens in the inference part (pure predictions), which comes, as always, after the training. The full Q-learning process starts with training mode.
Training mode
Initialization (First iteration):
For all pairs of states s and actions a, the Q-values are initialized to 0.
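To make the initialization step concrete, here is a minimal Python sketch using NumPy. It assumes the states and actions can each be indexed by integers, so that the Q-values fit in a two-dimensional table. The sizes n_states and n_actions and the helper init_q_values are hypothetical names chosen only for illustration, not values or functions taken from our problem.

```python
import numpy as np

# Hypothetical problem size, chosen only for illustration.
n_states = 4
n_actions = 3


def init_q_values(n_states, n_actions):
    """Return a table holding Q(s, a) = 0 for every state-action pair.

    Rows are indexed by state, columns by action (an assumed layout).
    """
    return np.zeros((n_states, n_actions))


Q = init_q_values(n_states, n_actions)

# Every pair (s, a) starts at 0, for example state 2 and action 1:
print(Q[2, 1])
```

Storing the Q-values as a table like this means that the later training iterations only need to overwrite one entry, Q[s, a], at a time.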
Next iterations...