In this chapter, we coded a reinforcement-learning system using Q-learning. We defined our environment, or playing surface, and then examined the dataset containing every possible combination of states, actions, and future states. From that dataset we calculated the value of every state–action pair, storing the results both in a hash table and as a matrix. We then used this matrix of values as the basis of our policy, which selects the move with the highest value.
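The pipeline described above can be condensed into a short sketch. The environment here is a hypothetical one-dimensional playing surface (states 0 through 4, with a goal at state 4) invented for illustration, not the chapter's actual environment; the Q-table starts as a hash table keyed by (state, action) pairs, is updated with the standard Q-learning rule, and is then laid out as a matrix from which the greedy policy reads off the highest-value move:

```python
import random

# Hypothetical 1-D playing surface: states 0..4, goal at state 4.
# Actions: 0 = move left, 1 = move right. Reward 1 on reaching the goal.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# Q-table stored as a hash table of (state, action) -> value.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in range(N_ACTIONS)}
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration
random.seed(0)

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection: mostly exploit, sometimes explore.
        if random.random() < epsilon:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update toward reward plus discounted best future value.
        best_next = max(Q[(next_state, a)] for a in range(N_ACTIONS))
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# The same values laid out as a states-by-actions matrix.
q_matrix = [[Q[(s, a)] for a in range(N_ACTIONS)] for s in range(N_STATES)]

# Greedy policy: pick the action with the highest value in each state.
policy = [max(range(N_ACTIONS), key=lambda a: q_matrix[s][a]) for s in range(N_STATES)]
print(policy)
```

After training, the policy for every non-terminal state should be action 1 (move right), since that is the shortest path to the goal. Storing the values first in a hash table and then as a matrix mirrors the two representations used in the chapter: the hash table is convenient during updates, while the matrix makes the argmax over actions easy to read off.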
In our next chapter, we will expand on Q-learning by adding neural networks to create deep Q-learning networks.