Let's go into the details of configuring a reinforcement learning agent using Q-learning. The goal of Q-learning is to build a state–action matrix that assigns a value to every state–action combination: when the agent is in any given state, these values determine which action it should take to obtain the maximum value. We will enable the computation of the best policy for our agent by creating a value matrix that provides a calculated value for every possible move:
- To start, we need a set of state–action pairs, each initialized to a value of 0. As a best practice, we will use hashing here; a hash is a more efficient alternative to large lists and scales better to more complex environments. We will load the hash library and then use a for loop to populate the hash...
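The initialization step above can be sketched as follows. This is a minimal illustration in Python, where a plain dict plays the role of the hash structure; the 3×3 grid of states and the four action names are assumptions for demonstration, not part of the original text:

```python
# Illustrative state space: cells of an assumed 3x3 grid environment.
states = [(row, col) for row in range(3) for col in range(3)]
# Illustrative action set (names are assumptions).
actions = ["up", "down", "left", "right"]

# The dict acts as the hash: each (state, action) key maps to its Q-value.
Q = {}
for state in states:
    for action in actions:
        Q[(state, action)] = 0  # every state-action pair starts at 0

print(len(Q))  # 9 states x 4 actions = 36 entries
```

Looking up `Q[(state, action)]` is then a constant-time operation regardless of how many state–action pairs exist, which is why a hash scales better than scanning a large list as the environment grows.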