One of the most well-known reinforcement learning techniques, and the one we will be implementing in our example, is Q-learning.
Q-learning can be used to find an optimal action for any given state in a finite Markov decision process. It works by learning the Q-function, Q(s, a), which represents the maximum discounted future reward we can obtain by performing action a in state s and then acting optimally from there on.
Once we know the Q-function, the optimal action a in state s is the one with the highest Q-value. We can then define a policy π(s) that gives us the optimal action in any state, expressed as follows:
$$\pi(s) = \operatorname*{argmax}_{a} Q(s, a)$$
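As a minimal sketch of what this policy looks like in code, assume the Q-function is stored as a table, here a NumPy array indexed by state and action (the array values below are made up for illustration; the original text does not fix a representation). The policy is then just a row-wise argmax:

```python
import numpy as np

# Hypothetical Q-table for 3 states and 2 actions; the values are illustrative.
Q = np.array([[0.5, 1.2],
              [0.9, 0.1],
              [0.0, 2.0]])

def policy(state):
    """pi(s) = argmax_a Q(s, a): pick the action with the highest Q-value."""
    return int(np.argmax(Q[state]))

print(policy(0))  # -> 1, since Q[0, 1] = 1.2 is the largest value in row 0
```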
We can define the Q-function for a transition $(s_t, a_t, r_t, s_{t+1})$ in terms of the Q-function at the next transition $(s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2})$, similar to what we did with the total discounted future reward. This recursive relationship is known as the Bellman equation for Q-learning.
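In its standard form, with γ as the discount factor, the Bellman equation says that the value of a state-action pair is the immediate reward plus the discounted value of the best action available in the next state:

$$Q(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$

In practice, tabular Q-learning turns this equation into an iterative update rule. The following is a minimal sketch, assuming a small environment with discrete states and actions exposed through a hypothetical `env` object with `reset()` and `step(action)` methods (these names mirror the common Gym convention and are assumptions, not part of the original text):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning based on the Bellman equation above."""
    Q = np.zeros((n_states, n_actions))  # Q-table: one value per (state, action)
    for _ in range(episodes):
        state = env.reset()       # assumed to return an integer state index
        done = False
        while not done:
            # Epsilon-greedy exploration: mostly follow the current policy,
            # occasionally try a random action.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)  # assumed interface
            # Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```

The `(not done)` factor zeroes out the future term at terminal states, which is the usual convention for episodic tasks; the learning rate `alpha` controls how far each update moves the stored Q-value toward the Bellman target.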