Q-Learning is a model-free method for finding the optimal policy, that is, the policy that maximizes the cumulative reward of an agent. During initial gameplay, the agent learns a Q value for each (state, action) pair by following the exploration strategy, as explained in previous sections. Once the Q values are learned, the optimal policy is to select the action with the largest Q value in every state, known as the exploitation strategy. Since pure exploitation can get stuck in locally optimal solutions, we continue to explore with some probability, controlled by an exploration_rate parameter.
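As a minimal sketch of this explore/exploit trade-off, an epsilon-greedy selection rule in Python might look like the following; the names q_table and select_action are illustrative, not part of any particular library:

```python
import numpy as np

def select_action(q_table, state, exploration_rate):
    """Hypothetical epsilon-greedy helper: explore with probability
    exploration_rate, otherwise exploit the largest Q value."""
    n_actions = q_table.shape[1]
    if np.random.random() < exploration_rate:
        return np.random.randint(n_actions)   # explore: random action
    return int(np.argmax(q_table[state]))     # exploit: greedy action
```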
The Q-Learning algorithm is as follows:
initialize Q(shape=[#s, #a]) to random values or zeroes
Repeat (for each episode)
    observe current state s
    Repeat (for each step of the episode)
        select an action a (apply explore or exploit strategy)
        take action a, observe reward r and next state s_next
        update Q(s, a) = Q(s, a) + alpha * (r + gamma * max(Q(s_next, a')) - Q(s, a))
        set s = s_next
    until the episode terminates
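Below is a minimal, self-contained sketch of this loop in Python. The toy chain environment, the step() helper, and the hyperparameter values (alpha, gamma, exploration_rate) are illustrative assumptions, not part of the original text:

```python
import numpy as np

n_states, n_actions = 5, 2
alpha = 0.1             # learning rate
gamma = 0.9             # discount factor
exploration_rate = 0.1  # probability of exploring instead of exploiting

def step(state, action):
    """Toy chain environment (an assumption for this sketch): action 1
    moves right, action 0 moves left; reaching the rightmost state
    yields reward 1 and ends the episode."""
    s_next = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

q_table = np.zeros((n_states, n_actions))  # initialize Q(shape=[#s, #a]) to zeroes

for episode in range(500):
    s, done = 0, False
    while not done:
        if np.random.random() < exploration_rate:
            a = np.random.randint(n_actions)      # explore: random action
        else:                                     # exploit: best known action,
            best = np.flatnonzero(q_table[s] == q_table[s].max())
            a = int(np.random.choice(best))       # breaking ties at random
        s_next, r, done = step(s, a)
        # Q-Learning update: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')
        q_table[s, a] += alpha * (r + gamma * q_table[s_next].max() - q_table[s, a])
        s = s_next

print("greedy policy per state:", np.argmax(q_table, axis=1))
```

Note that even after the Q values start to converge, the small exploration_rate keeps the agent occasionally sampling non-greedy actions, which is what prevents the table from settling on a locally optimal policy.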