Implementing Q-Learning
Q-Learning is a model-free method for finding an optimal policy that maximizes the reward collected by an agent. During initial gameplay, the agent learns a Q-value for every (state, action) pair by trying out actions, known as the exploration strategy, as explained in the previous sections. Once the Q-values are learned, the optimal policy is to select the action with the largest Q-value in every state, known as the exploitation strategy. Because pure exploitation can get stuck in locally optimal solutions, we keep applying the exploration policy during learning, controlled by an exploration_rate parameter.
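To make the explore-or-exploit choice concrete, here is a minimal sketch of epsilon-greedy action selection, assuming the Q-Table is stored as a NumPy array with one row per state and one column per action; the function name select_action and the use of exploration_rate as the random-action probability are illustrative choices, not fixed by the text.

import numpy as np

def select_action(q_table, state, exploration_rate):
    # Explore: with probability exploration_rate, pick a random action
    if np.random.rand() < exploration_rate:
        return np.random.randint(q_table.shape[1])
    # Exploit: otherwise pick the action with the largest Q-value for this state
    return int(np.argmax(q_table[state]))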
The Q-Learning algorithm is as follows:
initialize Q(shape=[#s, #a]) to random values or zeroes
Repeat (for each episode)
    observe current state s
    Repeat
        select an action a (apply explore or exploit strategy)
        observe state s_next as a result of action a
        update the Q-Table using the Bellman equation
        set current state s = s_next
    until the episode ends or...
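The pseudocode above maps directly onto a short tabular implementation. The following is a minimal sketch, assuming a Gym-style environment object whose reset() returns an integer state and whose step(a) returns (s_next, reward, done, info); the function name q_learning and the hyperparameter values (learning_rate, discount_rate, exploration_rate, n_episodes) are illustrative assumptions rather than values prescribed by the text.

import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=1000,
               learning_rate=0.8, discount_rate=0.95, exploration_rate=0.1):
    # initialize Q(shape=[#s, #a]) to zeroes
    q_table = np.zeros((n_states, n_actions))
    for episode in range(n_episodes):
        s = env.reset()                      # observe current state s
        done = False
        while not done:                      # repeat until the episode ends
            # select an action a (explore or exploit strategy)
            if np.random.rand() < exploration_rate:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(q_table[s]))
            # observe state s_next and the reward as a result of action a
            s_next, reward, done, _ = env.step(a)
            # update the Q-Table using the Bellman equation:
            # Q(s, a) <- Q(s, a) + lr * (reward + gamma * max_a' Q(s_next, a') - Q(s, a))
            q_table[s, a] += learning_rate * (
                reward + discount_rate * np.max(q_table[s_next]) - q_table[s, a])
            s = s_next                       # set current state s = s_next
    return q_table

In practice, exploration_rate is often decayed across episodes so that the agent explores heavily early on and shifts toward exploitation as the Q-values become reliable.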