In reinforcement learning, we want the Q-function Q(s,a) to estimate the expected cumulative future reward of taking action a in state s, so that the agent can act by choosing the action with the highest Q value. The Q-function is estimated using Q-learning, which iteratively updates the Q-function via the Bellman equation as follows:

Q(s,a) ← Q(s,a) + α [R + γ max_a' Q(s',a') − Q(s,a)]

Here:
Q(s,a) = Q value for the current state s and action a pair
α = learning rate, which controls the speed of convergence
γ = discounting factor for future rewards
Q(s',a') = Q value for the state-action pair at the resultant state s' after action a was taken at state s
R = immediate reward
max_a' Q(s',a') = estimate of the maximum future reward obtainable from state s'
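
To make the update concrete, suppose α = 0.1, γ = 0.9, the current estimate Q(s,a) = 0.5, the immediate reward R = 1, and max_a' Q(s',a') = 0.8 (illustrative values, not from the original text). The updated estimate is 0.5 + 0.1 × (1 + 0.9 × 0.8 − 0.5) = 0.5 + 0.1 × 1.22 = 0.622, which nudges Q(s,a) toward the observed return.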
In simpler cases, where the state space and action space are discrete, Q-learning is implemented using a Q-table, where rows represent the states and columns represent the actions.
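
As a minimal sketch of this idea, assuming a small discrete environment (the sizes, hyperparameters, and the update_q helper below are illustrative, not from the original text), a Q-table can be held in a NumPy array and updated one transition at a time:

```python
import numpy as np

# Hypothetical sizes for a small discrete environment (illustrative values)
n_states, n_actions = 16, 4

# Q-table: rows represent states, columns represent actions
Q = np.zeros((n_states, n_actions))

alpha = 0.1   # learning rate
gamma = 0.99  # discounting factor

def update_q(s, a, r, s_next):
    """Apply one Bellman update to the Q-table entry for (s, a)."""
    best_future = np.max(Q[s_next])  # max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * best_future - Q[s, a])

# Example transition: taking action 2 in state 5 yields reward 1.0
# and lands the agent in state 6 (all values illustrative)
update_q(s=5, a=2, r=1.0, s_next=6)
```

In practice, each transition (s, a, r, s') comes from the agent interacting with the environment, typically choosing actions with an ε-greedy policy to balance exploration and exploitation.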
The steps involved in Q-learning are as follows:
- Initialize Q-table...