MC Prediction – the Q Function
The algorithm for MC prediction of the Q function is given as follows:
- Let total_return(s, a) be the sum of the returns of a state-action pair across several episodes, and let N(s, a) be the number of times that state-action pair is visited across those episodes. Initialize total_return(s, a) and N(s, a) to zero for all state-action pairs. The policy is given as input.
- For M iterations:
  - Generate an episode using the policy
  - Store all the rewards obtained in the episode in a list called rewards
  - For each step t in the episode:
    - Compute the return for the state-action pair, R(st, at) = sum(rewards[t:])
    - Update the total return of the state-action pair, total_return(st, at) = total_return(st, at) + R(st, at)
    - Update the counter, N(st, at) = N(st, at) + 1
- Compute the Q function (Q value) of each state-action pair by taking the average of its returns, that is: Q(s, a) = total_return(s, a) / N(s, a)
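
The steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function name `mc_predict_q`, the `generate_episode` callback, and the toy chain environment are all hypothetical and were chosen only to keep the example self-contained. Returns are undiscounted, matching sum(rewards[t:]), and every occurrence of a state-action pair in an episode is counted.

```python
from collections import defaultdict


def mc_predict_q(generate_episode, policy, num_iterations):
    """Monte Carlo estimate of Q(s, a) for a fixed policy.

    generate_episode (hypothetical callback) takes the policy and returns
    a list of (state, action, reward) tuples for one complete episode.
    """
    total_return = defaultdict(float)  # sum of returns per (state, action)
    N = defaultdict(int)               # visit count per (state, action)

    for _ in range(num_iterations):
        episode = generate_episode(policy)
        rewards = [reward for (_, _, reward) in episode]

        for t, (state, action, _) in enumerate(episode):
            R = sum(rewards[t:])  # undiscounted return from step t onward
            total_return[(state, action)] += R
            N[(state, action)] += 1

    # Q(s, a) = total_return(s, a) / N(s, a) for every visited pair
    return {sa: total_return[sa] / N[sa] for sa in N}


def toy_generate_episode(policy):
    """Hypothetical deterministic chain: states 0, 1, 2, then terminal.

    Every step yields a reward of 1.0 and moves one state forward.
    """
    episode, state = [], 0
    while state < 3:
        action = policy(state)
        episode.append((state, action, 1.0))
        state += 1
    return episode


# A trivial policy that always picks the action "right" (an assumption).
Q = mc_predict_q(toy_generate_episode, lambda state: "right", num_iterations=5)
for state_action, value in sorted(Q.items()):
    print(state_action, value)
```

Because the toy environment is deterministic, every episode is identical, so the averaged Q value of each pair equals the number of remaining steps from that state. In a stochastic environment the estimates would instead converge toward the expected return as the number of iterations grows.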