First-Visit MC Prediction
The algorithm of first-visit MC prediction is given as follows:
- Let total_return(s) be the sum of the return of a state across several episodes and N(s) be the counter, that is, the number of times a state is visited across several episodes. Initialize total_return(s) and N(s) as zero for all the states. The policy is given as input.
- For M number of iterations:
- Generate an episode using the policy
- Store all rewards obtained in the episode in a list called rewards
- For each step t in the episode:
If the state st is occurring for the first time in the episode:
- Compute the return of the state st as R(st) = sum(rewards[t:])
- Update the total return of the state st as total_return(st) = total_return(st) + R(st)
- Update the counter as N(st) = N(st) + 1
- Compute the value of a state by just taking the average, that is: