On-Policy MC Control – Epsilon-Greedy
The algorithm for on-policy MC control with an epsilon-greedy policy is given as follows:
- Let total_return(s, a) be the sum of the returns of a state-action pair across several episodes and N(s, a) be the number of times the state-action pair is visited across several episodes. Initialize total_return(s, a) and N(s, a) to zero for all state-action pairs and initialize a random policy π.
- For M iterations:
  - Generate an episode using the policy π.
  - Store all the rewards obtained in the episode in a list called rewards.
  - For each step t in the episode:
    - If (st, at) occurs for the first time in the episode:
      - Compute the return of the state-action pair, R(st, at) = sum(rewards[t:]).
      - Update the total return of the state-action pair as total_return(st, at) = total_return(st, at) + R(st, at).
      - Update the counter as N(st, at) = N(st, at) + 1. ...
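The steps above can be sketched in Python. Everything environment-specific here is an illustrative assumption, not from the text: a tiny three-state chain MDP with a -0.1 step cost and a +1 terminal reward (the step cost makes shorter paths preferable, since the return is an undiscounted sum), the `step` and `mc_control` function names, and the particular hyperparameter values.

```python
import random
from collections import defaultdict

# Toy chain MDP (an assumption for illustration, not from the text):
# states 0..2, actions 0 (left) and 1 (right). Moving right from state 2
# ends the episode with reward +1; every other move costs -0.1.
N_STATES = 3
ACTIONS = [0, 1]

def step(state, action):
    """Return (next_state, reward, done) for the toy chain."""
    if action == 1:  # right
        if state == N_STATES - 1:
            return state, 1.0, True
        return state + 1, -0.1, False
    return max(state - 1, 0), -0.1, False  # left (reflects at state 0)

def epsilon_greedy(Q, state, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def mc_control(num_episodes=2000, epsilon=0.2, max_steps=50):
    total_return = defaultdict(float)  # total_return(s, a)
    N = defaultdict(int)               # N(s, a)
    Q = defaultdict(float)             # Q(s, a) = total_return(s, a) / N(s, a)

    for _ in range(num_episodes):
        # Generate an episode using the current epsilon-greedy policy
        # and store all the rewards in the list called rewards.
        episode, rewards = [], []
        state, done, t = 0, False, 0
        while not done and t < max_steps:
            action = epsilon_greedy(Q, state, epsilon)
            next_state, reward, done = step(state, action)
            episode.append((state, action))
            rewards.append(reward)
            state = next_state
            t += 1

        # First-visit updates: R(st, at) = sum(rewards[t:]).
        seen = set()
        for t, (s, a) in enumerate(episode):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            R = sum(rewards[t:])
            total_return[(s, a)] += R
            N[(s, a)] += 1
            Q[(s, a)] = total_return[(s, a)] / N[(s, a)]
    return Q

random.seed(0)
Q = mc_control()
```

Because the policy both generates the episodes and is improved from the resulting Q values, this is on-policy control; the epsilon term keeps every state-action pair reachable, which the averaging of first-visit returns relies on.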