MC Control Method
The algorithm for the MC control method is given as follows:
- Let total_return(s, a) be the sum of the returns of a state-action pair across several episodes, and let N(s, a) be the number of times that state-action pair is visited across those episodes. Initialize total_return(s, a) and N(s, a) to zero for all state-action pairs, and initialize a random policy π.
- For M number of iterations:
  - Generate an episode using policy π
  - Store all the rewards obtained in the episode in a list called rewards
  - For each step t in the episode:
    If (s_t, a_t) is occurring for the first time in the episode:
    - Compute the return of the state-action pair, R(s_t, a_t) = sum(rewards[t:])
    - Update the total return of the state-action pair as total_return(s_t, a_t) = total_return(s_t, a_t) + R(s_t, a_t)
    - Update the counter as N(s_t, a_t) = N(s_t, a_t) + 1
    - Compute the Q value by just taking the average, that is...
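
The steps listed so far can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the corridor environment in generate_episode, the helper names random_policy and mc_q_estimates, and the seed parameter are all hypothetical and chosen only to make the example self-contained. The sketch assumes that "taking the average" means dividing total_return by N, and it covers only the steps given above, ending at the Q value computation.

```python
import random
from collections import defaultdict


def generate_episode(policy, rng, max_steps=20):
    """Hypothetical toy environment: a corridor with states 0..3.

    The agent starts in state 0. Action 1 moves right, action 0 moves left
    (it cannot go below state 0). Every step gives a reward of -1, and the
    episode ends when state 3 is reached or after max_steps steps.
    """
    state, episode = 0, []
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state = min(state + 1, 3) if action == 1 else max(state - 1, 0)
        episode.append((state, action, -1))
        state = next_state
        if state == 3:
            break
    return episode


def random_policy(state, rng):
    """A random policy: pick either action with equal probability."""
    return rng.choice([0, 1])


def mc_q_estimates(policy, num_iterations, seed=0):
    """First-visit Monte Carlo estimate of Q, following the listed steps."""
    rng = random.Random(seed)
    total_return = defaultdict(float)  # total_return(s, a), initialized to zero
    N = defaultdict(int)               # N(s, a), initialized to zero
    Q = defaultdict(float)

    for _ in range(num_iterations):
        episode = generate_episode(policy, rng)
        rewards = [reward for (_, _, reward) in episode]

        visited = set()
        for t, (s, a, _) in enumerate(episode):
            # Only the first occurrence of (s, a) in this episode counts.
            if (s, a) in visited:
                continue
            visited.add((s, a))

            R = sum(rewards[t:])             # return from step t onward
            total_return[(s, a)] += R        # accumulate the total return
            N[(s, a)] += 1                   # increment the visit counter
            # Assumption: the average is total_return divided by N.
            Q[(s, a)] = total_return[(s, a)] / N[(s, a)]

    return Q, total_return, N
```

Calling mc_q_estimates(random_policy, num_iterations=1000) returns the Q estimates together with the total_return and N tables. Keeping the visited set per episode is what enforces the first-visit condition: a pair that recurs later in the same episode does not contribute a second return.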