On-Policy MC Control – Exploring starts
The algorithm for on-policy MC control with the exploring starts method is as follows:
- Let total_return(s, a) be the sum of the returns of a state-action pair across several episodes, and let N(s, a) be the number of times that state-action pair is visited across those episodes. Initialize total_return(s, a) and N(s, a) to zero for all state-action pairs, and initialize a random policy π.
- For M number of iterations:
- Select the initial state s0 and initial action a0 randomly such that every state-action pair has a probability greater than 0 of being selected
- Generate an episode from the selected initial state s0 and action a0 using policy π
- Store all the rewards obtained in the episode in a list called rewards
- For each step t in the episode:
If (st, at) occurs for the first time in the episode:
- Compute the return of a state-action pair, R(st, at) = sum(rewards[t:...
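The steps listed so far can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation. The toy environment (states 0 to 2, a QUIT action that ends the episode with no reward, and an ADVANCE action that pays 1 on reaching the goal) and the names step, generate_episode, and mc_control_exploring_starts are hypothetical and chosen only for this example. The sketch also assumes an undiscounted return that sums the rewards from step t to the end of the episode, estimates Q(s, a) as total_return(s, a) / N(s, a), and improves the policy greedily, which is the usual way to complete a Monte Carlo exploring-starts loop.

```python
import random
from collections import defaultdict

# Hypothetical toy environment (not from the text): states 0, 1, 2 on a line.
QUIT, ADVANCE = 0, 1
STATES = [0, 1, 2]
ACTIONS = [QUIT, ADVANCE]
GOAL = 3  # terminal state reached by advancing from state 2


def step(state, action):
    """Assumed dynamics: return (next_state, reward, done)."""
    if action == QUIT:
        return state, 0.0, True  # quitting ends the episode with no reward
    next_state = state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def generate_episode(s0, a0, policy):
    """Start from (s0, a0), then follow the policy until the episode ends."""
    pairs, rewards = [], []
    state, action = s0, a0
    while True:
        next_state, reward, done = step(state, action)
        pairs.append((state, action))
        rewards.append(reward)
        if done:
            return pairs, rewards
        state, action = next_state, policy[next_state]


def mc_control_exploring_starts(num_iterations, seed=0):
    rng = random.Random(seed)
    total_return = defaultdict(float)  # sum of returns per (s, a)
    N = defaultdict(int)               # visit count per (s, a)
    Q = defaultdict(float)             # assumed estimate: total_return / N
    policy = {s: rng.choice(ACTIONS) for s in STATES}  # random initial policy

    for _ in range(num_iterations):
        # Exploring start: every (s, a) pair has nonzero selection probability.
        s0, a0 = rng.choice(STATES), rng.choice(ACTIONS)
        pairs, rewards = generate_episode(s0, a0, policy)

        seen = set()
        for t, (s, a) in enumerate(pairs):
            if (s, a) in seen:  # first-visit check
                continue
            seen.add((s, a))
            R = sum(rewards[t:])  # assumed undiscounted return from step t
            total_return[(s, a)] += R
            N[(s, a)] += 1
            Q[(s, a)] = total_return[(s, a)] / N[(s, a)]

        # Assumed improvement step: act greedily in the states just visited.
        for s in {s for s, _ in pairs}:
            policy[s] = max(ACTIONS, key=lambda a: Q[(s, a)])

    return policy, dict(Q)
```

Calling mc_control_exploring_starts(num_iterations=2000) on this toy problem should settle on ADVANCE in every state, because quitting always yields a return of 0 while advancing to the goal yields 1.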