In this section, we'll describe our first algorithm that does not require full knowledge of the environment (that is, a model-free algorithm): the Monte Carlo (MC) method. Here, the agent uses its own experience to find the optimal policy.
Monte Carlo methods
Policy evaluation
In the Dynamic programming section, we described how to estimate the value function, V(s), given a policy, π (planning). MC does this by playing full episodes and then averaging the cumulative returns for each state over the different episodes.
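Before walking through the steps one by one, here is a minimal sketch of first-visit MC policy evaluation in Python. The toy random-walk environment, the run_episode helper, and the constants are illustrative assumptions standing in for a real environment; they are not part of the method itself.

```python
# A minimal sketch of first-visit MC policy evaluation.
# The environment is a hypothetical 5-state random walk: states 0..4,
# where 0 and 4 are terminal and reaching state 4 pays a reward of 1.
import random
from collections import defaultdict

GAMMA = 1.0  # discount factor (assumed; episodic task, so 1.0 is safe)

def run_episode(policy):
    """Play one full episode and return a list of (state, reward) pairs."""
    state = 2  # start in the middle
    trajectory = []
    while state not in (0, 4):
        action = policy(state)          # -1 (left) or +1 (right)
        next_state = state + action
        reward = 1.0 if next_state == 4 else 0.0
        trajectory.append((state, reward))
        state = next_state
    return trajectory

def mc_policy_evaluation(policy, num_episodes=10_000):
    returns = defaultdict(list)  # an empty list, returns(s), for each state s
    V = defaultdict(float)       # the V(s) table, initialized to 0
    for _ in range(num_episodes):
        trajectory = run_episode(policy)
        G = 0.0
        # Walk the episode backwards, accumulating the cumulative return
        for t in reversed(range(len(trajectory))):
            state, reward = trajectory[t]
            G = reward + GAMMA * G
            # First-visit: record G only at the state's first occurrence
            if state not in [s for s, _ in trajectory[:t]]:
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])
    return V

if __name__ == "__main__":
    random_policy = lambda s: random.choice((-1, 1))
    V = mc_policy_evaluation(random_policy)
    print({s: round(v, 3) for s, v in sorted(V.items())})
```

For this particular walk, the exact values under the random policy are V(s) = s/4, so the printed estimates should hover near 0.25, 0.5, and 0.75 for states 1, 2, and 3. The steps below unpack exactly what this sketch is doing.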
Let's see how it works in the following steps:
- Input the policy, π.
- Initialize the following:
- The V(s) table with some initial value for all states
- An empty list, returns(s), for each state, s