In this chapter, we introduced a new family of RL algorithms that learn from experience gathered by interacting with the environment. Unlike dynamic programming, these methods can learn a value function, and consequently a policy, without relying on a model of the environment.
Initially, we saw that Monte Carlo methods are a simple way to learn by sampling from the environment, but because they need the full trajectory before they can start learning, they are impractical in many real problems. To overcome this drawback, bootstrapping can be combined with Monte Carlo-style sampling, giving rise to so-called temporal difference (TD) learning. Thanks to bootstrapping, these algorithms can learn online (one-step learning) and have lower variance, while still converging to optimal policies. Then, we learned two one-step, tabular, model-free TD methods, namely SARSA and...
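To make the contrast concrete, the one-step TD update described above can be sketched as follows. This is a minimal illustration of the SARSA-style update rule, not a full agent; the function name, the dictionary-based Q-table, and the hyperparameter values are illustrative choices, not something prescribed by the chapter.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step TD (SARSA) update on a tabular action-value function.

    Instead of waiting for the full return (as Monte Carlo does),
    we bootstrap: the target uses the current estimate Q(s', a').
    """
    td_target = r + gamma * Q[(s_next, a_next)]   # bootstrapped target
    td_error = td_target - Q[(s, a)]              # temporal-difference error
    Q[(s, a)] += alpha * td_error                 # move estimate toward target
    return Q

# Example: a single update on an empty table.
Q = defaultdict(float)                 # unseen (state, action) pairs default to 0
sarsa_update(Q, s=0, a=0, r=1.0, s_next=1, a_next=0, alpha=0.1, gamma=0.9)
print(Q[(0, 0)])                       # moved 10% of the way toward the target 1.0
```

Because each update needs only the transition `(s, a, r, s', a')`, learning can happen online after every step, which is exactly what distinguishes TD methods from Monte Carlo ones.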