Dynamic programming (DP) algorithms are effective for solving reinforcement learning (RL) problems, but they rest on two strong assumptions. The first is that the model of the environment must be known, and the second is that the state space must be small enough not to suffer from the curse of dimensionality.
In this chapter, we'll develop a class of algorithms that removes the first assumption and, in addition, is not affected by the curse of dimensionality that limits DP algorithms. These algorithms learn directly from experience gathered by interacting with the environment, estimating the value function by averaging many sampled returns instead of computing the expectation of state values from a model, as DP algorithms do. In this new setting, we'll talk about experience as...
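To make the idea concrete, here is a minimal sketch of estimating a value function from sampled returns, assuming a Gymnasium-style environment with discrete states and a given policy; the names `env`, `policy`, and `mc_prediction` are illustrative, not part of the chapter's code:

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes=1000, gamma=0.99):
    """Estimate V(s) by averaging the returns observed after the first
    visit to s in each episode, with no access to the environment's model."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one full episode of experience by interacting
        # with the environment under the given policy.
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, reward))
            done = terminated or truncated
            state = next_state

        # Walk the episode backwards, accumulating the return G and
        # updating the estimate only at the first visit to each state.
        states_in_episode = [s for s, _ in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            if state not in states_in_episode[:t]:  # first visit to this state
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```

Note that the update uses only sampled transitions and rewards; nowhere does the code query transition probabilities, which is precisely what frees this family of algorithms from the first assumption of DP.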