So far, we have represented the value function as a lookup table in both the MC and TD methods. The TD method improves on the MC method by updating the Q-function on the fly during an episode. However, it still does not scale well to problems with many states and/or actions: learning a separate value for every individual state-action pair becomes prohibitively slow.
This chapter will focus on function approximation, which can overcome the scaling issues of the TD method. We will begin by setting up the Mountain Car environment as our playground. After developing a linear function estimator, we will incorporate it into the Q-learning and SARSA algorithms. We will then improve the Q-learning algorithm using experience replay, and experiment with using...
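To make the idea concrete before the detailed recipes, the following is a minimal sketch (not the chapter's final implementation) of a linear Q-function estimator for Mountain Car. It assumes the OpenAI Gym package is installed; the class name LinearQEstimator and its learning-rate parameter are illustrative choices. Instead of storing one table entry per state-action pair, it keeps one weight vector per action and computes Q-values as dot products with the state features.

```python
import numpy as np
import gym


class LinearQEstimator:
    """Approximates Q(s, a) as a dot product of state features and per-action weights."""

    def __init__(self, n_features, n_actions, lr=0.01):
        self.lr = lr
        # One weight vector per action instead of one table entry per (s, a) pair
        self.w = np.zeros((n_actions, n_features))

    def predict(self, state):
        # Q-values for all actions in the given state
        return self.w @ state

    def update(self, state, action, target):
        # Gradient step that moves Q(state, action) toward the TD target
        error = target - self.w[action] @ state
        self.w[action] += self.lr * error * state


# Use the environment only to read the state and action dimensions here
env = gym.make('MountainCar-v0')
n_features = env.observation_space.shape[0]  # 2: position and velocity
n_actions = env.action_space.n               # 3: push left, no push, push right

estimator = LinearQEstimator(n_features, n_actions)
state = np.array([-0.5, 0.0])                # example (position, velocity) observation
print(estimator.predict(state))              # one Q-value per action, no lookup table needed
```

Because the weights generalize across nearby states, the estimator can produce a Q-value for any continuous observation it has never seen, which is exactly what the lookup-table approach cannot do.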