In this chapter, we discussed temporal difference learning, the third thread of RL, and how it led to the development of TD(0) and Q-learning. We did that by first exploring the temporal credit assignment problem and how it differs from the credit assignment problem. From there, we learned how TD learning works and how TD(0), or one-step TD, extends to Q-learning.
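As a quick reminder of that relationship, here is a minimal sketch (not the chapter's exact code) placing the one-step TD update for state values next to the Q-learning update for action values; the helper names, and the `alpha` and `gamma` defaults, are illustrative assumptions.

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One-step TD (TD(0)): nudge V(s) toward the bootstrapped target r + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q-learning: the same one-step idea applied to action values,
    bootstrapping from the greedy (max) action in the next state."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```

The only real change in moving from TD(0) to Q-learning is that the update targets action values Q(s, a) and bootstraps from the maximum over next actions rather than from V(s').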
After that, we again played with the FrozenLake environment to see how the new algorithm compared with our past efforts. Using model-free, off-policy Q-learning then allowed us to tackle the more difficult Taxi environment. This is where we learned how to tune hyperparameters and, finally, looked at the difference between off- and on-policy learning. In the next chapter, we continue where we left off with on- versus off-policy learning as we explore SARSA.
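To make that off- versus on-policy distinction concrete before the next chapter, the sketch below (a simplification under assumed helper names, not the book's code) contrasts the two bootstrap targets: Q-learning evaluates the greedy action in the next state regardless of what the exploring agent actually does, while SARSA uses the action the behavior policy actually takes next.

```python
import numpy as np

def q_learning_target(Q, r, s_next, gamma=0.99):
    """Off-policy target: bootstrap from the best action in s',
    even if the agent explores and takes a different action."""
    return r + gamma * np.max(Q[s_next])

def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    """On-policy target: bootstrap from the action a' the current
    (for example, epsilon-greedy) policy actually selects in s'."""
    return r + gamma * Q[s_next, a_next]
```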