Discussing Q-learning
The key difference between policy optimization and Q-learning is that in the latter, we do not optimize the policy directly. Instead, we optimize a value function. What is a value function? We have already learned that RL is all about an agent learning to maximize its overall reward while traversing a trajectory of states and actions. A value function takes the state the agent is currently in and outputs the expected sum of rewards the agent will receive from that state until the end of the current episode.
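In conventional notation, the value of a state s under a policy π can be written as follows; note that the discount factor γ is an assumption on our part, since the text above describes only a plain sum of rewards:

```latex
% State-value function under policy \pi (standard definition).
% The discount factor \gamma \in [0, 1] is an assumption, not from the text.
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r_{t} \,\middle|\, s_{0} = s\right]
```

Here the expectation is over the trajectories the agent experiences when following π, and T marks the end of the episode.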
In Q-learning, we optimize a specific type of value function, known as the action-value function, which depends on both the current state and the action. At a given state S, the action-value function determines the long-term reward (the sum of rewards until the end of the episode) the agent will receive for taking action a. This function is usually written as Q(S, a), and hence is also called the Q-function; its output for a particular state-action pair is accordingly referred to as the Q-value.
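As a minimal sketch of how a Q-function can be stored and improved in the tabular case, consider the example below. The environment size, hyperparameters, and helper names are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Illustrative hyperparameters (assumptions, not from the text).
ALPHA = 0.1    # learning rate
GAMMA = 0.99   # discount factor
EPSILON = 0.1  # exploration rate for epsilon-greedy action selection

N_STATES, N_ACTIONS = 16, 4  # assumed small, discrete environment

# The Q-function as a lookup table: one expected-return estimate
# per (state, action) pair, Q[S, a] ~ Q(S, a).
Q = np.zeros((N_STATES, N_ACTIONS))

def choose_action(state: int) -> int:
    """Epsilon-greedy: usually pick the action with the highest
    Q-value for this state, occasionally explore a random one."""
    if np.random.rand() < EPSILON:
        return np.random.randint(N_ACTIONS)
    return int(np.argmax(Q[state]))

def q_update(state: int, action: int, reward: float,
             next_state: int, done: bool) -> None:
    """One Q-learning step: move Q(S, a) toward the observed reward
    plus the discounted best Q-value of the next state."""
    target = reward if done else reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (target - Q[state, action])
```

Note that no policy is optimized directly here: acting greedily with respect to Q (as `choose_action` mostly does) recovers a policy, which is why improving the Q-function indirectly improves the agent's behavior.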