So far, we have seen how to derive an implicit policy from a value function with the value-based approach. Here, the agent tries to learn the policy directly: as it gathers experience, it updates the policy itself based on what it observes, rather than first updating value estimates.
Value iteration, policy iteration, and Q-learning fall under the value-based approach and are rooted in dynamic programming, while the policy optimization approach relies on policy gradients; combining policy gradients with learned value functions gives rise to actor-critic algorithms.
In the dynamic programming view, there is a set of self-consistent (Bellman) equations that the Q and V values must satisfy. Policy optimization is different: the policy is learned directly rather than being derived from the value function.
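As a rough sketch of that contrast (the notation here is assumed rather than taken from earlier text: $V^{\pi}$ and $Q^{\pi}$ are the value functions under a policy $\pi$, $\gamma$ the discount factor, $r$ the reward, and $\pi_{\theta}$ a policy with parameters $\theta$), the value-based side enforces the Bellman consistency conditions:

$$
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[\, r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[V^{\pi}(s')\big] \Big], \qquad
Q^{\pi}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[ \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\big[Q^{\pi}(s',a')\big] \Big]
$$

whereas policy optimization directly ascends an objective defined over the policy parameters:

$$
\max_{\theta}\; J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\Big[ \sum_{t} \gamma^{t} r_{t} \Big], \qquad
\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)
$$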
Thus, value-based methods learn the value function, from which we derive an implicit policy, whereas policy-based methods learn the policy itself directly.
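To make "learning the policy directly" concrete, here is a minimal REINFORCE-style sketch on a toy three-armed bandit. It is an illustrative assumption, not code from this book: the reward means, learning rate, and baseline are placeholders, and the policy is a simple tabular softmax over logits.

```python
import numpy as np

# Minimal policy-gradient (REINFORCE-style) sketch on a 3-armed bandit:
# the policy itself (a softmax over the logits `theta`) is updated directly,
# with no value function being learned. All constants are illustrative.

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])  # hidden mean reward of each arm
theta = np.zeros(3)                     # policy parameters (logits)
alpha = 0.1                             # learning rate
baseline = 0.0                          # running average reward, for variance reduction

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    action = rng.choice(3, p=probs)
    reward = rng.normal(true_means[action], 0.1)

    # Gradient of log pi(a) for a softmax policy: one-hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0

    # Gradient-ascent step on the policy parameters themselves
    theta += alpha * (reward - baseline) * grad_log_pi
    baseline += 0.05 * (reward - baseline)

print("learned action probabilities:", softmax(theta).round(3))
```

After training, the probability mass concentrates on the highest-reward arm. The same idea, with the tabular softmax replaced by a neural network and the scalar baseline replaced by a learned value function, is what leads to the actor-critic algorithms mentioned above.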