The next step in reducing the variance is making our baseline state-dependent (which, intuitively, is a good idea, as different states could have very different baselines). Indeed, to judge how suitable a particular action is in some state, we use the discounted total reward of the action. However, the total reward itself can be represented as the value of the state plus the advantage of the action: Q(s, a) = V(s) + A(s, a). We saw this in Chapter 7, DQN Extensions, when we discussed DQN modifications, particularly dueling DQN.
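To make the decomposition Q(s, a) = V(s) + A(s, a) concrete, here is a minimal sketch for a single state with three actions. The Q-values and policy probabilities are made-up numbers, not taken from any experiment, and V(s) is assumed to be the expected Q-value under the policy, which is the usual definition.

```python
# Hypothetical numbers for one state with three actions (illustrative only).
q_values = [10.0, 12.0, 8.0]   # Q(s, a) for each action
policy = [0.5, 0.25, 0.25]     # pi(a | s), sums to 1

# V(s): the expected Q-value when actions are sampled from the policy.
state_value = sum(p * q for p, q in zip(policy, q_values))

# A(s, a) = Q(s, a) - V(s): how much better each action is than the
# state's average.
advantages = [q - state_value for q in q_values]

# Under this definition of V(s), the policy-weighted advantage is zero.
mean_advantage = sum(p * a for p, a in zip(policy, advantages))

print("V(s) =", state_value)
print("A(s, a) =", advantages)
print("Expected advantage =", mean_advantage)
```

With these numbers, all three Q-values are large and positive, but only the second action has a positive advantage. The advantage separates "this state is good" from "this action is better than usual in this state".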
So, why can't we use V(s) as a baseline? In that case, the scale of our gradient will be just the advantage A(s, a), which shows how much better the taken action is than the average value of the state. In fact, we can do this, and it is a very good idea for improving the PG method. The only problem is that we don't know the value V(s) of the state that we need to subtract from the discounted total reward Q(s, a). To solve this,...