Dueling DQN to play Cartpole
In this section, we will look at a modification of the original DQN network architecture, called the Dueling DQN. It explicitly separates the representation of state values and (state-dependent) action advantages. The dueling architecture consists of two streams that represent the value and advantage functions while sharing a common convolutional feature-learning module.
The two streams are combined via an aggregating layer to produce an estimate of the state-action value function Q, as shown in the following diagram:
A single stream Q network (top) and the dueling Q network (bottom).
The dueling network has two streams to separately estimate the (scalar) state value (referred to as V(...)) and the advantages (referred to as A(...)) for each action; the green output module implements the following equation to combine them. Both networks output Q values for each action.
A naive combination, Q(s, a) = V(s) + A(s, a), is unidentifiable: adding a constant to V and subtracting the same constant from A leaves Q unchanged. Instead of defining Q that way, we will be using the following simple equation, which subtracts the mean advantage so that the two streams are identifiable:

Q(s, a) = V(s) + (A(s, a) - 1/|A| * Σ_a' A(s, a'))
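The aggregation above can be sketched as a forward pass. The following is a minimal numpy sketch, not a trained model: the layer sizes (4 state inputs and 2 actions, matching CartPole) and the randomly initialised weights are illustrative assumptions, standing in for the shared feature module and the two streams.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes for CartPole: 4 state inputs, 2 actions.
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 2

# Randomly initialised weights stand in for a trained network.
W_shared = rng.normal(size=(STATE_DIM, HIDDEN))
W_value = rng.normal(size=(HIDDEN, 1))        # value stream -> scalar V(s)
W_adv = rng.normal(size=(HIDDEN, N_ACTIONS))  # advantage stream -> A(s, a) per action

def dueling_q(state):
    """Combine the value and advantage streams into Q values."""
    h = np.maximum(state @ W_shared, 0.0)      # shared feature layer (ReLU)
    v = h @ W_value                            # shape (1,)
    a = h @ W_adv                              # shape (N_ACTIONS,)
    # Aggregating layer: Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))
    return (v + (a - a.mean())).ravel()

q = dueling_q(rng.normal(size=STATE_DIM))
print(q.shape)  # one Q value per action: (2,)
```

Note that because the mean advantage is subtracted, the average of the Q values over actions equals V(s), which is exactly what makes the decomposition identifiable.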