Dueling DQN to play CartPole
In this section, we will look at a modification of the original DQN network, called the dueling DQN, whose network architecture explicitly separates the representation of state values and (state-dependent) action advantages. The dueling architecture consists of two streams that represent the value and advantage functions while sharing a common convolutional feature-learning module.
The two streams are combined via an aggregating layer to produce an estimate of the state-action value function Q, as shown in the following diagram:
![](https://static.packt-cdn.com/products/9781788621755/graphics/995fbe60-e65c-4bd3-823a-54792a51fbe3.png)
A single stream Q network (top) and the dueling Q network (bottom).
The dueling network has two streams that separately estimate the (scalar) state value, referred to as V(s), and the advantage of each action, referred to as A(s, a); the green output module implements the following equation to combine them. Both networks output Q values for each action.
Rather than simply summing the two streams to define Q, we will be using the following simple equation:
![](https://static.packt-cdn.com/products/9781788621755/graphics/2578ac63-e30c-49c1-923a-8a5d35bed233.png)
A term subtracting the mean of the advantages over all actions forces the advantage estimates to have zero mean. This makes V and A identifiable: V learns the value of the state itself, while A captures the relative importance of each action in that state.
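To make this concrete, the following is a minimal PyTorch sketch of a dueling Q-network for CartPole. The class name, layer sizes, and the choice of PyTorch are illustrative assumptions, not the book's exact code; since CartPole observations are 4-dimensional vectors rather than images, fully connected layers stand in for the convolutional feature module described above:

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    # Illustrative sketch: names and layer sizes are assumptions.
    def __init__(self, state_dim=4, n_actions=2, hidden=128):
        super().__init__()
        # Shared feature-learning module (dense layers instead of
        # convolutions, since CartPole states are 4-dimensional vectors)
        self.features = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
        )
        # Stream 1: scalar state value V(s)
        self.value = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        # Stream 2: advantages A(s, a), one per action
        self.advantage = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        x = self.features(state)
        v = self.value(x)       # shape: (batch, 1)
        a = self.advantage(x)   # shape: (batch, n_actions)
        # Aggregating layer: Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))
        return v + a - a.mean(dim=1, keepdim=True)

# Example usage: Q-values for a batch of two CartPole states
net = DuelingDQN()
q_values = net(torch.randn(2, 4))  # shape: (2, 2), one Q value per action
```

The subtraction of `a.mean(dim=1, keepdim=True)` inside `forward` plays the role of the aggregating layer in the equation above; without it, adding a constant to V and subtracting the same constant from A would leave Q unchanged, so the two streams would not be identifiable.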