Double Q-Learning (DDQN)
In DQN, the target Q-Network selects and evaluates every action, which results in an overestimation of the Q value. To resolve this issue, DDQN [3] proposes using the Q-Network to choose the action and the target Q-Network to evaluate that action.
In DQN as summarized by Algorithm 9.6.1, the estimate of the Q value in line 10 is:
$$Q_{max} = r_{j+1} + \gamma \max_{a_{j+1}} Q_{target}\left(s_{j+1}, a_{j+1}\right)$$
$Q_{target}$ chooses and evaluates the action $a_{j+1}$.
DDQN proposes to change line 10 to:
$$Q_{max} = r_{j+1} + \gamma Q_{target}\left(s_{j+1}, \operatorname*{argmax}_{a_{j+1}} Q\left(s_{j+1}, a_{j+1}\right)\right)$$
The term $\operatorname*{argmax}_{a_{j+1}} Q\left(s_{j+1}, a_{j+1}\right)$ lets $Q$ choose the action. This action is then evaluated by $Q_{target}$.
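To make the difference concrete, here is a small NumPy sketch that computes both targets for a single next state; the Q value arrays, reward, and discount factor are made-up numbers, not values from Listing 9.6.1.

```python
import numpy as np

# hypothetical Q values for the next state s_{j+1}, one entry per action
q_online = np.array([0.8, 1.2, 0.5])   # Q(s_{j+1}, a), online network
q_target = np.array([1.1, 0.9, 0.7])   # Q_target(s_{j+1}, a), target network
reward, gamma = 1.0, 0.95

# DQN target: Q_target both selects and evaluates the action
dqn_target = reward + gamma * np.max(q_target)   # 1.0 + 0.95 * 1.1 = 2.045

# DDQN target: Q selects the action, Q_target evaluates it
a_max = np.argmax(q_online)                      # action 1
ddqn_target = reward + gamma * q_target[a_max]   # 1.0 + 0.95 * 0.9 = 1.855
```

Because the action is chosen by one network and evaluated by the other, a spuriously high estimate in either network is less likely to end up in the target, which is how DDQN reduces the overestimation of the Q value.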
In Listing 9.6.1, both DQN and DDQN are implemented. Specifically, for DDQN, the modification to the Q value computation performed by the get_target_q_value() function is highlighted:
```python
# compute Q_max
# use of target Q Network solves the non-stationarity problem
def get_target_q_value(self, next_state):
    # max Q value among next state's actions
    if self.ddqn:
        # DDQN
        # current Q Network selects the action
        # a'_max = argmax_a' Q(s', a')
        ...
```
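The excerpt above is cut off. What follows is a minimal sketch of how the rest of the method might look, based on the description of line 10 and the DDQN change. It is written as a method of the agent class in Listing 9.6.1; the names self.q_model, self.target_q_model, self.ddqn, and self.gamma, as well as passing reward in as an extra argument, are assumptions made for a self-contained example rather than the book's exact code.

```python
import numpy as np

# sketch: assumed to live inside the agent class of Listing 9.6.1
def get_target_q_value(self, next_state, reward):
    # reward as a parameter is an assumption; the excerpt's signature
    # only shows next_state
    if self.ddqn:
        # DDQN: the current Q Network selects the action
        # a'_max = argmax_a' Q(s', a')
        action = np.argmax(self.q_model.predict(next_state)[0])
        # the target Q Network evaluates that action
        # Q_max = Q_target(s', a'_max)
        q_value = self.target_q_model.predict(next_state)[0][action]
    else:
        # DQN: the target Q Network both selects and evaluates
        # Q_max = max_a' Q_target(s', a')
        q_value = np.amax(self.target_q_model.predict(next_state)[0])

    # Q_max = r + gamma * Q_max, as in the modified line 10
    return reward + self.gamma * q_value
```

The only difference between the two branches is which network performs the argmax: DDQN selects with the online Q-Network and evaluates with the target Q-Network, while DQN does both with the target Q-Network.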