Double Q-Learning (DDQN)
In DQN, the target Q-Network selects and evaluates every action, which results in an overestimation of the Q value. To resolve this issue, DDQN [3] proposes to use the Q-Network to choose the action and the target Q-Network to evaluate it.
In DQN, as summarized by Algorithm 9.6.1, the estimate of the Q value in line 10 is:

$$Q_{max} = r_{j+1} + \gamma \max_{a_{j+1}} Q_{target}(s_{j+1}, a_{j+1})$$

$Q_{target}$ both chooses and evaluates the action $a_{j+1}$.
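To see why this leads to overestimation, note that the maximum over noisy estimates is biased upward even when the true Q values are identical. The following minimal sketch (not part of Listing 9.6.1) illustrates this, assuming zero true Q values corrupted by zero-mean Gaussian noise; the number of actions and trials are arbitrary:

import numpy as np

# Illustration only: all true Q values are zero; each estimate adds
# zero-mean Gaussian noise. The max over the noisy estimates is
# biased upward, so the target built from it overestimates the true value.
rng = np.random.default_rng(0)
n_actions, n_trials = 10, 10000
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

print("true max Q value   :", 0.0)
print("mean of max noisy Q:", noisy_q.max(axis=1).mean())  # clearly > 0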
DDQN proposes to change line 10 to:

$$Q_{max} = r_{j+1} + \gamma Q_{target}\left(s_{j+1}, \operatorname{argmax}_{a_{j+1}} Q(s_{j+1}, a_{j+1})\right)$$

The term $\operatorname{argmax}_{a_{j+1}} Q(s_{j+1}, a_{j+1})$ lets $Q$ choose the action. This action is then evaluated by $Q_{target}$.
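As a rough sketch of the difference between the two targets, the snippet below computes both for a single transition; q_next and target_q_next are hypothetical outputs of the Q-Network and the target Q-Network for the next state, and the reward and discount factor are arbitrary:

import numpy as np

gamma = 0.99
reward = 1.0
q_next = np.array([0.2, 1.5, 0.7])         # Q(s_{j+1}, .) from the Q-Network
target_q_next = np.array([0.4, 0.9, 1.1])  # Q_target(s_{j+1}, .) from the target Q-Network

# DQN: the target Q-Network both selects and evaluates the action
dqn_target = reward + gamma * target_q_next.max()

# DDQN: the Q-Network selects the action, the target Q-Network evaluates it
a_max = np.argmax(q_next)
ddqn_target = reward + gamma * target_q_next[a_max]

print("DQN target :", dqn_target)   # evaluates the target network's own max
print("DDQN target:", ddqn_target)  # evaluates the online network's argmax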
In Listing 9.6.1, both DQN and DDQN are implemented. Specifically, for DDQN, the modification to the Q value computation performed by the get_target_q_value() function is highlighted:
# compute Q_max
# use of target Q Network solves the non-stationarity problem
def get_target_q_value(self, next_state):
    # max Q value among next state's actions
    if self.ddqn:
        # DDQN
        # current Q Network selects the action
        # a'_max = argmax_a' Q(s', a')
        action = np.argmax(self.q_model.predict(next_state)[0])
        # target Q Network evaluates the action
        # Q_max = Q_target(s', a'_max)
        q_value = self.target_q_model.predict(next_state)[0][action]
    else:
        # DQN: selection and evaluation of the action
        # are both on the target Q Network
        # Q_max = max_a' Q_target(s', a')
        q_value = np.amax(self.target_q_model.predict(next_state)[0])
    return q_value
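For context, the sketch below shows one way get_target_q_value() might be used when training on a replayed transition. It is not taken from Listing 9.6.1; the replay_step() name, the Keras-style predict()/fit() calls, and the (state, action, reward, next_state, done) arguments are assumptions made for illustration:

# Hypothetical usage sketch, not from Listing 9.6.1
def replay_step(self, state, action, reward, next_state, done):
    # current Q value predictions for the visited state
    q_values = self.q_model.predict(state)
    # bootstrap from the next state unless the episode has ended
    q_max = 0.0 if done else self.get_target_q_value(next_state)
    # replace the Q value of the taken action with the corrected target
    q_values[0][action] = reward + self.gamma * q_max
    # move the Q-Network toward the corrected target
    self.q_model.fit(state, q_values, verbose=0)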