Twin Delayed DDPG
The algorithm for Twin Delayed DDPG (TD3) is given as follows:
- Initialize the two main critic network parameters, $\theta_1$ and $\theta_2$, and the main actor network parameter $\phi$
- Initialize the two target critic network parameters, $\theta'_1$ and $\theta'_2$, by copying the main critic network parameters $\theta_1$ and $\theta_2$, respectively
- Initialize the target actor network parameter $\phi'$ by copying the main actor network parameter $\phi$
- Initialize the replay buffer
- For each of the N episodes, repeat the following step:
- For each step in the episode, that is, for $t = 0, \ldots, T-1$:
- Select action $a$ based on the policy $\mu_\phi(s)$ with exploration noise $\epsilon$, that is, $a = \mu_\phi(s) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma)$
- Perform the selected action $a$, move to the next state $s'$, get the reward $r$, and store the transition information in the replay buffer
- Randomly sample a minibatch of K transitions from the replay buffer
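- Select the action $\tilde{a}$ for computing the target value, $\tilde{a} = \mu_{\phi'}(s') + \epsilon$, where $\epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, +c)$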
- Compute the target value of the...
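The initialization, action-selection, and sampling steps listed above can be sketched in plain Python. This is a minimal illustration, not a full TD3 implementation: the linear `actor`, the `init_linear` helper, the `fake_env_step` environment, the dimensions, and the values of `sigma` and `c` are all hypothetical stand-ins chosen for the example. It stops before the target value computation, and it treats $\sigma$ as a standard deviation.

```python
import copy
import random
from collections import deque

random.seed(0)  # fixed seed so the sketch is reproducible

STATE_DIM = 3  # hypothetical size, chosen only for illustration


def init_linear(n_inputs):
    """Small random weights for a toy linear model (stand-in for a network)."""
    return [random.uniform(-0.1, 0.1) for _ in range(n_inputs)]


def actor(phi, state):
    """Toy deterministic policy mu_phi(s): dot product of weights and state."""
    return sum(w * x for w, x in zip(phi, state))


def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


# Main networks: two critics (theta_1, theta_2) and one actor (phi).
# The toy critics would take the state plus a scalar action as input.
theta_1 = init_linear(STATE_DIM + 1)
theta_2 = init_linear(STATE_DIM + 1)
phi = init_linear(STATE_DIM)

# Target networks start as independent copies of the main networks.
theta_1_target = copy.deepcopy(theta_1)
theta_2_target = copy.deepcopy(theta_2)
phi_target = copy.deepcopy(phi)

# Replay buffer with a bounded capacity (capacity is an assumption).
replay_buffer = deque(maxlen=10_000)


def select_action(phi, state, sigma=0.1):
    """a = mu_phi(s) + epsilon, with epsilon ~ N(0, sigma)."""
    return actor(phi, state) + random.gauss(0.0, sigma)


def select_target_action(phi_target, next_state, sigma=0.2, c=0.5):
    """a~ = mu_phi'(s') + epsilon, with epsilon ~ clip(N(0, sigma), -c, +c)."""
    epsilon = clip(random.gauss(0.0, sigma), -c, c)
    return actor(phi_target, next_state) + epsilon


def fake_env_step(state, action):
    """Hypothetical environment: random next state and a toy reward."""
    next_state = [random.uniform(-1.0, 1.0) for _ in state]
    reward = -abs(action)
    return next_state, reward


# Collect a few transitions with the noisy exploration policy.
state = [0.5, -0.2, 0.1]
for _ in range(20):
    action = select_action(phi, state)
    next_state, reward = fake_env_step(state, action)
    replay_buffer.append((state, action, reward, next_state))
    state = next_state

# Sample a minibatch of K transitions and compute smoothed target actions.
K = 8
minibatch = random.sample(list(replay_buffer), K)
target_actions = [
    select_target_action(phi_target, s_next) for (_, _, _, s_next) in minibatch
]
```

Note the two different noise terms: the exploration noise in `select_action` is unclipped and is added to the main actor's output, while the noise in `select_target_action` is clipped to $[-c, +c]$ and is added to the target actor's output.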