Distributed Distributional DDPG
The Distributed Distributional Deep Deterministic Policy Gradient (D4PG) algorithm is given as follows:
- Initialize the critic network parameter $\theta$ and the actor network parameter $\phi$
- Initialize the target critic network parameter $\theta'$ and the target actor network parameter $\phi'$ by copying from $\theta$ and $\phi$, respectively
- Initialize the replay buffer
- Launch $L$ actors
- For $N$ episodes, repeat the following step
- For each step in the episode, that is, for $t = 0, \ldots, T-1$:
- Randomly sample a minibatch of $K$ transitions from the replay buffer
- Compute the target value distribution of the critic, that is, $y_i = r_i + \gamma Z_{\theta'}\big(s_{i+1}, \mu_{\phi'}(s_{i+1})\big)$
- Compute the loss of the critic network and calculate the gradient as $\nabla_{\theta} L(\theta) = \frac{1}{K} \sum_i \nabla_{\theta}\, d\big(y_i, Z_{\theta}(s_i, a_i)\big)$, where $d$ denotes the cross-entropy between the target and predicted value distributions
- After computing gradients, update the critic network parameter using gradient descent: $\theta = \theta - \alpha \nabla_{\theta} L(\theta)$
- Compute the gradient of the actor network: $\nabla_{\phi} J(\phi) \approx \frac{1}{K} \sum_i \nabla_{\phi}\, \mu_{\phi}(s_i)\, \mathbb{E}\big[\nabla_{a} Z_{\theta}(s_i, a)\big]\big|_{a = \mu_{\phi}(s_i)}$
- Update the actor network parameter by gradient ascent: $\phi = \phi + \beta \nabla_{\phi} J(\phi)$ (a minimal code sketch of this critic and actor update follows the list)
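To make the learner's inner loop concrete, the following is a minimal PyTorch sketch of one update step, assuming a categorical (C51-style) critic that outputs a probability vector over a fixed support of atoms. The network modules (`actor`, `critic`, and their targets), the `replay_buffer.sample` interface, and all hyperparameters are illustrative assumptions rather than a reference implementation; D4PG as published also uses n-step returns and prioritized replay, which are omitted here for brevity.

```python
import torch

# Illustrative hyperparameters and categorical support (assumptions, not fixed by D4PG).
N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
GAMMA = 0.99

# Fixed support (atoms) of the categorical value distribution.
atoms = torch.linspace(V_MIN, V_MAX, N_ATOMS)
delta_z = (V_MAX - V_MIN) / (N_ATOMS - 1)


def project_distribution(next_probs, rewards, dones):
    """Project the shifted support r + gamma * z back onto the fixed atoms."""
    tz = (rewards + GAMMA * (1.0 - dones) * atoms.unsqueeze(0)).clamp(V_MIN, V_MAX)
    b = (tz - V_MIN) / delta_z            # fractional atom positions in [0, N_ATOMS - 1]
    lower, upper = b.floor(), b.ceil()
    # When b lands exactly on an atom (lower == upper), both weights below would be
    # zero, so route the full probability mass to the lower atom instead.
    lower_weight = (upper - b) + (lower == upper).float()
    upper_weight = b - lower
    projected = torch.zeros_like(next_probs)
    projected.scatter_add_(1, lower.long(), next_probs * lower_weight)
    projected.scatter_add_(1, upper.long(), next_probs * upper_weight)
    return projected


def learner_update(actor, critic, target_actor, target_critic,
                   actor_opt, critic_opt, replay_buffer, batch_size=256):
    """One learner step: gradient descent on the critic, gradient ascent on the actor."""
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
    rewards, dones = rewards.view(-1, 1), dones.view(-1, 1)

    # Target value distribution y_i = r_i + gamma * Z_theta'(s_{i+1}, mu_phi'(s_{i+1})),
    # projected onto the fixed support.
    with torch.no_grad():
        next_probs = target_critic(next_states, target_actor(next_states))  # (B, N_ATOMS)
        target_probs = project_distribution(next_probs, rewards, dones)

    # Critic loss: cross-entropy between the target and predicted distributions.
    log_probs = critic(states, actions).clamp_min(1e-8).log()
    critic_loss = -(target_probs * log_probs).sum(dim=1).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()                     # gradient descent on theta

    # Actor loss: negative expected value of the critic's distribution at mu_phi(s),
    # so a descent step on this loss is an ascent step on Q.
    q_values = (critic(states, actor(states)) * atoms).sum(dim=1)
    actor_loss = -q_values.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()                      # gradient ascent on phi
```

Because the critic outputs a distribution rather than a scalar, its loss is the cross-entropy between the projected target distribution and the predicted one; the actor is still updated exactly as in DDPG, by ascending the expected value of that distribution.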