Deep Q learning from demonstrations
The algorithm for Deep Q Learning from Demonstrations (DQfD) is given as follows:
- Initialize the main network parameter
- Initialize the target network parameter by copying the main network parameter
- Initialize the replay buffer with the expert demonstrations
- Set d, the number of time steps after which we update the target network parameter (a sketch of these initialization steps follows)
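The following is a minimal sketch of these initialization steps, assuming a small fully connected Q-network built with PyTorch. The environment sizes, the random placeholder demonstrations, and the update interval d are hypothetical values chosen only for illustration.

```python
import copy
from collections import deque

import numpy as np
import torch.nn as nn

# Hypothetical environment sizes used only for this sketch
state_dim, num_actions = 4, 2

def build_q_network(state_dim, num_actions):
    # A small feedforward Q-network: state in, one Q-value per action out
    return nn.Sequential(
        nn.Linear(state_dim, 64),
        nn.ReLU(),
        nn.Linear(64, num_actions),
    )

# Initialize the main network parameter
main_net = build_q_network(state_dim, num_actions)

# Initialize the target network parameter by copying the main network parameter
target_net = copy.deepcopy(main_net)
target_net.load_state_dict(main_net.state_dict())

# Placeholder expert demonstrations: random (state, action, reward, next_state, done)
# tuples standing in for real expert data, so that the sketch runs end to end
expert_demonstrations = [
    (np.random.randn(state_dim).astype(np.float32),   # state
     np.random.randint(num_actions),                   # action
     float(np.random.rand()),                          # reward
     np.random.randn(state_dim).astype(np.float32),    # next state
     0.0)                                              # done flag
    for _ in range(1000)
]

# Initialize the replay buffer with the expert demonstrations
replay_buffer = deque(maxlen=100_000)
replay_buffer.extend(expert_demonstrations)

# Set d, the target network update interval (in time steps)
d = 1000
```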
- Pre-training phase: For steps t = 1, 2, ..., T:
  - Sample a minibatch of experience from the replay buffer
  - Compute the loss J(Q)
  - Update the main network parameter using gradient descent
  - If t mod d = 0, update the target network parameter by copying the main network parameter (a code sketch of this pre-training loop follows these steps)
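The pre-training update above can be sketched as follows, reusing main_net, target_net, replay_buffer, and d from the initialization sketch. This is only an illustrative simplification of the loss J(Q): it combines the one-step TD loss with the supervised large-margin classification loss on the demonstration samples, while the n-step return term of the full DQfD loss is omitted and the L2 regularization is folded into the optimizer's weight decay. The hyperparameter values are placeholders.

```python
import random

import numpy as np
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-4, weight_decay=1e-5)
gamma = 0.99      # discount factor
margin = 0.8      # margin used in the supervised large-margin loss
batch_size = 32
T = 10_000        # number of pre-training steps (illustrative)

for t in range(1, T + 1):
    # Sample a minibatch of experience from the replay buffer
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # One-step TD loss
    q_values = main_net(states)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q
    td_loss = F.mse_loss(q_taken, td_target)

    # Supervised large-margin loss: during pre-training every sample comes from
    # the expert demonstrations, so the margin term applies to the whole batch
    margins = torch.full_like(q_values, margin)
    margins.scatter_(1, actions.unsqueeze(1), 0.0)   # zero margin for the expert action
    supervised_loss = ((q_values + margins).max(dim=1).values - q_taken).mean()

    # Simplified combined loss J(Q), followed by a gradient descent update
    loss = td_loss + supervised_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # If t mod d = 0, copy the main network parameter to the target network
    if t % d == 0:
        target_net.load_state_dict(main_net.state_dict())
```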
- Training phase: For steps t = 1, 2, ..., T:
  - Select an action
  - Perform the selected action, move to the next state, observe the reward, and store this transition information in the replay buffer ...