Deep Q Learning
The algorithm for deep Q learning is given as follows:
1. Initialize the main network parameter $\theta$ with random values
2. Initialize the target network parameter $\theta'$ by copying the main network parameter $\theta$
3. Initialize the replay buffer $\mathcal{D}$
4. For $N$ episodes, perform step 5
5. For each step in the episode, that is, for $t = 0, \ldots, T-1$:
   1. Observe the state $s$ and select an action using the epsilon-greedy policy: with probability $\epsilon$, select a random action $a$, and with probability $1-\epsilon$, select the action $a = \arg\max_{a} Q_\theta(s, a)$
   2. Perform the selected action, move to the next state $s'$, and obtain the reward $r$
   3. Store the transition information $(s, a, r, s')$ in the replay buffer $\mathcal{D}$
   4. Randomly sample a minibatch of $K$ transitions from the replay buffer $\mathcal{D}$
   5. Compute the target value, that is, $y_i = r_i + \gamma \max_{a'} Q_{\theta'}(s'_i, a')$
   6. Compute the loss, $L(\theta) = \frac{1}{K} \sum_{i=1}^{K} \left( y_i - Q_\theta(s_i, a_i) \right)^2$
   7. Compute the gradients of the loss and update the main network parameter $\theta$ using gradient descent...
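The steps listed above can be sketched in Python roughly as follows. This is a minimal, dependency-free illustration under stated assumptions, not a reference implementation: a linear function (`LinearQ`) stands in for the deep network, the gradient is written out by hand, and the hyperparameter values, the `done` flag for terminal transitions, the toy transitions, and all names (`LinearQ`, `select_action`, `train_step`) are invented for the example rather than taken from the text.

```python
import copy
import random
from collections import deque

GAMMA = 0.99    # discount factor (assumed value)
ALPHA = 0.1     # learning rate (assumed value)
EPSILON = 0.1   # exploration probability (assumed value)


class LinearQ:
    """Stand-in for the deep network: Q(s, a) = w[a] . s (an assumption for brevity)."""

    def __init__(self, n_features, n_actions, rng):
        # Random initial parameters, as in the first initialization step.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
                  for _ in range(n_actions)]

    def q_values(self, state):
        return [sum(wi * si for wi, si in zip(row, state)) for row in self.w]


def select_action(net, state, epsilon, rng):
    """Epsilon-greedy: random action with probability epsilon, else argmax_a Q(s, a)."""
    q = net.q_values(state)
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=q.__getitem__)


def train_step(main, target, buffer, k, rng):
    """Sample k transitions, compute targets with the target network, and take
    one gradient-descent step on the mean squared error.
    Returns the loss measured before the update."""
    batch = rng.sample(list(buffer), k)
    grads = [[0.0] * len(row) for row in main.w]
    loss = 0.0
    for s, a, r, s_next, done in batch:
        # Target: y = r + gamma * max_a' Q_target(s', a').
        # Using y = r on terminal transitions is an assumed convention.
        y = r if done else r + GAMMA * max(target.q_values(s_next))
        error = y - main.q_values(s)[a]
        loss += error ** 2 / k
        # Gradient of (1/k) * error^2 with respect to w[a] is -(2/k) * error * s.
        for j, sj in enumerate(s):
            grads[a][j] += -2.0 * error * sj / k
    # Gradient-descent update of the main parameters only.
    for a_idx, row in enumerate(main.w):
        for j in range(len(row)):
            row[j] -= ALPHA * grads[a_idx][j]
    return loss


rng = random.Random(0)
main = LinearQ(n_features=2, n_actions=2, rng=rng)
target = copy.deepcopy(main)   # target parameters start as a copy of the main ones
buffer = deque(maxlen=1000)    # replay buffer

# Hand-written toy transitions (s, a, r, s', done); a real agent would
# collect these by acting in an environment with select_action.
buffer.extend([
    ([1.0, 0.0], 0, 1.0, [0.0, 1.0], False),
    ([1.0, 0.0], 1, 0.0, [0.0, 1.0], False),
    ([0.0, 1.0], 0, 0.0, [1.0, 0.0], False),
    ([0.0, 1.0], 1, 1.0, [1.0, 0.0], True),
])

losses = [train_step(main, target, buffer, k=4, rng=rng) for _ in range(50)]
action = select_action(main, [1.0, 0.0], EPSILON, rng)
```

Only `main` is updated in `train_step`; `target` stays fixed while the targets are computed, which is the role the separate target network plays in the listing. In practice the linear stand-in and the hand-written gradient would be replaced by a neural network and an automatic-differentiation library.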