Categorical DQN
The algorithm for a categorical DQN is given as follows:
- Initialize the main network parameter $\theta$ with random values
- Initialize the target network parameter $\theta'$ by copying the main network parameter $\theta$
- Initialize the replay buffer $\mathcal{D}$, the number of atoms $N$, and also $V_{\min}$ and $V_{\max}$
- For N number of episodes, perform step 5
- For each step in the episode, that is, for $t = 0, \ldots, T-1$:
- Feed the state $s$ and support values to the main categorical DQN parameterized by $\theta$, and get the probability value $p_i(s, a)$ for each support value $z_i$. Then compute the Q value as $Q(s, a) = \sum_i z_i \, p_i(s, a)$ (a short code sketch of this computation and the action selection follows the algorithm steps).
- After computing the Q value, select an action using the epsilon-greedy policy, that is, with probability epsilon, select a random action $a$, and with probability 1-epsilon, select the action $a = \arg\max_a Q(s, a)$.
- Perform the selected action $a$, move to the next state $s'$, and obtain the reward $r$.
- Store the transition information $(s, a, r, s')$ in the replay buffer $\mathcal{D}$.
- Randomly sample a transition $(s, a, r, s')$ from the replay buffer $\mathcal{D}$ (a minimal buffer sketch also follows the algorithm steps).
- Feed the state $s$ and support values to the main categorical DQN parameterized by $\theta$, and get the predicted probability value for each support value.
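
To make the preceding steps concrete, here is a minimal NumPy sketch of the Q-value computation $Q(s, a) = \sum_i z_i \, p_i(s, a)$ and the epsilon-greedy action selection. The support range, atom count, action count, and the stand-in `categorical_probs` function are illustrative assumptions; in practice, the probabilities would come from the main categorical DQN parameterized by $\theta$.

```python
import numpy as np

# Illustrative settings (assumed, not taken from the text): support range,
# number of atoms, and action-space size.
V_MIN, V_MAX, N_ATOMS, N_ACTIONS = -10.0, 10.0, 51, 4

# Support values z_i, evenly spaced between V_min and V_max.
support = np.linspace(V_MIN, V_MAX, N_ATOMS)

def categorical_probs(state):
    """Stand-in for the main categorical DQN parameterized by theta: it should
    return one probability distribution over the support per action. Random
    logits pushed through a softmax keep this sketch runnable."""
    logits = np.random.randn(N_ACTIONS, N_ATOMS)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)   # shape: (N_ACTIONS, N_ATOMS)

def q_values(state):
    """Q(s, a) = sum_i z_i * p_i(s, a): the expected support value per action."""
    return categorical_probs(state) @ support     # shape: (N_ACTIONS,)

def select_action(state, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon,
    otherwise pick a = argmax_a Q(s, a)."""
    if np.random.rand() < epsilon:
        return np.random.randint(N_ACTIONS)
    return int(np.argmax(q_values(state)))

action = select_action(state=None)
```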
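
The storing and sampling steps can likewise be pictured with a simple replay buffer. The transition layout (state, action, reward, next state, done), the capacity, and the batch size below are illustrative choices rather than values prescribed by the text; sampling a single transition or a batch works the same way.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay buffer: a fixed-size deque of transition tuples."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop out first

    def store(self, state, action, reward, next_state, done):
        """Store the transition information in the buffer."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Randomly sample a batch of transitions for training."""
        return random.sample(self.buffer, batch_size)

# Usage: store transitions as the agent interacts, then sample for updates.
buffer = ReplayBuffer()
for t in range(100):
    buffer.store(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=32)
```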