Off-Policy TD Control – Q Learning
The algorithm for off-policy TD control (Q learning) is given as follows:
- Initialize a Q function Q(s, a) with random values
- For each episode:
    - Initialize the state s
    - For each step in the episode:
        1. Extract a policy from Q(s, a) (for example, an epsilon-greedy policy) and select an action a to perform in the state s
        2. Perform the action a, move to the new state s', and observe the reward r
        3. Update the Q value to Q(s, a) = Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
        4. Update s = s' (update the next state s' to the current state s)
    - If s is not a terminal state, repeat steps 1 to 4
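The steps above can be sketched in code. The following is a minimal, self-contained example, assuming a hypothetical deterministic chain environment (states 0 to 4, with a reward of 1 for reaching state 4); the environment and all hyperparameter values are illustrative choices, not part of the algorithm itself.

```python
import random

# Hypothetical chain environment: states 0..4, actions 0 (left) and 1 (right).
# Reaching the goal state 4 yields reward 1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    """Apply action a in state s; return (next_state, reward, done)."""
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize the Q function (zeros here, for reproducibility)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0          # initialize the state
        done = False
        for _ in range(200):  # cap the number of steps per episode
            # Extract an epsilon-greedy policy from Q and select an action
            if rng.random() < epsilon:
                a = rng.randrange(N_ACTIONS)
            else:
                a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
            # Perform the action, move to the new state, observe the reward
            s_next, r, done = step(s, a)
            # Off-policy update: bootstrap from the max over next actions
            target = r + gamma * max(Q[s_next]) * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next  # update the next state to the current state
            if done:
                break
    return Q

Q = q_learning()
# The greedy policy extracted from Q should move right in every non-terminal state.
policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(GOAL)]
print(policy)  # → [1, 1, 1, 1]
```

Note that the update bootstraps from max_a' Q(s', a') regardless of which action the epsilon-greedy behavior policy actually takes next, which is exactly what makes Q learning off-policy.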