For an RL agent to make decisions, it must learn the Q value function, which can be learned iteratively via Bellman's equation. When the agent starts to interact with the environment, it begins at a random initial state s(0) with randomly initialized Q values for every state-action pair. The agent's early actions are also largely random, since it has no meaningful Q values yet to make informed decisions. For each action taken, the environment returns a reward, based on which the agent starts to build the Q value table and improves it over time.
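To make this loop concrete, here is a minimal tabular Q-learning sketch of the process described above. The environment interface (env.reset(), env.step()) and the hyperparameters alpha (learning rate), gamma (discount factor), and epsilon (exploration rate) are illustrative assumptions, not taken from the text.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # Start with arbitrary (here zero) Q values for every state-action pair.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                      # initial state s(0)
        done = False
        while not done:
            # Early on, actions are largely random; as the table improves,
            # the greedy choice becomes more informed.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)    # environment returns a reward
            # Bellman update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q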
At any given state s(t) at iteration t, the agent takes the action a(t) that maximizes its long-term reward. The Q table holds these long-term reward estimates, and hence the chosen a(t) is based on the following heuristic:
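In standard Q-learning this heuristic is greedy selection over the current table (an epsilon-greedy variant adds occasional random actions for exploration):

a(t) = argmax over a of Q_t(s(t), a)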
The Q value table is also indexed by iteration...