In this chapter, TD learning algorithms were introduced. TD learning methods are based on reducing the difference between the estimates made by the agent at successive time steps. The SARSA algorithm implements an on-policy TD method, in which the action-value function Q is updated based on the result of the transition from state s = s(t) to state s' = s(t+1) via the action a(t), chosen according to the current policy π(s, a). Q-learning, unlike SARSA, is off-policy: while the behavior policy used to select actions is improved according to the estimates Q(s, a), the value-function update follows a strictly greedy target policy: given the next state, the action considered is always the one that maximizes the value, max_a Q(s', a).
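As an illustration of the distinction just described, the following minimal sketch contrasts the two update targets for a single observed transition. It is not code from the chapter: the table sizes, learning rate alpha, discount gamma, the epsilon_greedy helper, and the sample transition (s, a, r, s') are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: tabular SARSA vs. Q-learning updates on one transition.
# n_states, n_actions, alpha (learning rate) and gamma (discount) are assumed values.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99
rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

Q_sarsa = np.zeros((n_states, n_actions))
Q_qlearn = np.zeros((n_states, n_actions))

# Suppose one transition (s, a, r, s') has been observed.
s, a, r, s_next = 0, 1, 1.0, 2

# SARSA (on-policy): the target uses the action a' actually chosen by the policy in s'.
a_next = epsilon_greedy(Q_sarsa, s_next)
td_target_sarsa = r + gamma * Q_sarsa[s_next, a_next]
Q_sarsa[s, a] += alpha * (td_target_sarsa - Q_sarsa[s, a])

# Q-learning (off-policy): the target uses the greedy value max_a Q(s', a),
# regardless of which action the behavior policy will actually take next.
td_target_qlearn = r + gamma * np.max(Q_qlearn[s_next])
Q_qlearn[s, a] += alpha * (td_target_qlearn - Q_qlearn[s, a])
```

The only difference between the two updates is the bootstrap term: SARSA evaluates the action the policy actually takes in s', whereas Q-learning evaluates the greedy action, which is what makes it off-policy.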
Then, the basics of graph theory were addressed: the adjacency matrix...