Performing model-based learning
As the name suggests, the learning is guided by a predefined model of the environment. Here, the model is represented in the form of transition probabilities, and the key objective is to determine the optimal policy and value functions using these predefined model attributes (that is, transition probability matrices, or TPMs). A policy is defined as the learning mechanism of an agent traversing across multiple states; in other words, the mapping that identifies the best action for an agent in a given state, in order to move to the next state, is termed a policy.
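To make these attributes concrete, the following is a minimal sketch of how such a model might be encoded. The three-state layout, the action names, and the NumPy representation are illustrative assumptions, not details given in the text.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP, used purely for illustration.
# T[a][s, s'] is the probability of moving from state s to state s'
# when action a is taken (one transition probability matrix per action).
T = {
    "left": np.array([[1.0, 0.0, 0.0],
                      [0.8, 0.2, 0.0],
                      [0.0, 0.8, 0.2]]),
    "right": np.array([[0.2, 0.8, 0.0],
                       [0.0, 0.2, 0.8],
                       [0.0, 0.0, 1.0]]),
}

# R[a][s, s'] is the reward for the transition s -> s' under action a.
R = {
    "left": np.zeros((3, 3)),
    "right": np.array([[0.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0],
                       [0.0, 0.0, 0.0]]),
}

# A deterministic policy: one action chosen per state.
policy = {0: "right", 1: "right", 2: "right"}

# Each row of every TPM is a probability distribution, so it must sum to 1.
for a, tpm in T.items():
    assert np.allclose(tpm.sum(axis=1), 1.0), f"rows of TPM for {a} must sum to 1"
```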
The objective of the policy is to maximize the cumulative reward of transitioning from the start state to the destination state. This objective is defined as follows, where $P(s)$ is the cumulative reward obtained under the policy starting from a state $s$, and $R(s_t, a_t, s_{t+1})$ is the reward of transitioning from state $s_t$ to state $s_{t+1}$ by performing an action $a_t$:

$$P(s) = \sum_{t=0}^{T-1} R(s_t, a_t, s_{t+1})$$
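As a rough illustration of this objective, the snippet below sums the rewards collected along one sampled trajectory of the hypothetical MDP sketched earlier. The episode length and the sampling scheme are assumptions made for the example, not details from the text.

```python
import numpy as np

def rollout_return(T, R, policy, start_state, max_steps=10, rng=None):
    """Sum the rewards collected along one sampled trajectory."""
    if rng is None:
        rng = np.random.default_rng(0)
    s, total = start_state, 0.0
    for _ in range(max_steps):
        a = policy[s]                                  # action chosen by the policy
        s_next = rng.choice(len(T[a][s]), p=T[a][s])   # sample s' from the TPM row for (s, a)
        total += R[a][s, s_next]                       # reward for the transition s -> s'
        s = s_next
    return total

# Example (uses T, R, policy from the previous sketch):
# print(rollout_return(T, R, policy, start_state=0))
```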
The value function is of two types: the state-value function and the state-action value function. The state-value function, for a given policy, is defined as the expected cumulative reward obtained by starting in a given state and following that policy thereafter.
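One common way to compute the state-value function from the model attributes is iterative policy evaluation. The sketch below applies it to the hypothetical MDP above; the discount factor and convergence threshold are illustrative assumptions.

```python
import numpy as np

def evaluate_policy(T, R, policy, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation for a deterministic policy:
    V(s) = sum over s' of T[a][s, s'] * (R[a][s, s'] + gamma * V(s')), with a = policy[s]."""
    n_states = next(iter(T.values())).shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.empty_like(V)
        for s in range(n_states):
            a = policy[s]
            # Expected immediate reward plus discounted value of the next state.
            V_new[s] = np.sum(T[a][s] * (R[a][s] + gamma * V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Example (uses T, R, policy from the earlier sketch):
# print(evaluate_policy(T, R, policy))
```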