Reinforcement learning solution methods
In this section, we discuss in detail some of the methods used to solve reinforcement learning problems: dynamic programming (DP), the Monte Carlo method, and temporal-difference (TD) learning. These methods also address the problem of delayed rewards.
Dynamic Programming (DP)
DP is a set of algorithms used to compute optimal policies given a model of the environment, such as a Markov Decision Process (MDP). DP methods are computationally expensive and assume a perfect model of the environment; hence, they see limited adoption in practice. Conceptually, however, DP is the basis for many of the algorithms and methods used in the following sections:
- Evaluating the policy: A policy can be assessed by computing its value function iteratively. Knowing the value function of a policy helps find better policies.
- Improving the policy: Policy improvement is the process of computing a revised policy from the value function of the current policy.
- Value iteration and...
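The policy evaluation and policy improvement steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the two-state MDP `P`, the discount factor `GAMMA`, and the helper names `q_value`, `evaluate_policy`, and `improve_policy` are all hypothetical and chosen only for this example.

```python
# Hypothetical 2-state, 2-action MDP, invented purely for illustration.
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # discount factor (an assumed value)


def q_value(P, V, s, a, gamma=GAMMA):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def evaluate_policy(P, policy, gamma=GAMMA, tol=1e-10):
    """Iterative policy evaluation: sweep the states until values settle."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place update of the state value
        if delta < tol:
            return V


def improve_policy(P, V, gamma=GAMMA):
    """Policy improvement: pick the greedy action in every state."""
    return {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}


# Usage: evaluate an arbitrary starting policy, then improve it once.
start_policy = {0: 0, 1: 0}
V_start = evaluate_policy(P, start_policy)
better_policy = improve_policy(P, V_start)
V_better = evaluate_policy(P, better_policy)
print(better_policy, V_better)
```

In this toy model the starting policy collects no reward, so a single greedy improvement step switches both states to action 1. Re-evaluating the revised policy then yields higher state values, which is the sense in which a value function helps find better policies.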