The goal of policy optimization is to find a stochastic policy, that is, a distribution over actions for a given state, that maximizes the expected sum of rewards. It searches for the policy directly. The basic approach is to create a neural network (the policy network) that takes state information as input and outputs a distribution over the actions the agent might take.
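To make the idea of a policy network concrete, here is a minimal sketch in plain Python. It is an illustration only: the layer sizes, the tanh hidden layer, the softmax output, and the names `init_policy_params`, `action_distribution`, and `sample_action` are all assumptions made for this example, not something prescribed by the text. The network maps a state vector to a probability for each discrete action, and the agent samples its action from that distribution.

```python
import math
import random


def init_policy_params(state_dim, hidden_dim, n_actions, seed=0):
    """Create small random weights for a one-hidden-layer policy network."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(hidden_dim, state_dim),
        "b1": [0.0] * hidden_dim,
        "W2": matrix(n_actions, hidden_dim),
        "b2": [0.0] * n_actions,
    }


def affine(weights, bias, x):
    """Compute weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def action_distribution(params, state):
    """Forward pass: state -> hidden layer -> softmax over actions."""
    hidden = [math.tanh(h) for h in affine(params["W1"], params["b1"], state)]
    logits = affine(params["W2"], params["b2"], hidden)
    # Subtract the largest logit before exponentiating for numerical stability.
    largest = max(logits)
    exps = [math.exp(logit - largest) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(params, state, rng):
    """Draw one action index from the distribution produced by the policy."""
    probs = action_distribution(params, state)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical example: a 4-dimensional state and 2 possible actions.
params = init_policy_params(state_dim=4, hidden_dim=16, n_actions=2)
state = [0.1, -0.2, 0.05, 0.0]
probs = action_distribution(params, state)
action = sample_action(params, state, random.Random(1))
```

Here `probs` holds one probability per action and sums to 1, and `action` is an index drawn from it. Training would then adjust the weights in `params` so that actions leading to higher rewards become more probable.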
The two major components of policy optimization are:
- The weight parameters of the neural network are collected in a vector θ, which is also the parameter vector of our control policy π_θ. Our aim is therefore to train these weights to obtain the best policy, where a policy is valued by the expected sum of rewards it obtains. Different values of θ give different policies, and hence the optimal policy would be the one having...