Direct policy search through policy gradient
The last method we're going to discuss doesn't employ a proxy to find the optimal policy, but rather looks for it directly. In this context, we always assume we're working with a stochastic environment where the transitions are not fully deterministic. For example, a robot can assign a high probability to a transition but, in order to increase the robustness, it must also include a noise term that can lead the transition to a different state. Therefore, we need to include a transition probability $p(s_{t+1} \mid s_t, a_t)$.
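To make this concrete, the following minimal sketch (with a hypothetical tabular environment, not tied to any specific example in the text) represents $p(s_{t+1} \mid s_t, a_t)$ as a probability table and samples the next state from it, so that even a highly probable transition can occasionally lead somewhere else:

```python
import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

# Hypothetical transition tensor T[s, a, s'] = p(s' | s, a); each row sums to 1
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def step(state: int, action: int) -> int:
    """Sample the next state from p(s' | s, a)."""
    return rng.choice(n_states, p=T[state, action])

# Even if a transition has high probability, the noise can lead elsewhere
print([step(0, 0) for _ in range(10)])
```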
Given this element, we can evaluate the overall probability of an entire sequence (that normally ends in a final state): $S_k = (s_1, s_2, \ldots, s_k)$. To do so, we need to define a parameterized policy $\pi(a \mid s; \theta)$; therefore, the probability of $S_k$ can be expressed as a conditional one: $p(S_k \mid \theta)$. The full expression requires us to explicitly separate the state transition component from the policy and to introduce the actions (which were implicit in the previous expression):

$p(S_k \mid \theta) = p(s_1) \prod_{t=1}^{k-1} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t; \theta)$
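The factorization above can be sketched in a few lines of code. The snippet below (a hedged illustration with hypothetical names, assuming a tabular softmax policy and the transition table from the previous example) accumulates the log-probability of a trajectory by summing a policy term and a transition term at every step:

```python
import numpy as np

def softmax_policy(theta: np.ndarray, state: int) -> np.ndarray:
    """pi(a | s; theta) as a softmax over the row theta[state]."""
    z = np.exp(theta[state] - theta[state].max())
    return z / z.sum()

def trajectory_log_prob(theta, T, p0, states, actions):
    """log p(S_k | theta) for a trajectory (s_1, a_1, s_2, ..., s_k)."""
    log_p = np.log(p0[states[0]])                      # initial state term p(s_1)
    for i, a in enumerate(actions):
        s, s_next = states[i], states[i + 1]
        log_p += np.log(softmax_policy(theta, s)[a])   # policy term pi(a_t | s_t; theta)
        log_p += np.log(T[s, a, s_next])               # transition term p(s_{t+1} | s_t, a_t)
    return log_p
```

Working in log-space simply turns the product into a sum, which is the usual way to avoid numerical underflow when the trajectory is long.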