The theoretical background of the cross-entropy method
This section is optional and is included for readers who want to understand why the method works. If you wish, you can refer to the original paper on the cross-entropy method by Kroese et al. [Kro+11].
The basis of the cross-entropy method lies in the importance sampling theorem, which states the following:

$$\mathbb{E}_{x \sim p(x)}[H(x)] = \int_x p(x) H(x)\,dx = \int_x q(x)\,\frac{p(x)}{q(x)}\,H(x)\,dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)}\,H(x)\right]$$
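To make the identity concrete, here is a minimal numerical sketch (not from the paper): it estimates the expectation of a hypothetical H(x) under a normal p(x) both directly and by sampling from a shifted proposal q(x) and reweighting by p(x)/q(x). The distributions and the function H are illustrative assumptions.

```python
# Illustrative check of the importance sampling identity.
import numpy as np

rng = np.random.default_rng(0)

# Target p(x): standard normal; proposal q(x): normal with a shifted mean.
p_mean, p_std = 0.0, 1.0
q_mean, q_std = 1.0, 1.0

def pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def H(x):
    return x ** 2  # a hypothetical "reward" function of x

n = 100_000
# Direct Monte Carlo estimate: sample from p(x)
x_p = rng.normal(p_mean, p_std, n)
direct = H(x_p).mean()

# Importance-sampled estimate: sample from q(x), reweight by p(x)/q(x)
x_q = rng.normal(q_mean, q_std, n)
weights = pdf(x_q, p_mean, p_std) / pdf(x_q, q_mean, q_std)
reweighted = (weights * H(x_q)).mean()

print(f"E_p[H(x)] direct:   {direct:.3f}")      # ~1.0 (variance of N(0, 1))
print(f"E_p[H(x)] via q(x): {reweighted:.3f}")  # also ~1.0
```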
In our RL case, H(x) is a reward value obtained by some policy, x, and p(x) is a distribution over all possible policies. We don't want to maximize the reward by searching all possible policies; instead, we want to find a way to approximate p(x)H(x) by q(x), iteratively minimizing the distance between them. The distance between two probability distributions is measured by the Kullback-Leibler (KL) divergence:

$$KL(p_1(x) \parallel p_2(x)) = \mathbb{E}_{x \sim p_1(x)}\left[\log \frac{p_1(x)}{p_2(x)}\right] = \mathbb{E}_{x \sim p_1(x)}[\log p_1(x)] - \mathbb{E}_{x \sim p_1(x)}[\log p_2(x)]$$
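As a quick sanity check of this decomposition, the following sketch computes the KL divergence between two hypothetical discrete distributions and verifies that it equals the sum of the two expectation terms above (the values of p1 and p2 are made up for illustration):

```python
# Illustrative check of the KL decomposition for discrete distributions.
import numpy as np

p1 = np.array([0.1, 0.6, 0.3])
p2 = np.array([0.3, 0.3, 0.4])

kl = np.sum(p1 * np.log(p1 / p2))
term1 = np.sum(p1 * np.log(p1))    # E_{p1}[log p1(x)]  (negative entropy)
term2 = -np.sum(p1 * np.log(p2))   # -E_{p1}[log p2(x)] (cross-entropy)

print(f"KL(p1 || p2):  {kl:.4f}")
print(f"term1 + term2: {term1 + term2:.4f}")  # same value
```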
The first term in KL is the (negative) entropy of p1(x); it doesn't depend on p2(x), so it can be dropped during the minimization. The second term is called cross-entropy, which is a very common optimization objective in deep learning.
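Putting the pieces together, here is a minimal sketch of the generic cross-entropy iteration the text describes: sample candidates from q(x), keep the ones with the highest H(x), and refit q(x) against them, which for a Gaussian q amounts to minimizing the cross-entropy by re-estimating its mean and standard deviation. The Gaussian parameterization, the toy objective, and the elite fraction are assumptions for illustration, not the RL setup used in this chapter.

```python
# A toy cross-entropy optimization loop over a one-dimensional q(x).
import numpy as np

rng = np.random.default_rng(1)

def H(x):
    return -(x - 3.0) ** 2  # toy objective with its maximum at x = 3

mu, sigma = 0.0, 5.0          # initial parameters of the Gaussian q(x)
n_samples, elite_frac = 200, 0.1

for step in range(20):
    x = rng.normal(mu, sigma, n_samples)  # sample candidates from q(x)
    # Keep the elite: the top fraction of samples by H(x)
    elite = x[np.argsort(H(x))[-int(n_samples * elite_frac):]]
    # Refit q(x) to the elite samples; the small constant keeps sigma
    # from collapsing to zero
    mu, sigma = elite.mean(), elite.std() + 1e-3

print(f"estimated maximizer: {mu:.3f}")  # converges near 3.0
```

Each iteration moves q(x) toward the high-reward region of p(x)H(x), which is exactly the iterative distance minimization described above.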