Summary
Reinforcement learning describes the tasks of optimizing an agent stumbling into rewards episodically. Online, offline, value-based, or policy-based algorithms have been developed with the help of deep neural networks for various games and simulation environments.
Policy-gradients are a brute-force solution that require the sampling of actions during training and are better suited for small action spaces, although they provide first solutions for continuous search spaces.
Policy-gradients also work to train non-differentiable stochastic layers in a neural net and back propagate gradients through them. For example, when propagation through a model requires to sample following a parameterized submodel, gradients from the top layer can be considered as a reward for the bottom network.
In more complex environments, when there is no obvious reward (for example understanding and inferring possible actions from the objects present in the environment), reasoning helps humans optimize their...