Complications in RL
The first thing to note is that observations in RL depend on an agent's behavior and, to some extent, are the result of that behavior. If your agent decides to do inefficient things, then the observations will tell you nothing about what it has done wrong and what should be done to improve the outcome (the agent will just get negative feedback all the time). If the agent is stubborn and keeps making mistakes, the observations can give the false impression that there is no way to get a larger reward (life is suffering), which could be totally wrong.
In ML terms, this can be rephrased as having non-i.i.d. data. The abbreviation i.i.d. stands for independent and identically distributed, a requirement for most supervised learning methods.
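To make the non-i.i.d. point concrete, here is a minimal sketch in Python. It assumes a hypothetical five-state "corridor" environment (not one discussed in this book) and simply counts which states two different policies end up observing. The point is only that the data an agent collects is shaped by its own behavior, so the distribution shifts whenever the policy changes.

```python
import random
from collections import Counter

def step(state, action):
    """Toy corridor: states 0..4, move left (-1) or right (+1), clipped.
    Reward 1.0 is given only when the agent reaches state 4."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def collect(policy, episodes=1000, horizon=10):
    """Count the states visited by a policy -- the 'dataset' RL learns from."""
    visited = Counter()
    for _ in range(episodes):
        state = 0
        for _ in range(horizon):
            state, _ = step(state, policy(state))
            visited[state] += 1
    return visited

random.seed(0)
# A "lazy" policy that mostly moves left and a "curious" one that mostly moves right.
lazy = lambda s: -1 if random.random() < 0.9 else 1
curious = lambda s: 1 if random.random() < 0.9 else -1

print("States seen by the lazy policy:   ", collect(lazy))
print("States seen by the curious policy:", collect(curious))
# The two policies observe very different state distributions: the training
# data is neither independent of the agent nor identically distributed.
```

Running this shows the lazy policy spending almost all of its time in state 0 and never seeing the rewarding state, while the curious policy sees mostly states 3 and 4. Neither dataset looks like the other, which is exactly the situation a supervised learner is not designed for.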
The second thing that complicates our agent's life is that it needs to not only exploit the knowledge it has learned, but also actively explore the environment, because maybe doing things differently will significantly improve the outcome.
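One simple way to trade off these two activities (used here purely as an illustration, not as the specific method this book will rely on) is epsilon-greedy action selection: with a small probability the agent tries a random action, and otherwise it picks the action its current value estimates favor. The q_values and epsilon below are hypothetical placeholders.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore a random action;
    otherwise exploit the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

# Example: the agent usually picks action 2 (highest estimate, 0.9),
# but occasionally tries the others in case its estimates are wrong.
print(epsilon_greedy([0.1, 0.5, 0.9, 0.2]))
```

Setting epsilon too high wastes time on random behavior and lowers the reward; setting it too low means the agent may never discover that a different action was better all along.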