Back in Chapter 5, Introducing DRL, we learned that a Markov Decision Process (MDP) defines the state model an agent uses to calculate the value of an action. In the case of Q-learning, we saw how a table or grid could hold the entire MDP for an environment such as Frozen Pond or GridWorld. These approaches are tabular: they explicitly represent every state in the environment (every square in a grid game, for instance). In most complex games and environments, however, the agent cannot map the full physical or visual state, and the problem becomes partially observable, or what we refer to as a partially observable Markov decision process (POMDP).
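To make the tabular idea concrete, here is a minimal sketch of Q-learning on a small grid, assuming a 4x4 layout with 4 actions and illustrative hyperparameters (these values and function names are not from the book, just an example of the technique):

```python
# Minimal tabular Q-learning sketch (assumed toy setup, not the book's exact code).
import numpy as np

n_states = 16      # e.g., a 4x4 grid: one table row per square
n_actions = 4      # up, down, left, right
Q = np.zeros((n_states, n_actions))   # the whole state space fits in one table

alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount, exploration rate

def choose_action(state):
    # Epsilon-greedy: explore occasionally, otherwise take the best known action
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    # Standard Q-learning update toward the bootstrapped target
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```

The key point is that the table is indexed directly by state, which only works when every state can be enumerated and observed.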
A POMDP defines a process in which the agent never has a complete view of its environment; instead, it learns to choose actions based on a derived, general policy.
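The following toy sketch (an assumed example, not the book's code) illustrates why partial observability breaks the tabular approach: the agent sees only a local 3x3 patch of the grid, so many distinct world states produce the same observation, and the agent must map that observation through a parameterized policy rather than look up a row per state:

```python
# Toy policy over partial observations (illustrative assumption, not from the book).
import numpy as np

n_actions = 4
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(9, n_actions))  # one weight per cell of the 3x3 view

def policy(observation_patch):
    # observation_patch: the agent's local 3x3 view (a partial observation)
    logits = observation_patch.reshape(-1) @ weights
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sample an action from the general policy instead of indexing a state table
    return int(rng.choice(n_actions, p=probs))

local_view = rng.integers(0, 2, size=(3, 3)).astype(float)  # e.g., walls vs. free cells
action = policy(local_view)
```

In later chapters, the hand-rolled weights above are replaced by a learned function approximator, which is exactly where deep reinforcement learning comes in.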