Q-learning is an algorithm designed to solve a Markov decision process (MDP), a type of control problem that seeks to optimize a variable (here, the agent's long-term reward) within a set of constraints. An MDP is built on a Markov chain, a state model in which the probability distribution over future states depends only on the current state and not on any states that came before it.
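To make the Markov property concrete, here is a minimal sketch of a Markov chain in Python. The two weather states and their transition probabilities are hypothetical values chosen purely for illustration; the point is that `next_state` looks only at the current state when sampling the next one.

```python
import random

# Hypothetical two-state chain; the probabilities are illustrative only.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state(current, rng):
    """Sample the next state using only the current state (the Markov property)."""
    states = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def simulate(start, steps, seed=0):
    """Return a path of `steps` transitions beginning at `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        # No earlier history is consulted, only the most recent state.
        path.append(next_state(path[-1], rng))
    return path
```

Calling `simulate("sunny", 10)` returns a list of 11 states. Nothing in the sampling step needs the earlier part of the path, which is exactly what the definition above requires.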
An MDP extends a Markov chain by introducing actions that a learning agent can take and rewards that it receives for taking them, which allows for choice and decision-making in a stochastic process. Q-learning, like other reinforcement learning (RL) algorithms, models the state space of an MDP and progressively approaches an optimal solution by simulating the decisions of a learning agent working within the constraints of the model.
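As a rough illustration of how these pieces fit together, the sketch below runs tabular Q-learning on a tiny MDP. Everything about the environment is an assumption made for this example: a hypothetical five-state corridor with deterministic moves, a reward of 1.0 for reaching the rightmost state, and arbitrarily chosen values for the learning rate `alpha`, discount factor `gamma`, and exploration rate `epsilon`.

```python
import random

N_STATES = 5      # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy dynamics: reward 1.0 only when the goal is reached."""
    if action == 0:
        nxt = max(0, state - 1)
    else:
        nxt = min(N_STATES - 1, state + 1)
    done = nxt == N_STATES - 1
    reward = 1.0 if done else 0.0
    return nxt, reward, done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a Q-table by simulating an epsilon-greedy agent."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q[state])          # exploit, breaking ties at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            nxt, reward, done = step(state, action)
            # Q-learning update: move Q(s, a) toward the bootstrapped target.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q
```

After training, the greedy policy can be read off the table by picking, for each state, the action with the larger Q-value. In this toy corridor that policy is to move right from every non-goal state.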
In the next chapter, we'll explore the OpenAI Gym package, the different environments we'll be using, and get comfortable working with...