Our journey begins with understanding what reinforcement learning is about. Those who are familiar with machine learning may be aware of several learning paradigms, namely supervised learning and unsupervised learning. In supervised learning, a machine learning model has a supervisor that gives the ground truth for every data point. The model learns by minimizing the distance between its own prediction and the ground truth. The dataset is thus required to have an annotation for each data point; for example, each image of a dog or a cat would have its respective label. In unsupervised learning, the model does not have access to the ground truths of the data and thus has to learn about the distribution and patterns of the data without them.
In reinforcement learning, the agent refers to the model/algorithm that learns to complete a particular task. The agent learns primarily by receiving reward signals, which are scalar indications of how well it is performing a task.
Suppose we have an agent that is tasked with controlling a robot's walking movement; the agent would receive positive rewards for successfully walking toward a destination and negative rewards for falling/failing to make progress.
Moreover, unlike in supervised learning, these reward signals are not given to the model immediately; rather, they are returned as a consequence of a sequence of actions that the agent makes. Actions are simply the things an agent can do within its environment. The environment refers to the world in which the agent resides and is primarily responsible for returning reward signals to the agent. An agent's actions are usually conditioned on what the agent perceives from the environment. What the agent perceives is referred to as the observation or the state of the environment. What further distinguishes reinforcement learning from other paradigms is that the actions of the agent can alter the environment and its subsequent responses.
For example, suppose an agent is tasked with playing Space Invaders, the popular Atari 2600 game. The environment is the game itself, along with the logic upon which it runs. During the game, the agent queries the environment to make an observation. The observation is simply an array of shape (210, 160, 3), which is the screen of the game that displays the agent's ship, the enemies, the score, and any projectiles. Based on this observation, the agent takes an action, which can be moving left or right, shooting a laser, or doing nothing. The environment receives the agent's action as input and makes any necessary updates to the state.
For instance, if a laser touches an enemy ship, that ship is removed from the game. If the agent decides to simply move to the left, the game updates the agent's coordinates accordingly. This process repeats until a terminal state, a state that represents the end of the sequence, is reached. In Space Invaders, the terminal state corresponds to the agent's ship being destroyed, at which point the game returns the score it keeps track of, a value calculated based on the number of enemy ships the agent successfully destroys.
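To make this interaction loop concrete, here is a minimal sketch that plays one episode of Space Invaders with an agent that acts randomly. It assumes the gymnasium package with its Atari extras (and the game ROMs) is installed; the environment id and return values follow the current Gymnasium API and may differ between versions.

```python
import gymnasium as gym

# Assumes gymnasium[atari] and the Atari ROMs are installed;
# the environment id may vary between library versions.
env = gym.make("ALE/SpaceInvaders-v5")

observation, info = env.reset()          # the initial screen, a (210, 160, 3) array
print(observation.shape)                 # -> (210, 160, 3)

total_reward = 0.0
terminated = truncated = False
while not (terminated or truncated):
    # A stand-in "agent": sample a random action (noop, fire, move left/right, ...)
    action = env.action_space.sample()

    # The environment applies the action, updates its state, and returns
    # the next observation together with a scalar reward signal.
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

print("Episode finished with score:", total_reward)
env.close()
```

A real agent would replace `env.action_space.sample()` with a decision based on the current observation.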
Let's recap the terms we have learned about so far:
| Term | Description | Examples |
| --- | --- | --- |
| Agent | A model/algorithm that is tasked with learning to accomplish a task. | Self-driving cars, walking robots, video game players |
| Environment | The world in which the agent acts. It is responsible for controlling what the agent perceives and providing feedback on how well the agent is performing a particular task. | The road on which a car drives, a video game, the stock market |
| Action | A decision the agent makes in an environment, usually dependent on what the agent perceives. | Steering a car, buying or selling a particular stock, shooting a laser from the spaceship the agent is controlling |
| Reward signal | A scalar indication of how well the agent is performing a particular task. | Space Invaders score, return on investment for some stock, distance covered by a robot learning to walk |
| Observation/state | A description of the environment as can be perceived by the agent. | Video from a dashboard camera, the screen of the game, stock market statistics |
| Terminal state | A state at which no further actions can be made by the agent. | Reaching the end of a maze, the ship in Space Invaders getting destroyed |
Put formally, at a given timestep, t, the following happens for an agent, P, and environment, E:
- P queries E for some observation, s_t
- P decides to take an action, a_t, based on the observation s_t
- E receives a_t and returns a reward, r_t, based on the action
- P receives r_t
- E updates to the next state, s_t+1, based on a_t and other factors
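Expressed as code, one pass through these steps might look like the following sketch. The agent and env objects, and their method names (observe, act, receive, step), are hypothetical placeholders standing in for P and E rather than any particular library's interface.

```python
# A hypothetical sketch of the loop between P (agent) and E (environment);
# the object and method names are placeholders, not a specific library's API.
def run_episode(agent, env):
    total_reward = 0.0
    state = env.observe()                 # P queries E for s_t
    terminal = False
    while not terminal:
        action = agent.act(state)         # P chooses a_t based on s_t
        reward, next_state, terminal = env.step(action)  # E computes r_t and s_t+1
        agent.receive(reward)             # P receives r_t
        total_reward += reward
        state = next_state                # the loop continues from s_t+1
    return total_reward
```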
How does the environment compute r_t and s_t+1? The environment usually has its own algorithm that computes these values based on numerous inputs and factors, including the action the agent takes.
Sometimes, the environment is composed of multiple agents that try to maximize their own rewards. The way gravity acts upon a ball dropped from a height is a good analogy for how the environment works: just as our surroundings obey the laws of physics, the environment has some internal mechanism for computing rewards and the next state. This internal mechanism is usually hidden from the agent, and so our job is to build agents that can learn to perform well at their respective tasks despite this uncertainty.
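As an illustration of such a hidden internal mechanism, here is a small, hypothetical corridor environment (not taken from the text): its step() method computes the reward and the next state internally, while the agent interacting with it only ever sees the observation, the reward, and whether the episode has ended.

```python
import random

class CorridorEnv:
    """A hypothetical toy environment: the agent walks along a corridor of
    `length` cells and is rewarded for reaching the far end. The rules for
    computing rewards and the next state live inside step(), hidden from
    the agent."""

    def __init__(self, length=10):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position                       # initial observation, s_0

    def step(self, action):
        # Internal dynamics: action 1 moves right, any other action moves left,
        # and the position never drops below zero.
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)

        # Internal reward computation: +1 for reaching the goal, a small
        # penalty otherwise to discourage wandering.
        terminal = self.position >= self.length
        reward = 1.0 if terminal else -0.01
        return self.position, reward, terminal     # s_t+1, r_t, done

# A random agent interacting with the environment; it never sees step()'s logic.
env = CorridorEnv()
state, terminal, score = env.reset(), False, 0.0
while not terminal:
    action = random.choice([0, 1])                 # pick a_t at random
    state, reward, terminal = env.step(action)     # receive r_t and s_t+1
    score += reward
print("Episode score:", score)
```

Because the agent only ever sees the resulting state, reward, and terminal flag, it has to learn good behavior purely from this feedback, which is exactly the setting reinforcement learning addresses.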
In the following sections, we will discuss in more detail the main protagonist of every reinforcement learning problem—the agent.