We can define a container as a process with all its requirements isolated using cgroups and namespace kernel features. A process is the way we execute a task within the operating system. If we define a program as the set of instructions developed using a programming language, included in an executable format on disk, we can say that a process is a program in action.
The execution of a process involves the use of some system resources, such as CPU and memory, and although it runs in its own environment, it can use the same information as other processes sharing the same host system.
Operating systems provide tools for manipulating the behavior of processes during execution, allowing system administrators to prioritize the critical ones. Each process running on a system is uniquely identified by a Process Identifier (PID). A parent-child relationship between processes is established when one process executes a new process (or creates a new thread) during its execution. The new process (or sub-process) has the previous one as its parent, and so on. The operating system stores information about process relations using PIDs and parent PIDs. Processes inherit their ownership from the user who runs them, so users own and manage their own processes. Only administrators and privileged users can interact with other users’ processes. This behavior also applies to child processes created by our executions.
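The PID and parent PID relationship described above can be observed from any program. The following is a minimal Python sketch, not a definitive tool: the function name spawn_child_and_report is made up for illustration, and it only relies on the standard os, subprocess, and sys modules.

```python
import os
import subprocess
import sys


def spawn_child_and_report():
    """Start a child Python process and return (child_pid, parent_pid_seen_by_child)."""
    # The child prints its own PID and the PID of its parent, separated by a space.
    child_code = "import os; print(os.getpid(), os.getppid())"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        text=True,
        check=True,
    )
    child_pid, parent_pid = (int(field) for field in result.stdout.split())
    return child_pid, parent_pid


if __name__ == "__main__":
    child_pid, parent_pid = spawn_child_and_report()
    print(f"this process:  PID {os.getpid()}, parent PID {os.getppid()}")
    print(f"child process: PID {child_pid}, parent PID {parent_pid}")
```

The parent PID reported by the child should match the PID of the process that launched it, which is exactly the relation the operating system records.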
Each process runs in its own environment and we can manipulate its behavior using operating system features. Processes can access files as needed and use file descriptors during execution to manage these filesystem resources.
The operating system kernel manages all processes, scheduling them on its physical or virtualized CPUs, giving them appropriate CPU time, and providing them with memory or network resources (among others).
These definitions are common to all modern operating systems and are key for understanding software containers, which we will discuss in detail in the next section.
Understanding the main concepts of containers
We have learned that as opposed to virtualization, containers are processes running in isolation and sharing the host operating system kernel. In this section, we will review the components that make containers possible.
Kernel process isolation
We already introduced kernel process namespace isolation as a key feature for running software containers. Operating system kernels provide namespace-based isolation. This feature has been present in Linux kernels since 2006 and provides different layers of isolation associated with the properties or attributes a process has when it runs on a host. When we apply these namespaces to processes, they run with their own set of properties and do not see the other processes running alongside them. Hence, kernel resources are partitioned such that each set of processes sees a different set of resources. Resources may exist in multiple namespaces and processes may share them.
Containers, as they are host processes, run with their own associated set of kernel namespaces, such as the following:
- Processes: The container’s main process is the parent of others within the container. All these processes share the same process namespace.
- Network: Each container receives a network stack with unique interfaces and IP addresses. Processes (or containers) sharing the same network namespace will get the same IP address. Communications between containers pass through host bridge interfaces.
- Users: Users within containers are unique; therefore, each container gets its own set of users, but these users are mapped to real host user identifiers.
- Inter-process communication (IPC): Each container receives its own set of shared memory, semaphores, and message queues so that it doesn’t conflict with other processes on the host.
- Mounts: Each container mounts a root filesystem; we can also attach remote and host local mounts.
- Unix time-sharing (UTS): Each container is assigned its own hostname (and domain name), while the time remains synced with the underlying host.
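On a Linux host, the namespaces a process belongs to are exposed as symbolic links under /proc/<pid>/ns. The sketch below, which assumes a Linux system with /proc mounted, lists them for the current process; the function name list_namespaces is hypothetical, and the function simply returns an empty dictionary on systems where this directory does not exist.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_name: identifier} for a process, or {} if unavailable."""
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    if not os.path.isdir(ns_dir):
        # Not a Linux host, /proc is not mounted, or the PID does not exist.
        return namespaces
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symbolic link whose target looks like "pid:[4026531836]".
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Some entries may not be readable without extra privileges.
            continue
    return namespaces


if __name__ == "__main__":
    for name, identifier in list_namespaces().items():
        print(f"{name:20} {identifier}")
```

Two processes that show the same identifier for an entry share that namespace; running this inside a container and on the host would typically show different identifiers for the isolated ones.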
Processes running inside a container sharing the same process kernel namespace will receive PIDs as if they were running alone inside their own kernel. The container’s main process is assigned PID 1 and other sub-processes or threads will get subsequent IDs, inheriting the main process hierarchy. The container will die if the main process dies (or is stopped).
The following diagram shows how our system manages container PIDs inside the container’s PID namespace (represented by the gray box) and outside, at the host level:
Figure 1.1 – Schema showing a hierarchy of PIDs when you execute an NGINX web server with four worker processes
In the preceding figure, the main process running inside a container is assigned PID 1, while the other processes are its children. The host runs its own PID 1 process and all other processes run in association with this initial process.
Control groups
A cgroup is a feature provided by the Linux kernel that enables us to limit and isolate the host resources associated with processes (such as CPU, memory, and disk I/O). This provides the following features:
- Resource limits: A cgroup limits the amount of host resources, such as CPU or memory, that its processes can use
- Prioritization: If resource contention is observed, the amount of host resources (CPU, disk, or network) that a process can use compared to processes in another cgroup can be controlled
- Accounting: Cgroups monitor and report resource usage at the cgroup level
- Control: We can manage the status of all processes in a cgroup
With appropriate limits in place, cgroup isolation prevents containers from bringing down a host by exhausting its resources. An interesting fact is that you can use cgroups without software containers just by mounting a cgroup filesystem, adjusting the CPU limits of a group, and finally adding a set of PIDs to this group. This procedure applies to both cgroups v1 and the newer cgroups v2.
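As an illustration of how such a CPU limit is expressed, the following Python sketch reads the cpu.max file that cgroups v2 exposes for each group. It assumes the v2 layout, in which the file contains a quota and a period in microseconds, or the word max when no limit is set; the function names are made up for this example and cgroups v1 uses different file names.

```python
from pathlib import Path


def parse_cpu_max(content):
    """Convert the text of a cgroup v2 cpu.max file into a number of CPUs.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no limit is set. Returns None when the group is unlimited.
    """
    quota, period = content.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def read_cpu_limit(cgroup_dir):
    """Read cpu.max from a cgroup directory; None if absent or unlimited."""
    cpu_max = Path(cgroup_dir) / "cpu.max"
    if not cpu_max.is_file():
        return None
    return parse_cpu_max(cpu_max.read_text())


if __name__ == "__main__":
    # "50000 100000" means 50 ms of CPU time every 100 ms, that is, half a CPU.
    print(parse_cpu_max("50000 100000"))
```

In the same directory, cgroups v2 keeps a cgroup.procs file; writing a PID into it is how a process is added to the group.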
Container runtime
A container runtime, or container engine, is a piece of software that runs containers on a host. It is responsible for downloading container images from a registry to create containers, monitoring the resources available in the host to run the images, and managing the isolation layers provided by the operating system. The container runtime also reviews the current status of containers and manages their life cycle, restarting them when their main process dies (if we declare them to be available whenever this happens).
We generally group container runtimes into low-level runtimes and high-level runtimes.
Low-level runtimes are those simple runtimes focused only on software container execution. We can consider runC and crun in this group. Created by Docker and the Open Container Initiative (OCI), runC is still the de facto standard. Red Hat created crun, which is faster than runC with a lower memory footprint. These low-level runtimes do not require container images to run – we can use a configuration file and a folder with our application and all its required files (which is the content of a Docker image, but without any metadata information). This folder usually contains a file structure resembling a Linux root filesystem, which, as we mentioned before, is everything required by an application (or component) to work. Imagine that we execute the ldd command on our binaries and libraries and iterate this process with all its dependencies, and so on. We will get a complete list of all the files strictly required for the process and this would become the smallest image for the application.
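The ldd idea can be sketched in a few lines of Python. This is only an illustration under some assumptions: it expects the usual glibc ldd output format (lines such as "libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x...)"), the function names are hypothetical, and a real minimal image would also need configuration and data files that ldd cannot discover.

```python
import re
import shutil
import subprocess

# Matches the absolute path in ldd lines such as:
#   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f...)
#   /lib64/ld-linux-x86-64.so.2 (0x00007f...)
_LDD_PATH = re.compile(r"(?:=>\s*)?(/\S+)\s+\(0x[0-9a-fA-F]+\)")


def parse_ldd_output(output):
    """Return the sorted list of library paths found in ldd output."""
    paths = set()
    for line in output.splitlines():
        match = _LDD_PATH.search(line)
        if match:
            paths.add(match.group(1))
    return sorted(paths)


def collect_dependencies(binary):
    """Run ldd on a binary (if ldd is installed) and return its library paths."""
    if shutil.which("ldd") is None:
        return []
    result = subprocess.run(["ldd", binary], capture_output=True, text=True)
    return parse_ldd_output(result.stdout)


if __name__ == "__main__":
    shell = shutil.which("sh")
    if shell:
        for path in collect_dependencies(shell):
            print(path)
```

Copying the binary plus the listed files into an empty folder, keeping their paths, gives a first approximation of the root filesystem that a low-level runtime needs.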
High-level container runtimes usually implement the Container Runtime Interface (CRI) specification, which is defined by the Kubernetes project rather than by the OCI. This was created to make container orchestration more runtime-agnostic. In this group, we have Docker, CRI-O, and Windows/Hyper-V containers.
The CRI defines the rules so that we can integrate our container runtimes into container orchestrators, such as Kubernetes. Container runtimes should have the following characteristics:
- Be capable of starting/stopping pods
- Deal with all containers (start, pause, stop, and delete them)
- Manage container images
- Provide metrics collection and access to container logs
The Docker container runtime became mainstream in 2016, making the execution of containers very easy for users. CRI-O was created explicitly for the Kubernetes orchestrator by Red Hat to allow the execution of containers using any OCI-compliant low-level runtime. High-level runtimes provide tools for interacting with them, and that’s why most people choose them.
A middle ground between low-level and high-level container runtimes is provided by containerd, which is an industry-standard container runtime. It runs on Linux and Windows and can manage the complete container life cycle.
The technology behind runtimes is evolving very fast; we can even improve the interaction between containers and hosts using sandboxes (gVisor from Google) and virtualized runtimes (Kata Containers). The former increases containers’ isolation by not exposing the host’s kernel directly to them. A dedicated user-space kernel with restricted capabilities is provided to containers as a proxy to the real kernel. Virtualized runtimes, on the other hand, use virtualization technology to isolate a container within a very small virtual machine. Although both cases add some load to the underlying operating system, security is increased as containers don’t interact directly with the host’s kernel.
Container runtimes only review the main process execution. If any other process running inside a container dies and the main process isn’t affected, the container will continue running.
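The way a runtime watches only the main process and restarts it when required can be approximated with a small supervisor loop. This Python sketch is a deliberately simplified assumption of that behavior: run_with_restart and max_restarts are invented names, and a real runtime also handles signals, back-off delays, and isolation.

```python
import subprocess
import sys


def run_with_restart(command, max_restarts=3):
    """Run a command and restart it while it exits with a non-zero code.

    Returns (attempts, last_exit_code). Only the main process is watched;
    whatever children it starts are not tracked here.
    """
    attempts = 0
    while True:
        attempts += 1
        exit_code = subprocess.run(command).returncode
        # Stop on a clean exit or once the restart budget is used up.
        if exit_code == 0 or attempts > max_restarts:
            return attempts, exit_code


if __name__ == "__main__":
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(run_with_restart(failing, max_restarts=2))
```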
Kernel capabilities
Starting with Linux kernel release 2.2, the operating system divides process privileges into distinct units, known as capabilities. These capabilities can be enabled or disabled by the operating system and by system administrators.
Previously, we learned that containers run processes in isolation using the host’s kernel. However, it is important to know that only a restricted set of these kernel capabilities are allowed inside containers unless they are explicitly declared. Therefore, containers improve their processes’ security at the host level because those processes can’t do anything they want. The capabilities that are currently available inside a container running on top of the Docker container runtime are SETPCAP, MKNOD, AUDIT_WRITE, CHOWN, NET_RAW, DAC_OVERRIDE, FOWNER, FSETID, KILL, SETGID, SETUID, NET_BIND_SERVICE, SYS_CHROOT, and SETFCAP.
This set of capabilities allows, for example, processes inside a container to attach and listen on ports below 1024 (the NET_BIND_SERVICE capability) or use ICMP (the NET_RAW capability).
If our process inside a container requires us to, for example, create a new network interface (perhaps to run a containerized OpenVPN server), the NET_ADMIN capability should be included.
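The kernel stores each capability set as a bitmask, which Linux exposes in hexadecimal in /proc/<pid>/status (the CapEff line holds the effective set). The following Python sketch decodes such a mask; the bit positions follow the linux/capability.h header, only the first 32 capabilities are named, and the helper names and the DOCKER_DEFAULTS set (taken from the list given earlier in this section) are assumptions made for this example.

```python
# Names of the first 32 Linux capabilities, indexed by bit position
# (as defined in linux/capability.h). Newer ones are reported as CAP_<n>.
CAPABILITY_BITS = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP",
]

# The capabilities listed in the text for containers run by Docker.
DOCKER_DEFAULTS = {
    "SETPCAP", "MKNOD", "AUDIT_WRITE", "CHOWN", "NET_RAW", "DAC_OVERRIDE",
    "FOWNER", "FSETID", "KILL", "SETGID", "SETUID", "NET_BIND_SERVICE",
    "SYS_CHROOT", "SETFCAP",
}


def encode_capabilities(names):
    """Build the bitmask that has one bit set per capability name."""
    mask = 0
    for name in names:
        mask |= 1 << CAPABILITY_BITS.index(name)
    return mask


def decode_capabilities(mask):
    """Return the set of capability names enabled in a bitmask."""
    names, bit = set(), 0
    while mask:
        if mask & 1:
            known = bit < len(CAPABILITY_BITS)
            names.add(CAPABILITY_BITS[bit] if known else f"CAP_{bit}")
        mask >>= 1
        bit += 1
    return names


def effective_capabilities(status_path="/proc/self/status"):
    """Decode the CapEff line of a status file; None if it cannot be read."""
    try:
        with open(status_path) as status:
            for line in status:
                if line.startswith("CapEff:"):
                    return decode_capabilities(int(line.split()[1], 16))
    except OSError:
        pass
    return None


if __name__ == "__main__":
    print(hex(encode_capabilities(DOCKER_DEFAULTS)))
    print(sorted(effective_capabilities() or []))
```

Comparing the decoded set of a containerized process against the list above is a quick way to check whether a capability such as NET_ADMIN has actually been granted.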
Important note
Container runtimes allow containers to run with full privileges using special parameters. The processes within these containers will run with all kernel capabilities, which can be very dangerous. You should avoid using privileged containers; it is best to take some time to verify which capabilities are needed by an application to work correctly.
Container orchestrators
Now that we know that we need a runtime to execute containers, we must also understand that this will work in a standalone environment, without hardware high availability. This means that server maintenance, operating system upgrades, and any other problem at the software, operating system, or hardware levels may affect your application.
High availability requires resource redundancy and thus more servers and/or hardware. These resources will allow containers to run on multiple hosts, each one with a container runtime. However, maintaining application availability in this situation isn’t easy. We need to ensure that containers will be able to run on any of these nodes; in the Overlay filesystems section, we’ll learn that synchronizing container-related resources between nodes involves more than just copying a few files. Container orchestrators manage node resources and provide them to containers. They schedule containers as needed, take care of their status, provide resources for persistence, and manage internal and external communications (in Chapter 6, Fundamentals of Orchestration, we will learn how some orchestrators delegate some of these features to different modules to optimize their work).
The most famous and widely used container orchestrator today is Kubernetes. It has a lot of great features to help manage clustered containers, although the learning curve can be tough. Docker Swarm, in turn, is quite simple and allows you to quickly execute your applications with high availability (or resilience). We will cover both in detail in Chapter 7, Orchestrating with Swarm, and Chapter 8, Deploying Applications with the Kubernetes Orchestrator. There were other competitors in this race, but they fell by the wayside while Kubernetes took the lead.
HashiCorp’s Nomad and Apache’s Mesos are still being used for very special projects but are out of scope for most enterprises and users. Kubernetes and Docker Swarm are community projects and some vendors even include them within their enterprise-ready solutions. Red Hat’s OpenShift, SUSE’s Rancher, Mirantis’ Kubernetes Engine (the former Docker Enterprise platform), and VMware’s Tanzu, among others, all provide on-premises and some cloud-prepared custom Kubernetes platforms. But those who made Kubernetes the most-used platform were the well-known cloud providers. Google, Amazon, Microsoft, and Alibaba, among others, offer their own container orchestration tools, such as Amazon’s Elastic Container Service or Fargate, Google’s Cloud Run, and Microsoft’s Azure Container Instances. They also package and manage their own Kubernetes infrastructures for us to use (Google’s GKE, Amazon’s EKS, Microsoft’s AKS, and so on). They provide Kubernetes-as-a-Service platforms where you only need an account to start deploying your applications. They also offer storage, advanced networking tools, resources for publishing your applications, and even follow-the-sun or worldwide distributed architectures.
There are many Kubernetes implementations. The most popular is probably OpenShift or its open source project, OKD. There are others based on a binary that launches and creates all of the Kubernetes components using automated procedures, such as Rancher RKE (or its government-prepared release, RKE2), and those featuring only the strictly necessary Kubernetes components, such as K3S or K0S, to provide the lightest platform for IoT and more modest hardware. And finally, we have some Kubernetes distributions for desktop computers, offering all the features of Kubernetes ready to develop and test applications with. In this group, we have Docker Desktop, Rancher Desktop, Minikube, and Kubernetes in Docker (KinD). We will learn how to use them in this book to develop, package, and prepare applications for production.
We shouldn’t forget solutions for running orchestrated applications based on multiple containers on standalone servers or desktop computers, such as Docker Compose. Docker originally prepared it as a simple Python-based orchestrator for quick application development, managing the container dependencies for us. It is very convenient for testing all of our components together on a laptop with minimum overhead, instead of running a full Kubernetes or Swarm cluster. This tool has evolved a lot and is now part of the common Docker client command line; we will cover it in Chapter 5, Creating Multi-Container Applications.
Container images
Earlier in this chapter, we mentioned that containers run thanks to container images, which are used as templates for executing processes in isolation and attached to a filesystem; therefore, a container image contains all the files (binaries, libraries, configurations, and so on) required by its processes. These files can be a subset of some operating system or just a few binaries with configurations built by yourself.
Virtual machine templates are immutable, as are container templates. This immutability means that they don’t change between executions. This feature is key because it ensures that we get the same results every time we use an image for creating a container. Container behavior can be changed using configurations or command-line arguments through the container runtime. This ensures that images created by developers will work in production as expected, and moving applications to production (or even creating upgrades between different releases) will be smooth and fast, reducing the time to market.
Container images are a collection of files distributed in layers. We shouldn’t add anything more than the files required by the application. As images are immutable, all these layers will be presented to containerized processes as read-only sets of files. Files are not duplicated between layers: only the files added or modified in a layer are stored in that layer, so each layer records the changes made on top of the layers below it, down to the original base layer (referenced as the base image).
The following diagram shows how we create a container image using multiple layers:
Figure 1.2 – Schema of stacked layers representing a container image
A base layer is always included, although it could be empty. The layers above this base layer may include new binaries or just include new meta-information (which does not create a layer but a meta-information modification).
To easily share these templates between computers or even environments, these file layers are packaged into .tar files, which are finally what we call images. These packages contain all layered files, along with meta-information that describes the content, specifies the process to be executed, identifies the ports that will be exposed to communicate with other containerized processes, specifies the user who will own it, indicates the directories that will be kept out of the container life cycle, and so on.
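To make this meta-information more concrete, here is a small Python sketch of what such a description can look like. The field names follow the config object of the OCI image configuration format, but the values describe an imaginary web server image and are invented for illustration, as is the describe helper.

```python
import json

# Hypothetical metadata for an imaginary web server image. Field names come
# from the OCI image configuration; every value is made up for this example.
IMAGE_CONFIG = {
    "User": "nginx",                          # user who owns the main process
    "ExposedPorts": {"8080/tcp": {}},         # ports offered to other containers
    "Env": ["PATH=/usr/local/bin:/usr/bin:/bin"],
    "Entrypoint": ["nginx"],                  # process to execute...
    "Cmd": ["-g", "daemon off;"],             # ...and its default arguments
    "Volumes": {"/var/cache/nginx": {}},      # kept out of the container life cycle
    "WorkingDir": "/",
}


def describe(config):
    """Summarize the parts of the metadata mentioned in the text."""
    command = (config.get("Entrypoint") or []) + (config.get("Cmd") or [])
    return {
        "process": command,
        "ports": sorted(config.get("ExposedPorts") or {}),
        "user": config.get("User", ""),
        "volumes": sorted(config.get("Volumes") or {}),
    }


if __name__ == "__main__":
    print(json.dumps(describe(IMAGE_CONFIG), indent=2))
```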
We use different methods to create these images, but we aim to make the process reproducible, and thus we use Dockerfiles as recipes. In Chapter 2, Building Container Images, we will learn about the image creation workflow while utilizing best practices and diving deep into command-line options.
These container images are stored on registries. A registry is application software intended to store file layers and meta-information in a centralized location, making it easy to share common layers between different images. This means that two images using a common Debian base image (a subset of files from the complete operating system) will share these base files, thus optimizing disk space usage. The same sharing applies to the local filesystem of the host that runs the containers, saving a lot of space.
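Layer sharing works because layers are identified by a digest of their content, so identical content is stored only once. The Python sketch below illustrates the idea with an in-memory store; LayerStore is a made-up class, the byte strings stand in for real layer archives, and an actual registry adds manifests, authentication, and an HTTP API on top of this.

```python
import hashlib


class LayerStore:
    """Toy content-addressed store: identical layers are kept only once."""

    def __init__(self):
        self.blobs = {}    # digest -> layer content (bytes)
        self.images = {}   # image name -> ordered list of layer digests

    def push(self, image_name, layers):
        """Store an image given as a list of layer contents, base layer first."""
        digests = []
        for content in layers:
            digest = "sha256:" + hashlib.sha256(content).hexdigest()
            # setdefault keeps the existing blob if this digest is already known.
            self.blobs.setdefault(digest, content)
            digests.append(digest)
        self.images[image_name] = digests
        return digests

    def stored_bytes(self):
        """Total size of the unique layers kept in the store."""
        return sum(len(blob) for blob in self.blobs.values())


if __name__ == "__main__":
    store = LayerStore()
    base = b"files from a Debian base image"
    store.push("app-a", [base, b"binaries of application A"])
    store.push("app-b", [base, b"binaries of application B"])
    print(len(store.blobs), "unique layers for 2 images")
```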
Another result of the use of these layers is that containers using the same template image to execute their processes will use the same set of files, and only those files that get modified will be stored.
All these behaviors related to the optimized use of files shared between different images and containers are provided by operating systems thanks to overlay filesystems.
Overlay filesystems
An overlay filesystem is a union mount filesystem (a way of combining multiple directories into one that appears to contain their whole combined content) that combines multiple underlying mount points. This results in a structure with a single directory that contains all underlying files and sub-directories from all sources.
Overlay filesystems merge content from directories, combining the file objects (if any) yielded by different processes, with the upper filesystem taking precedence. This is the magic behind container-image layers’ reusability and disk space saving.
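The precedence rule of a union mount can be modeled with plain dictionaries. In this Python sketch, each layer is a mapping from path to content and collections.ChainMap plays the role of the merged view; the layer contents and the merged_view helper are invented for illustration and do not reflect how a kernel overlay driver is implemented.

```python
from collections import ChainMap

# Two hypothetical layers: path -> file content.
lower = {"/etc/os-release": "debian", "/usr/bin/app": "v1"}
upper = {"/usr/bin/app": "v2", "/etc/app.conf": "port=8080"}


def merged_view(*layers_top_first):
    """Return a single view over several layers; the first layer wins."""
    return ChainMap(*layers_top_first)


if __name__ == "__main__":
    view = merged_view(upper, lower)
    for path in sorted(view):
        print(path, "->", view[path])
```

Looking up /usr/bin/app in the merged view returns the version from the upper layer, while files that exist only in the lower layer remain visible.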
Now that we understand how images are packaged and how they share content, let’s go back to learning a bit more about containers. As you may have learned in this section, containers are processes that run in isolation on top of a host operating system thanks to a container runtime. Although the host kernel is shared by multiple containers, features such as kernel namespaces and cgroups provide special containment layers that allow us to isolate them. Container processes need some files to work, which are included in the container space as immutable templates. As you may think, these processes will probably need to modify files found on the container image layers or create new ones, and a new read-write layer will be used to store these changes. The container runtime presents this new layer to the container to enable changes; we usually refer to this as the container layer.
The following schema outlines the read-only layers coming from the container image template with the newly added read-write container layer, where the container’s running processes store their file modifications:
Figure 1.3 – Container image layers will always be read-only; the container adds a new layer with read-write capabilities
The changes made by container processes are always ephemeral as the container layer will be lost whenever we remove the container, while image layers are immutable and will remain unchanged. With this behavior in mind, it is easy to understand that we can run multiple containers using the same container image.
The following figure represents this situation where three different running containers were created from the same image:
Figure 1.4 – Three different containers run using the same container image
As you may have noticed, this behavior leaves a very small footprint on our operating systems in terms of disk space. Container layers are very small (or at least they should be, and you as a developer will learn which files shouldn’t be left inside the container life cycle).
Container runtimes manage how these overlay folders will be included inside containers and the magic behind that. The mechanism for this is based on specific operating system drivers that implement copy-on-write filesystems. Layers are arranged one on top of the other and only files modified within them are merged on the upper layer. This process is managed at speed by operating system drivers, but some small overhead is always expected, so keep in mind that all files that are modified continuously by your application (logs, for example) should never be part of the container.
Important note
Copy-on-write uses small layered filesystems or folders. Files from any layer can be read directly, but writing requires searching for the file within the underlying layers and copying it to the upper layer to store the changes. Therefore, the I/O overhead from reading files is very small and we can keep multiple layers for better file distribution between containers. In contrast, writing requires more resources and it would be better to leave big files and those subject to many or continuous modifications out of the container layer.
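The copy-on-write behavior described in this note can be modeled in a few lines. The Python sketch below is a conceptual illustration only: CopyOnWriteFS is a hypothetical class, layers are dictionaries instead of directories, and the copy_ups counter exists just to make the extra work of the first write visible.

```python
class CopyOnWriteFS:
    """Toy model of a container: read-only image layers plus one writable layer."""

    def __init__(self, image_layers):
        # image_layers: list of dicts (path -> content), top-most first.
        # They are shared between containers and never modified here.
        self.image_layers = image_layers
        self.container_layer = {}
        self.copy_ups = 0

    def read(self, path):
        """Return the content from the first layer that has the file."""
        if path in self.container_layer:
            return self.container_layer[path]
        for layer in self.image_layers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        """Append data to a file, copying it up on the first write."""
        if path not in self.container_layer:
            try:
                # Copy-up: search the image layers and copy the file upward.
                self.container_layer[path] = self.read(path)
                self.copy_ups += 1
            except FileNotFoundError:
                # New file: it only ever exists in the container layer.
                self.container_layer[path] = ""
        self.container_layer[path] += data


if __name__ == "__main__":
    image = [{"/app/log": "start\n"}, {"/etc/conf": "a=1"}]
    first, second = CopyOnWriteFS(image), CopyOnWriteFS(image)
    first.append("/app/log", "request served\n")
    print(repr(first.read("/app/log")), repr(second.read("/app/log")))
```

Two containers created from the same image see each other's files only through the shared read-only layers, so a write in one of them never changes what the other reads.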
It is also important to notice that containers are not ephemeral at all. As mentioned previously, changes in the container layer are retained until the container is removed from the operating system; so, if you create a 10 GB file in the container layer, it will reside on your host’s disk. Container orchestrators manage this behavior, but be careful where you store your persistent files. Administrators should do container housekeeping and disk maintenance to avoid disk-pressure problems.
Developers should keep this in mind and prepare their applications using containers to be logically ephemeral and store persistent data outside the container’s layers. We will learn about options for persistence in Chapter 10, Leveraging Application Data Management in Kubernetes.
This thinking leads us to the next section, where we will discuss the intrinsic dynamism of container environments.
Understanding dynamism in container-based applications
We have seen how containers run using immutable storage (container images) and how the container runtime adds a new layer for managing changed files. Although we mentioned in the previous section that containers are not ephemeral in terms of disk usage, we have to treat them as ephemeral in our application’s design. Containers will start and stop whenever you upgrade your application’s components. Whenever you change the base image, a completely new container will be created (remember the layers ecosystem described in the previous section). This becomes even more noticeable if you want to distribute these application components across a cluster, because even using the same image will result in different containers being created on different hosts. Thus, this dynamism is inherent to these platforms.
In the context of networking communications inside containers, we know that processes running inside a container share its network namespace, and thus they all get the same network stack and IP address. But every time a new container is created, the container runtime will provide a new IP address. Thanks to container orchestration and its built-in Domain Name System (DNS), we can still communicate with our containers. As IP addresses are dynamically managed by the container runtime’s internal IP Address Management (IPAM) using defined pools, every time a container dies (whether the main process is stopped, killed manually, or ended by an error), it will free its IP address and IPAM will assign it to a new container that might be part of a completely different application. Hence, we can trust the IP address assignment, but we shouldn’t use container IP addresses in our application configurations (or even worse, write them in our code, which is a bad practice in every scenario). By default, IP addresses are managed dynamically by the container runtime’s IPAM component. We will learn about better mechanisms we can use to reference our application’s containers, such as service names, in Chapter 4, Running Docker Containers.
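The following Python sketch shows why a container IP address is a poor identifier. SimpleIPAM is an invented, much simplified pool manager based on the standard ipaddress module; its reuse policy (a freed address is handed out again first) is an assumption chosen to make the effect obvious, and real IPAM components apply their own policies.

```python
import ipaddress


class SimpleIPAM:
    """Toy IP address manager backed by a fixed pool."""

    def __init__(self, cidr):
        # All usable host addresses of the network, in ascending order.
        self.free = list(ipaddress.ip_network(cidr).hosts())
        self.assigned = {}  # container name -> address

    def allocate(self, container):
        """Give the next free address to a container."""
        address = self.free.pop(0)
        self.assigned[container] = address
        return address

    def release(self, container):
        """Return a container's address to the front of the pool."""
        self.free.insert(0, self.assigned.pop(container))


if __name__ == "__main__":
    ipam = SimpleIPAM("172.18.0.0/29")
    print("web gets", ipam.allocate("web"))
    ipam.release("web")  # the web container dies
    print("db gets ", ipam.allocate("db"))  # same address, different application
```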
Applications should reference each other using fully qualified domain names (or short names if we are using internal domain communications, as we will learn when we use Docker Compose to run multi-container applications, and also when applications run in more complicated container orchestrations).
Because IP addresses are dynamic, special resources should be used to assign sets of IP addresses (or unique IP addresses, if we have just one process replica) to service names. In the same way, publishing application components requires some resource mappings, using network address translation (NAT) for communicating between users and external services and those running inside containers, distributed across a cluster in different servers or even different infrastructures (such as cloud-provided container orchestrators, for example).
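The mapping between a stable service name and a changing set of IP addresses can be sketched as a small registry. This Python example is purely illustrative: ServiceRegistry and its methods are hypothetical names, and orchestrators implement this through their own DNS and service resources rather than an in-process dictionary.

```python
from collections import defaultdict


class ServiceRegistry:
    """Toy mapping from a service name to the addresses of its replicas."""

    def __init__(self):
        self._endpoints = defaultdict(list)

    def register(self, service, address):
        """Add the address of a new replica to a service."""
        self._endpoints[service].append(address)

    def deregister(self, service, address):
        """Remove the address of a replica that no longer exists."""
        self._endpoints[service].remove(address)

    def resolve(self, service):
        """Return the current addresses behind a service name."""
        return list(self._endpoints.get(service, []))


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.register("web", "10.0.0.2")
    registry.register("web", "10.0.0.3")
    registry.deregister("web", "10.0.0.2")   # a replica is replaced...
    registry.register("web", "10.0.0.9")     # ...and comes back elsewhere
    print(registry.resolve("web"))
```

Clients keep using the name web throughout, regardless of how many times the addresses behind it change.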
Since we’re reviewing the main concepts related to containers in this chapter, we can’t miss out on the tools that are used for creating, executing, and sharing containers.
Tools for managing containers
As we learned previously, the container runtime will manage most of the actions we can achieve with containers. Most of these runtimes run as daemons and provide an interface for interacting with them. Among these tools, Docker stands out as it provides all the tools in a box. Docker acts as a client-server application and in newer releases, both the client and server components are packaged separately, but in any case, both are needed by users. At first, when Docker Engine was the most popular and reliable container engine, Kubernetes adopted it as its runtime. But this marriage did not last long: built-in support for Docker Engine was deprecated in Kubernetes release 1.20 and removed in release 1.24. This happened because Docker manages its own integration of containerd, which is neither standardized nor directly usable through the Kubernetes CRI. Despite this fact, Docker is still the most widely used option for developing container-based applications and the de facto standard for building images.
We mentioned Docker Desktop and Rancher Desktop earlier in this section. Both act as container runtime clients that use either the docker or nerdctl command lines. We can use such clients because in both cases, dockerd or containerd act as container runtimes.
Developers and the wider community pushed Docker to provide a solution for users who prefer to run containers without having to run a privileged system daemon, which is dockerd’s default behavior. It took some time but finally, a few years ago, Docker published its rootless runtime with user privileges. During this development phase, another container executor arrived, called Podman, created by Red Hat to solve the same problem. This solution can run without root privileges and aims to avoid the use of a daemonized container runtime. The host user can run containers without any system privilege by default; only a few tweaks are required by administrators if the containers are to be run in a security-hardened environment. This made Podman a very secure option for running containers in production (without orchestration). Docker also included rootless containers by the end of 2019, making both options secure by default.
As you learned at the beginning of this section, containers are processes that run on top of an operating system, isolated using its kernel features. It is quite evident why containers are so popular in microservice environments (one container runs a process, which is ultimately a microservice), although we can still build microservice-based applications without containers. It is also possible to use containers to run whole application components together, although this isn’t an ideal situation.
Important note
In this chapter, we’ll largely focus on software containers in the context of Linux operating systems. This is because they were only introduced in Windows systems much later. However, we will also briefly discuss them in the context of Windows.
We shouldn’t compare containers with virtual nodes. As discussed earlier in this section, containers are mainly based on cgroups and kernel namespaces while virtual nodes are based on hypervisor software. This software provides sandboxing capabilities and specific virtualized hardware resources to guest hosts. We still need to prepare operating systems to run these virtual guest hosts. Each guest node will receive a piece of virtualized hardware and we must manage servers’ interactions as if they were physical.
We’ll compare these models side by side in the following section.