Namespaces and cgroups

Users logged into a Linux system have a transparent view of various system entities such as global resources, processes, kernel, and users. For instance, a valid user can access PIDs of all running processes on the system (irrespective of the user to which they belong). Users can observe the presence of other users on the system, and they can run commands to view the state of global system global resources such as memory, filesystem mounts, and devices. Such operations are not deemed as intrusions or considered security breaches, as it is always guaranteed that one user/process can never intrude into other user/process.

However, such transparency is unwarranted on a few server platforms. For instance, consider cloud service providers offering PaaS (platform as a service). They offer an environment to host and deploy custom client applications. They manage runtime, storage, operating system, middleware, and networking services, leaving customers to manage their applications and data. PaaS services are used by various e-commerce, financial, online gaming, and other related enterprises.

For efficient and effective isolation and resource management for clients, PaaS service providers use various tools. They virtualize the system environment for each client to achieve security, reliability, and robustness. The Linux kernel provides low-level mechanisms in the form of cgroups and namespaces for building various lightweight tools that can virtualize the system environment. Docker is one such framework that builds on cgroups and namespaces.

Namespaces fundamentally are mechanisms to abstract, isolate, and limit the visibility that a group of processes has over various system entities such as process trees, network interfaces, user IDs, and filesystem mounts. Namespaces are categorized into several groups, which we will now see.

Mount namespaces

Traditionally, mount and unmount operations will change the filesystem view as seen by all processes in the system; in other words, there is one global mount namespace seen by all processes. The mount namespaces confine the set of filesystem mount points visible within a process namespace, enabling one process group in a mount namespace to have an exclusive view of the filesystem list compared to another process.

UTS namespaces

These enable isolating the system's host and domain name within a uts namespace. This makes initialization and configuration scripts able to be guided based on the respective namespaces.

IPC namespaces

These demarcate processes from using System V and POSIX message queues. This prevents one process from an ipc namespace accessing the resources of another.

PID namespaces

Traditionally, *nix kernels (including Linux) spawn the init process with PID 1 during system boot, which in turn starts other user-mode processes and is considered the root of the process tree (all the other processes start below this process in the tree). The PID namespace allows a process to spin off a new tree of processes under it with its own root process (PID 1 process). PID namespaces isolate process ID numbers, and allow duplication of PID numbers across different PID namespaces, which means that processes in different PID namespaces can have the same process ID. The process IDs within a PID namespace are unique, and are assigned sequentially starting with PID 1.

PID namespaces are used in containers (lightweight virtualization solution) to migrate a container with a process tree, onto a different host system without any changes to PIDs.

Network namespaces

This type of namespace provides abstraction and virtualization of network protocol services and interfaces. Each network namespace will have its own network device instances that can be configured with individual network addresses. Isolation is enabled for other network services: routing table, port number, and so on.

User namespaces

User namespaces allow a process to use unique user and group IDs within and outside a namespace. This means that a process can use privileged user and group IDs (zero) within a user namespace and continue with non-zero user and group IDs outside the namespace.

Cgroup namespaces

A cgroup namespace virtualizes the contents of the /proc/self/cgroup file. Processes inside a cgroup namespace are only able to view paths relative to their namespace root.

Control groups (cgroups)

Cgroups are kernel mechanisms to restrict and measure resource allocations to each process group. Using cgroups, you can allocate resources such as CPU time, network, and memory.

Similar to the process model in Linux, where each process is a child to a parent and relatively descends from the init process thus forming a single-tree like structure, cgroups are hierarchical, where child cgroups inherit the attributes of the parent, but what makes is different is that multiple cgroup hierarchies can exist within a single system, with each having distinct resource prerogatives.

Applying cgroups on namespaces results in isolation of processes into containers within a system, where resources are managed distinctly. Each container is a lightweight virtual machine, all of which run as individual entities and are oblivious of other entities within the same system.

The following are namespace APIs described in the Linux man page for namespaces:

clone(2)
The clone(2) system call creates a new process. If the flags argument of the call specifies one or more of the CLONE_NEW* flags listed below, then new namespaces are created for each flag, and the child process is made a member of those namespaces.(This system call also implements a number of features unrelated to namespaces.)

setns(2)
The setns(2) system call allows the calling process to join an existing namespace. The namespace to join is specified via a file descriptor that refers to one of the /proc/[pid]/ns files described below.

unshare(2)
The unshare(2) system call moves the calling process to a new namespace. If the flags argument of the call specifies one or more of the CLONE_NEW* flags listed below, then new namespaces are created for each flag, and the calling process is made a member of those namespaces. (This system call also implements a number of features unrelated to namespaces.)
Namespace   Constant          Isolates
Cgroup      CLONE_NEWCGROUP   Cgroup root directory
IPC         CLONE_NEWIPC      System V IPC, POSIX message queues
Network     CLONE_NEWNET      Network devices, stacks, ports, etc.
Mount       CLONE_NEWNS       Mount points
PID         CLONE_NEWPID      Process IDs
User        CLONE_NEWUSER     User and group IDs
UTS         CLONE_NEWUTS      Hostname and NIS domain name