Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Containerization with LXC
Containerization with LXC

Containerization with LXC: Build, manage, and configure Linux containers

eBook
R$49.99 R$245.99
Paperback
R$306.99
Subscription
Free Trial
Renews at R$50p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

Containerization with LXC

Chapter 1. Introduction to Linux Containers

Nowadays, deploying applications inside some sort of a Linux container is a widely adopted practice, primarily due to the evolution of the tooling and the ease of use it presents. Even though Linux containers, or operating-system-level virtualization, in one form or another, have been around for more than a decade, it took some time for the technology to mature and enter mainstream operation. One of the reasons for this is the fact that hypervisor-based technologies such as KVM and Xen were able to solve most of the limitations of the Linux kernel during that period and the overhead it presented was not considered an issue. However, with the advent of kernel namespaces and control groups (cgroups) the notion of a light-weight virtualization became possible through the use of containers.

In this chapter, I'll cover the following topics:

  • Evolution of the OS kernel and its early limitations
  • Differences between containers and platform virtualization
  • Concepts and terminology related to namespaces and cgroups
  • An example use of process resource isolation and management with network namespaces and cgroups

The OS kernel and its early limitations

The current state of Linux containers is a direct result of the problems that early OS designers were trying to solve – managing memory, I/O, and process scheduling in the most efficient way.

In the past, only a single process could be scheduled for work, wasting precious CPU cycles if blocked on an I/O operation. The solution to this problem was to develop better CPU schedulers, so more work can be allocated in a fair way for maximum CPU utilization. Even though the modern schedulers, such as the Completely Fair Scheduler (CFS) in Linux do a great job of allocating fair amounts of time to each process, there's still a strong case for being able to give higher or lower priority to a process and its subprocesses. Traditionally, this can be accomplished by the nice() system call, or real-time scheduling policies, however, there are limitations to the level of granularity or control that can be achieved.

Similarly, before the advent of virtual memory, multiple processes would allocate memory from a shared pool of physical memory. The virtual memory provided some form of memory isolation per process, in the sense that processes would have their own address space, and extend the available memory by means of a swap, but still there wasn't a good way of limiting how much memory each process and its children can use.

To further complicate the matter, running different workloads on the same physical server usually resulted in a negative impact on all running services. A memory leak or a kernel panic could cause one application to bring the entire operating system down. For example, a web server that is mostly memory bound and a database service that is I/O heavy running together became problematic. In an effort to avoid such scenarios, system administrators would separate the various applications between a pool of servers, leaving some machines underutilized, especially at certain times during the day, when there was not much work to be done. This is a similar problem as a single running process blocked on I/O operation is a waste of CPU and memory resources.

The solution to these problems is the use of hypervisor based virtualization, containers, or the combination of both.

The case for Linux containers

The hypervisor as part of the operating system is responsible for managing the life cycle of virtual machines, and has been around since the early days of mainframe machines in the late 1960s. Most modern virtualization implementations, such as Xen and KVM, can trace their origins back to that era. The main reason for the wide adoption of these virtualization technologies around 2005 was the need to better control and utilize the ever-growing clusters of compute resources. The inherited security of having an extra layer between the virtual machine and the host OS was a good selling point for the security minded, though as with any other newly adopted technology there were security incidents.

Nevertheless, the adoption of full virtualization and paravirtulization significantly improved the way servers are utilized and applications provisioned. In fact, virtualization such as KVM and Xen is still widely used today, especially in multitenant clouds and cloud technologies such as OpenStack.

Hypervisors provide the following benefits, in the context of the problems outlined earlier:

  • Ability to run different operating systems on the same physical server
  • More granular control over resource allocation
  • Process isolation – a kernel panic on the virtual machine will not effect the host OS
  • Separate network stack and the ability to control traffic per virtual machine
  • Reduce capital and operating cost, by simplification of data center management and better utilization of available server resources

Arguably the main reason against using any sort of virtualization technology today is the inherited overhead of using multiple kernels in the same OS. It would be much better, in terms of complexity, if the host OS can provide this level of isolation, without the need for hardware extensions in the CPU, or the use of emulation software such as QEMU, or even kernel modules such as KVM. Running an entire operating system on a virtual machine, just to achieve a level of confinement for a single web server, is not the most efficient allocation of resources.

Over the last decade, various improvements to the Linux kernel were made to allow for similar functionality, but with less overhead – most notably the kernel namespaces and cgroups. One of the first notable technologies to leverage those changes was LXC, since kernel 2.6.24 and around the 2008 time frame. Even though LXC is not the oldest container technology, it helped fuel the container revolution we see today.

The main benefits of using LXC include:

  • Lesser overheads and complexity than running a hypervisor
  • Smaller footprint per container
  • Start times in the millisecond range
  • Native kernel support

It is worth mentioning that containers are not inherently as secure as having a hypervisor between the virtual machine and the host OS. However, in recent years, great progress has been made to narrow that gap using Mandatory Access Control (MAC) technologies such as SELinux and AppArmor, kernel capabilities, and cgroups, as demonstrated in later chapters.

Linux namespaces – the foundation of LXC

Namespaces are the foundation of lightweight process virtualization. They enable a process and its children to have different views of the underlying system. This is achieved by the addition of the unshare() and setns() system calls, and the inclusion of six new constant flags passed to the clone(), unshare(), and setns() system calls:

  • clone(): This creates a new process and attaches it to a new specified namespace
  • unshare(): This attaches the current process to a new specified namespace
  • setns(): This attaches a process to an already existing namespace

There are six namespaces currently in use by LXC, with more being developed:

  • Mount namespaces, specified by the CLONE_NEWNS flag
  • UTS namespaces, specified by the CLONE_NEWUTS flag
  • IPC namespaces, specified by the CLONE_NEWIPC flag
  • PID namespaces, specified by the CLONE_NEWPID flag
  • User namespaces, specified by the CLONE_NEWUSER flag
  • Network namespaces, specified by the CLONE_NEWNET flag

Let's have a look at each in more detail and see some userspace examples, to help us better understand what happens under the hood.

Mount namespaces

Mount namespaces first appeared in kernel 2.4.19 in 2002 and provided a separate view of the filesystem mount points for the process and its children. When mounting or unmounting a filesystem, the change will be noticed by all processes because they all share the same default namespace. When the CLONE_NEWNS flag is passed to the clone() system call, the new process gets a copy of the calling process mount tree that it can then change without affecting the parent process. From that point on, all mounts and unmounts in the default namespace will be visible in the new namespace, but changes in the per-process mount namespaces will not be noticed outside of it.

The clone() prototype is as follows:

#define _GNU_SOURCE 
#include <sched.h> 
int clone(int (*fn)(void *), void *child_stack, int flags, void *arg); 

An example call that creates a child process in a new mount namespace looks like this:

child_pid = clone(childFunc, child_stack + STACK_SIZE, CLONE_NEWNS | SIGCHLD, argv[1]); 

When the child process is created, it executes the childFunc function, which will perform its work in the new mount namespace.

The util-linux package provides userspace tools that implement the unshare() call, which effectively unshares the indicated namespaces from the parent process.

To illustrate this:

  1. First open a terminal and create a directory in /tmp as follows:
    root@server:~# mkdir /tmp/mount_ns
    root@server:~#
    
  2. Next, move the current bash process to its own mount namespace by passing the mount flag to unshare:
    root@server:~# unshare -m /bin/bash
    root@server:~#
    
  3. The bash process is now in a separate namespace. Let's check the associated inode number of the namespace:
    root@server:~# readlink /proc/$$/ns/mnt
    mnt:[4026532211]
    root@server:~#
    
  4. Next, create a temporary mount point:
    root@server:~# mount -n -t tmpfs tmpfs /tmp/mount_ns
    root@server:~#
    
  5. Also, make sure you can see the mount point from the newly created namespace:
    root@server:~# df -h | grep mount_ns
    tmpfs           3.9G     0  3.9G   0% /tmp/mount_ns
    
    root@server:~# cat /proc/mounts | grep mount_ns
    tmpfs /tmp/mount_ns tmpfs rw,relatime 0 0
    root@server:~#
    

    As expected, the mount point is visible because it is part of the namespace we created and the current bash process is running from.

  6. Next, start a new terminal session and display the namespace inode ID from it:
    root@server:~# readlink /proc/$$/ns/mnt
    mnt:[4026531840]
    root@server:~#
    

    Notice how it's different from the mount namespace on the other terminal.

  7. Finally, check if the mount point is visible in the new terminal:
    root@server:~# cat /proc/mounts | grep mount_ns
    root@server:~# df -h | grep mount_ns
    root@server:~#
    

Not surprisingly, the mount point is not visible from the default namespace.

Note

In the context of LXC, mount namespaces are useful because they provide a way for a different filesystem layout to exist inside the container. It's worth mentioning that before the mount namespaces, a similar process confinement could be achieved with the chroot() system call, however chroot does not provide the same per-process isolation as mount namespaces do.

UTS namespaces

Unix Timesharing (UTS) namespaces provide isolation for the hostname and domain name, so that each LXC container can maintain its own identifier as returned by the hostname -f command. This is needed for most applications that rely on a properly set hostname.

To create a bash session in a new UTS namespace, we can use the unshare utility again, which uses the unshare() system call to create the namespace and the execve() system call to execute bash in it:

root@server:~# hostname
server
root@server:~# unshare -u /bin/bash
root@server:~# hostname uts-namespace
root@server:~# hostname
uts-namespace
root@server:~# cat /proc/sys/kernel/hostname
uts-namespace
root@server:~#

As the preceding output shows, the hostname inside the namespace is now uts-namespace.

Next, from a different terminal, check the hostname again to make sure it has not changed:

root@server:~# hostname
server
root@server:~#

As expected, the hostname only changed in the new UTS namespace.

To see the actual system calls that the unshare command uses, we can run the strace utility:

root@server:~# strace -s 2000 -f unshare -u /bin/bash
...
unshare(CLONE_NEWUTS)                   = 0
getgid()                                = 0
setgid(0)                               = 0
getuid()                                = 0
setuid(0)                               = 0
execve("/bin/bash", ["/bin/bash"], [/* 15 vars */]) = 0
...

From the output we can see that the unshare command is indeed using the unshare() and execve() system calls and the CLONE_NEWUTS flag to specify new UTS namespace.

IPC namespaces

The Interprocess Communication (IPC) namespaces provide isolation for a set of IPC and synchronization facilities. These facilities provide a way of exchanging data and synchronizing the actions between threads and processes. They provide primitives such as semaphores, file locks, and mutexes among others, that are needed to have true process separation in a container.

PID namespaces

The Process ID (PID) namespaces provide the ability for a process to have an ID that already exists in the default namespace, for example an ID of 1. This allows for an init system to run in a container with various other processes, without causing a collision with the rest of the PIDs on the same OS.

To demonstrate this concept, open up pid_namespace.c:

#define _GNU_SOURCE 
#include <stdlib.h> 
#include <stdio.h> 
#include <signal.h> 
#include <sched.h> 
 
static int childFunc(void *arg) 
{ 
    printf("Process ID in child  = %ld\n", (long) getpid()); 
} 

First, we include the headers and define the childFunc function that the clone() system call will use. The function prints out the child PID using the getpid() system call:

static char child_stack[1024*1024]; 
 
int main(int argc, char *argv[]) 
{ 
    pid_t child_pid; 
 
    child_pid = clone(childFunc, child_stack + 
    (1024*1024),      
    CLONE_NEWPID | SIGCHLD, NULL); 
 
    printf("PID of cloned process: %ld\n", (long) child_pid); 
    waitpid(child_pid, NULL, 0); 
    exit(EXIT_SUCCESS); 
} 

In the main() function, we specify the stack size and call clone(), passing the child function childFunc, the stack pointer, the CLONE_NEWPID flag, and the SIGCHLD signal. The CLONE_NEWPID flag instructs clone() to create a new PID namespace and the SIGCHLD flag notifies the parent process when one of its children terminates. The parent process will block on waitpid() if the child process has not terminated.

Compile and then run the program with the following:

root@server:~# gcc pid_namespace.c -o pid_namespace
root@server:~# ./pid_namespace
PID of cloned process: 17705
Process ID in child  = 1
root@server:~#

From the output, we can see that the child process has a PID of 1 inside its namespace and 17705 otherwise.

Note

Note that error handling has been omitted from the code examples for brevity.

User namespaces

The user namespaces allow a process inside a namespace to have a different user and group ID than that in the default namespace. In the context of LXC, this allows for a process to run as root inside the container, while having a non-privileged ID outside. This adds a thin layer of security, because braking out for the container will result in a non-privileged user. This is possible because of kernel 3.8, which introduced the ability for non-privileged processes to create user namespaces.

To create a new user namespace as a non-privileged user and have root inside, we can use the unshare utility. Let's install the latest version from source:

root@ubuntu:~# cd /usr/src/
root@ubuntu:/usr/src# wget https://www.kernel.org/pub/linux/utils/util-linux/v2.28/util-linux-2.28.tar.gz
root@ubuntu:/usr/src# tar zxfv util-linux-2.28.tar.gz
root@ubuntu:/usr/src# cd util-linux-2.28/
root@ubuntu:/usr/src/util-linux-2.28# ./configure
root@ubuntu:/usr/src/util-linux-2.28# make && make install
root@ubuntu:/usr/src/util-linux-2.28# unshare --map-root-user --user sh -c whoami
root
root@ubuntu:/usr/src/util-linux-2.28#

We can also use the clone() system call with the CLONE_NEWUSER flag to create a process in a user namespace, as demonstrated by the following program:

#define _GNU_SOURCE 
#include <stdlib.h> 
#include <stdio.h> 
#include <signal.h> 
#include <sched.h> 
 
static int childFunc(void *arg) 
{ 
    printf("UID inside the namespace is %ld\n", (long) 
    geteuid()); 
    printf("GID inside the namespace is %ld\n", (long) 
    getegid()); 
} 
 
static char child_stack[1024*1024]; 
 
int main(int argc, char *argv[]) 
{ 
    pid_t child_pid; 
 
    child_pid = clone(childFunc, child_stack +  
    (1024*1024),        
    CLONE_NEWUSER | SIGCHLD, NULL); 
 
    printf("UID outside the namespace is %ld\n", (long)       
    geteuid()); 
    printf("GID outside the namespace is %ld\n", (long)      
    getegid()); 
    waitpid(child_pid, NULL, 0); 
    exit(EXIT_SUCCESS); 
} 

After compilation and execution, the output looks similar to this when run as root - UID of 0:

root@server:~# gcc user_namespace.c -o user_namespace
root@server:~# ./user_namespace
UID outside the namespace is 0
GID outside the namespace is 0
UID inside the namespace is 65534
GID inside the namespace is 65534
root@server:~#

Network namespaces

Network namespaces provide isolation of the networking resources, such as network devices, addresses, routes, and firewall rules. This effectively creates a logical copy of the network stack, allowing multiple processes to listen on the same port from multiple namespaces. This is the foundation of networking in LXC and there are quite a lot of other use cases where this can come in handy.

The iproute2 package provides very useful userspace tools that we can use to experiment with the network namespaces, and is installed by default on almost all Linux systems.

There's always the default network namespace, referred to as the root namespace, where all network interfaces are initially assigned. To list the network interfaces that belong to the default namespace run the following command:

root@server:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
root@server:~#

In this case, there are two interfaces – lo and eth0.

To list their configuration, we can run the following:

root@server:~# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
    valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
inet 10.1.32.40/24 brd 10.1.32.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::cd5:eff:feb0:a347/64 scope link
    valid_lft forever preferred_lft forever
root@server:~#

Also, to list the routes from the root network namespace, execute the following:

root@server:~# ip r s
default via 10.1.32.1 dev eth0
10.1.32.0/24 dev eth0  proto kernel  scope link  src 10.1.32.40
root@server:~#

Let's create two new network namespaces called ns1 and ns2 and list them:

root@server:~# ip netns add ns1
root@server:~# ip netns add ns2
root@server:~# ip netns
ns2
ns1
root@server:~#

Now that we have the new network namespaces, we can execute commands inside them:

root@server:~# ip netns exec ns1 ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@server:~#

The preceding output shows that in the ns1 namespace, there's only one network interface, the loopback - lo interface, and it's in a DOWN state.

We can also start a new bash session inside the namespace and list the interfaces in a similar way:

root@server:~# ip netns exec ns1 bash
root@server:~# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@server:~# exit
root@server:~#

This is more convenient for running multiple commands than specifying each, one at a time. The two network namespaces are not of much use if not connected to anything, so let's connect them to each other. To do this we'll use a software bridge called Open vSwitch.

Open vSwitch works just as a regular network bridge and then it forwards frames between virtual ports that we define. Virtual machines such as KVM, Xen, and LXC or Docker containers can then be connected to it.

Most Debian-based distributions such as Ubuntu provide a package, so let's install that:

root@server:~# apt-get install -y openvswitch-switch
root@server:~#

This installs and starts the Open vSwitch daemon. Time to create the bridge; we'll name it OVS-1:

root@server:~# ovs-vsctl add-br OVS-1
root@server:~# ovs-vsctl show
0ea38b4f-8943-4d5b-8d80-62ccb73ec9ec
Bridge "OVS-1"
    Port "OVS-1"
        Interface "OVS-1"
            type: internal
ovs_version: "2.0.2"
root@server:~#

Note

If you would like to experiment with the latest version of Open vSwitch, you can download the source code from http://openvswitch.org/download/ and compile it.

The newly created bridge can now be seen in the root namespace:

root@server:~# ip a s OVS-1
4: OVS-1: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
link/ether 9a:4b:56:97:3b:46 brd ff:ff:ff:ff:ff:ff
inet6 fe80::f0d9:78ff:fe72:3d77/64 scope link
   valid_lft forever preferred_lft forever
root@server:~#

In order to connect both network namespaces, let's first create a virtual pair of interfaces for each namespace:

root@server:~# ip link add eth1-ns1 type veth peer name veth-ns1
root@server:~# ip link add eth1-ns2 type veth peer name veth-ns2
root@server:~#

The preceding two commands create four virtual interfaces eth1-ns1, eth1-ns2 and veth-ns1, veth-ns2. The names are arbitrary.

To list all interfaces that are part of the root network namespace, run:

root@server:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
link/ether 82:bf:52:d3:de:7e brd ff:ff:ff:ff:ff:ff
4: OVS-1: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether 9a:4b:56:97:3b:46 brd ff:ff:ff:ff:ff:ff
5: veth-ns1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 1a:7c:74:48:73:a9 brd ff:ff:ff:ff:ff:ff
6: eth1-ns1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 8e:99:3f:b8:43:31 brd ff:ff:ff:ff:ff:ff
7: veth-ns2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 5a:0d:34:87:ea:96 brd ff:ff:ff:ff:ff:ff
8: eth1-ns2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether fa:71:b8:a1:7f:85 brd ff:ff:ff:ff:ff:ff
root@server:~#

Let's assign the eth1-ns1 and eth1-ns2 interfaces to the ns1 and ns2 namespaces:

root@server:~# ip link set eth1-ns1 netns ns1
root@server:~# ip link set eth1-ns2 netns ns2

Also, confirm they are visible from inside each network namespace:

root@server:~# ip netns exec ns1 ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
6: eth1-ns1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 8e:99:3f:b8:43:31 brd ff:ff:ff:ff:ff:ff
root@server:~#
root@server:~# ip netns exec ns2 ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
8: eth1-ns2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether fa:71:b8:a1:7f:85 brd ff:ff:ff:ff:ff:ff
root@server:~#

Notice, how each network namespace now has two interfaces assigned – loopback and eth1-ns*.

If we list the devices from the root namespace, we should see that the interfaces we just moved to ns1 and ns2 namespaces are no longer visible:

root@server:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 0e:d5:0e:b0:a3:47 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
link/ether 82:bf:52:d3:de:7e brd ff:ff:ff:ff:ff:ff
4: OVS-1: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether 9a:4b:56:97:3b:46 brd ff:ff:ff:ff:ff:ff
5: veth-ns1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 1a:7c:74:48:73:a9 brd ff:ff:ff:ff:ff:ff
7: veth-ns2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 5a:0d:34:87:ea:96 brd ff:ff:ff:ff:ff:ff
root@server:~#

It's time to connect the other end of the two virtual pipes, the veth-ns1 and veth-ns2 interfaces to the bridge:

root@server:~# ovs-vsctl add-port OVS-1 veth-ns1
root@server:~# ovs-vsctl add-port OVS-1 veth-ns2
root@server:~# ovs-vsctl show
0ea38b4f-8943-4d5b-8d80-62ccb73ec9ec
Bridge "OVS-1"
    Port "OVS-1"
        Interface "OVS-1"
            type: internal
    Port "veth-ns1"
        Interface "veth-ns1"
    Port "veth-ns2"
        Interface "veth-ns2"
ovs_version: "2.0.2"
root@server:~#

From the preceding output, it's apparent that the bridge now has two ports, veth-ns1 and veth-ns2.

The last thing left to do is bring the network interfaces up and assign IP addresses:

root@server:~# ip link set veth-ns1 up
root@server:~# ip link set veth-ns2 up
root@server:~# ip netns exec ns1 ip link set dev lo up
root@server:~# ip netns exec ns1 ip link set dev eth1-ns1 up
root@server:~# ip netns exec ns1 ip address add 192.168.0.1/24 dev eth1-ns1
root@server:~# ip netns exec ns1 ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
6: eth1-ns1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 8e:99:3f:b8:43:31 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.1/24 scope global eth1-ns1
   valid_lft forever preferred_lft forever
inet6 fe80::8c99:3fff:feb8:4331/64 scope link
   valid_lft forever preferred_lft forever
root@server:~#

Similarly for the ns2 namespace:

root@server:~# ip netns exec ns2 ip link set dev lo up
root@server:~# ip netns exec ns2 ip link set dev eth1-ns2 up
root@server:~# ip netns exec ns2 ip address add 192.168.0.2/24 dev eth1-ns2
root@server:~# ip netns exec ns2 ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
8: eth1-ns2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether fa:71:b8:a1:7f:85 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.2/24 scope global eth1-ns2
   valid_lft forever preferred_lft forever
inet6 fe80::f871:b8ff:fea1:7f85/64 scope link
   valid_lft forever preferred_lft forever
root@server:~#
Network namespaces

With this, we established a connection between both ns1 and ns2 network namespaces through the Open vSwitch bridge. To confirm, let's use ping:

root@server:~# ip netns exec ns1 ping -c 3 192.168.0.2
PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.414 ms
64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=0.027 ms
64 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=0.030 ms
--- 192.168.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.027/0.157/0.414/0.181 ms
root@server:~#
root@server:~# ip netns exec ns2 ping -c 3 192.168.0.1
PING 192.168.0.1 (192.168.0.1) 56(84) bytes of data.
64 bytes from 192.168.0.1: icmp_seq=1 ttl=64 time=0.150 ms
64 bytes from 192.168.0.1: icmp_seq=2 ttl=64 time=0.025 ms
64 bytes from 192.168.0.1: icmp_seq=3 ttl=64 time=0.027 ms
--- 192.168.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.025/0.067/0.150/0.058 ms
root@server:~#

Open vSwitch allows for assigning VLAN tags to network interfaces, resulting in traffic isolation between namespaces. This can be helpful in a scenario where you have multiple namespaces and you want to have connectivity between some of them.

The following example demonstrates how to tag the virtual interfaces on the ns1 and ns2 namespaces, so that the traffic will not be visible from each of the two network namespaces:

root@server:~# ovs-vsctl set port veth-ns1 tag=100
root@server:~# ovs-vsctl set port veth-ns2 tag=200
root@server:~# ovs-vsctl show
0ea38b4f-8943-4d5b-8d80-62ccb73ec9ec
Bridge "OVS-1"
    Port "OVS-1"
        Interface "OVS-1"
            type: internal
    Port "veth-ns1"
        tag: 100
        Interface "veth-ns1"
    Port "veth-ns2"
        tag: 200
        Interface "veth-ns2"
ovs_version: "2.0.2"
root@server:~#

Both the namespaces should now be isolated in their own VLANs and ping should fail:

root@server:~# ip netns exec ns1 ping -c 3 192.168.0.2
PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
--- 192.168.0.2 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms
root@server:~# ip netns exec ns2 ping -c 3 192.168.0.1
PING 192.168.0.1 (192.168.0.1) 56(84) bytes of data.
--- 192.168.0.1 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 1999ms
root@server:~#

We can also use the unshare utility that we saw in the mount and UTC namespaces examples to create a new network namespace:

root@server:~# unshare --net /bin/bash
root@server:~# ip a s
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@server:~# exit
root@server

Resource management with cgroups

Cgroups are kernel features that allows fine-grained control over resource allocation for a single process, or a group of processes, called tasks. In the context of LXC this is quite important, because it makes it possible to assign limits to how much memory, CPU time, or I/O, any given container can use.

The cgroups we are most interested in are described in the following table:

Subsystem

Description

Defined in

cpu

Allocates CPU time for tasks

kernel/sched/core.c

cpuacct

Accounts for CPU usage

kernel/sched/core.c

cpuset

Assigns CPU cores to tasks

kernel/cpuset.c

memory

Allocates memory for tasks

mm/memcontrol.c

blkio

Limits the I/O access to devices

block/blk-cgroup.c

devices

Allows/denies access to devices

security/device_cgroup.c

freezer

Suspends/resumes tasks

kernel/cgroup_freezer.c

net_cls

Tags network packets

net/sched/cls_cgroup.c

net_prio

Prioritizes network traffic

net/core/netprio_cgroup.c

hugetlb

Limits the HugeTLB

mm/hugetlb_cgroup.c

Cgroups are organized in hierarchies, represented as directories in a Virtual File System (VFS). Similar to process hierarchies, where every process is a descendent of the init or systemd process, cgroups inherit some of the properties of their parents. Multiple cgroups hierarchies can exist on the system, each one representing a single or group of resources. It is possible to have hierarchies that combine two or more subsystems, for example, memory and I/O, and tasks assigned to a group will have limits applied on those resources.

Note

If you are interested in how the different subsystems are implemented in the kernel, install the kernel source and have a look at the C files, shown in the third column of the table.

The following diagram helps visualize a single hierarchy that has two subsystems—CPU and I/O—attached to it:

Resource management with cgroups

Cgroups can be used in two ways:

  • By manually manipulating files and directories on a mounted VFS
  • Using userspace tools provided by various packages such as cgroup-bin on Debian/Ubuntu and libcgroup on RHEL/CentOS

Let's have a look at few practical examples on how to use cgroups to limit resources. This will help us get a better understanding of how containers work.

Limiting I/O throughput

Let's assume we have two applications running on a server that are heavily I/O bound: app1 and app2. We would like to give more bandwidth to app1 during the day and to app2 during the night. This type of I/O throughput prioritization can be accomplished using the blkio subsystem.

First, let's attach the blkio subsystem by mounting the cgroup VFS:

root@server:~# mkdir -p /cgroup/blkio
root@server:~# mount -t cgroup -o blkio blkio /cgroup/blkio
root@server:~# cat /proc/mounts | grep cgroup
blkio /cgroup/blkio cgroup rw, relatime, blkio, crelease_agent=/run/cgmanager/agents/cgm-release-agent.blkio 0 0
root@server:~#

Next, create two priority groups, which will be part of the same blkio hierarchy:

root@server:~# mkdir /cgroup/blkio/high_io
root@server:~# mkdir /cgroup/blkio/low_io
root@server:~#

We need to acquire the PIDs of the app1 and app2 processes and assign them to the high_io and low_io groups:

root@server:~# pidof app1 | while read PID; do echo $PID >> /cgroup/blkio/high_io/tasks; done 
root@server:~# pidof app2 | while read PID; do echo $PID >> /cgroup/blkio/low_io/tasks; done
root@server:~#
Limiting I/O throughput

The blkio hierarchy we've created

The tasks file is where we define what processes/tasks the limit should be applied on.

Finally, let's set a ratio of 10:1 for the high_io and low_io cgroups. Tasks in those cgroups will immediately use only the resources made available to them:

root@server:~# echo 1000 > /cgroup/blkio/high_io/blkio.weight
root@server:~# echo 100 > /cgroup/blkio/low_io/blkio.weight
root@server:~#

The blkio.weight file defines the weight of I/O access available to a process or group of processes, with values ranging from 100 to 1,000. In this example, the values of 1000 and 100 create a ratio of 10:1.

With this, the low priority application, app2 will use only about 10 percent of the I/O operations available, whereas the high priority application, app1, will use about 90 percent.

If you list the contents of the high_io directory on Ubuntu you will see the following files:

root@server:~# ls -la /cgroup/blkio/high_io/
drwxr-xr-x 2 root root 0 Aug 24 16:14 .
drwxr-xr-x 4 root root 0 Aug 19 21:14 ..
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_merged
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_merged_recursive
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_queued
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_queued_recursive
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_service_bytes
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_service_bytes_recursive
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_serviced
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_serviced_recursive
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_service_time
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_service_time_recursive
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_wait_time
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.io_wait_time_recursive
-rw-r--r-- 1 root root 0 Aug 24 16:14 blkio.leaf_weight
-rw-r--r-- 1 root root 0 Aug 24 16:14 blkio.leaf_weight_device
--w------- 1 root root 0 Aug 24 16:14 blkio.reset_stats
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.sectors
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.sectors_recursive
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.throttle.io_service_bytes
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.throttle.io_serviced
-rw-r--r-- 1 root root 0 Aug 24 16:14 blkio.throttle.read_bps_device
-rw-r--r-- 1 root root 0 Aug 24 16:14 blkio.throttle.read_iops_device
-rw-r--r-- 1 root root 0 Aug 24 16:14 blkio.throttle.write_bps_device
-rw-r--r-- 1 root root 0 Aug 24 16:14 blkio.throttle.write_iops_device
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.time
-r--r--r-- 1 root root 0 Aug 24 16:14 blkio.time_recursive
-rw-r--r-- 1 root root 0 Aug 24 16:49 blkio.weight
-rw-r--r-- 1 root root 0 Aug 24 17:01 blkio.weight_device
-rw-r--r-- 1 root root 0 Aug 24 16:14 cgroup.clone_children
--w--w--w- 1 root root 0 Aug 24 16:14 cgroup.event_control
-rw-r--r-- 1 root root 0 Aug 24 16:14 cgroup.procs
-rw-r--r-- 1 root root 0 Aug 24 16:14 notify_on_release
-rw-r--r-- 1 root root 0 Aug 24 16:14 tasks
root@server:~#

From the preceding output you can see that only some files are writeable. This depends on various OS settings, such as what I/O scheduler is being used.

We've already seen what the tasks and blkio.weight files are used for. The following is a short description of the most commonly used files in the blkio subsystem:

File

Description

blkio.io_merged

Total number of reads/writes, sync, or async merged into requests

blkio.io_queued

Total number of read/write, sync, or async requests queued up at any given time

blkio.io_service_bytes

The number of bytes transferred to or from the specified device

blkio.io_serviced

The number of I/O operations issued to the specified device

blkio.io_service_time

Total amount of time between request dispatch and request completion in nanoseconds for the specified device

blkio.io_wait_time

Total amount of time the I/O operations spent waiting in the scheduler queues for the specified device

blkio.leaf_weight

Similar to blkio.weight and can be applied to the Completely Fair Queuing (CFQ) I/O scheduler

blkio.reset_stats

Writing an integer to this file will reset all statistics

blkio.sectors

The number of sectors transferred to or from the specified device

blkio.throttle.io_service_bytes

The number of bytes transferred to or from the disk

blkio.throttle.io_serviced

The number of I/O operations issued to the specified disk

blkio.time

The disk time allocated to a device in milliseconds

blkio.weight

Specifies weight for a cgroup hierarchy

blkio.weight_device

Same as blkio.weight, but specifies a block device to apply the limit on

tasks

Attach tasks to the cgroup

Tip

One thing to keep in mind is that writing to the files directly to make changes will not persist after the server restarts. Later in this chapter, you will learn how to use the userspace tools to generate persistent configuration files.

Limiting memory usage

The memory subsystem controls how much memory is presented to and available for use by processes. This can be particularly useful in multitenant environments where better control over how much memory a user process can utilize is needed, or to limit memory hungry applications. Containerized solutions like LXC can use the memory subsystem to manage the size of the instances, without needing to restart the entire container.

The memory subsystem performs resource accounting, such as tracking the utilization of anonymous pages, file caches, swap caches, and general hierarchical accounting, all of which presents an overhead. Because of this, the memory cgroup is disabled by default on some Linux distributions. If the following commands below fail you'll need to enable it, by specifying the following GRUB parameter and restarting:

root@server:~# vim /etc/default/grub
RUB_CMDLINE_LINUX_DEFAULT="cgroup_enable=memory"
root@server:~# grub-update && reboot

First, let's mount the memory cgroup:

root@server:~# mkdir -p /cgroup/memory
root@server:~# mount -t cgroup -o memory memory /cgroup/memory
root@server:~# cat /proc/mounts | grep memory
memory /cgroup/memory cgroup rw, relatime, memory, release_agent=/run/cgmanager/agents/cgm-release-agent.memory 0 0
root@server:~#

Then set the app1 memory to 1 GB:

root@server:~# mkdir /cgroup/memory/app1
root@server:~# echo 1G > /cgroup/memory/app1/memory.limit_in_bytes
root@server:~# cat /cgroup/memory/app1/memory.limit_in_bytes
1073741824
root@server:~# pidof app1 | while read PID; do echo $PID >> /cgroup/memory/app1/tasks; done
root@server:~#
Limiting memory usage

The memory hierarchy for the app1 process

Similar to the blkio subsystem, the tasks file is used to specify the PID of the processes we are adding to the cgroup hierarchy, and the memory.limit_in_bytes specifies how much memory is to be made available in bytes.

The app1 memory hierarchy contains the following files:

root@server:~# ls -la /cgroup/memory/app1/
drwxr-xr-x 2 root root 0 Aug 24 22:05 .
drwxr-xr-x 3 root root 0 Aug 19 21:02 ..
-rw-r--r-- 1 root root 0 Aug 24 22:05 cgroup.clone_children
--w--w--w- 1 root root 0 Aug 24 22:05 cgroup.event_control
-rw-r--r-- 1 root root 0 Aug 24 22:05 cgroup.procs
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.failcnt
--w------- 1 root root 0 Aug 24 22:05 memory.force_empty
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.failcnt
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.limit_in_bytes
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.max_usage_in_bytes
-r--r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.slabinfo
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.tcp.failcnt
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.tcp.limit_in_bytes
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.tcp.max_usage_in_bytes
-r--r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.tcp.usage_in_bytes
-r--r--r-- 1 root root 0 Aug 24 22:05 memory.kmem.usage_in_bytes
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.move_charge_at_immigrate
-r--r--r-- 1 root root 0 Aug 24 22:05 memory.numa_stat
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.oom_control
---------- 1 root root 0 Aug 24 22:05 memory.pressure_level
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.soft_limit_in_bytes
-r--r--r-- 1 root root 0 Aug 24 22:05 memory.stat
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.swappiness
-r--r--r-- 1 root root 0 Aug 24 22:05 memory.usage_in_bytes
-rw-r--r-- 1 root root 0 Aug 24 22:05 memory.use_hierarchy
-rw-r--r-- 1 root root 0 Aug 24 22:05 tasks
root@server:~#

The files and their function in the memory subsystem are described in the following table:

File

Description

memory.failcnt

Shows the total number of memory limit hits

memory.force_empty

If set to 0, frees memory used by tasks

memory.kmem.failcnt

Shows the total number of kernel memory limit hits

memory.kmem.limit_in_bytes

Sets or shows kernel memory hard limit

memory.kmem.max_usage_in_bytes

Shows maximum kernel memory usage

memory.kmem.tcp.failcnt

Shows the number of TCP buffer memory limit hits

memory.kmem.tcp.limit_in_bytes

Sets or shows hard limit for TCP buffer memory

memory.kmem.tcp.max_usage_in_bytes

Shows maximum TCP buffer memory usage

memory.kmem.tcp.usage_in_bytes

Shows current TCP buffer memory

memory.kmem.usage_in_bytes

Shows current kernel memory

memory.limit_in_bytes

Sets or shows memory usage limit

memory.max_usage_in_bytes

Shows maximum memory usage

memory.move_charge_at_immigrate

Sets or shows controls of moving charges

memory.numa_stat

Shows the number of memory usage per NUMA node

memory.oom_control

Sets or shows the OOM controls

memory.pressure_level

Sets memory pressure notifications

memory.soft_limit_in_bytes

Sets or shows soft limit of memory usage

memory.stat

Shows various statistics

memory.swappiness

Sets or shows swappiness level

memory.usage_in_bytes

Shows current memory usage

memory.use_hierarchy

Sets memory reclamation from child processes

tasks

Attaches tasks to the cgroup

Limiting the memory available to a process might trigger the Out of Memory (OOM) killer, which might kill the running task. If this is not the desired behavior and you prefer the process to be suspended waiting for memory to be freed, the OOM killer can be disabled:

root@server:~# cat /cgroup/memory/app1/memory.oom_control
oom_kill_disable 0
under_oom 0
root@server:~# echo 1 > /cgroup/memory/app1/memory.oom_control
root@server:~#

The memory cgroup presents a wide slew of accounting statistics in the memory.stat file, which can be of interest:

root@server:~# head /cgroup/memory/app1/memory.stat
cache 43325     # Number of bytes of page cache memory
rss 55d43       # Number of bytes of anonymous and swap cache memory
rss_huge 0      # Number of anonymous transparent hugepages
mapped_file 2   # Number of bytes of mapped file
writeback 0     # Number of bytes of cache queued for syncing
pgpgin 0        # Number of charging events to the memory cgroup
pgpgout 0       # Number of uncharging events to the memory cgroup
pgfault 0       # Total number of page faults
pgmajfault 0    # Number of major page faults
inactive_anon 0 # Anonymous and swap cache memory on inactive LRU list

If you need to start a new task in the app1 memory hierarchy you can move the current shell process into the tasks file, and all other processes started in this shell will be direct descendants and inherit the same cgroup properties:

root@server:~# echo $$ > /cgroup/memory/app1/tasks
root@server:~# echo "The memory limit is now applied to all processes started from this shell"

The cpu and cpuset subsystems

The cpu subsystem schedules CPU time to cgroup hierarchies and their tasks. It provides finer control over CPU execution time than the default behavior of the CFS.

The cpuset subsystem allows for assigning CPU cores to a set of tasks, similar to the taskset command in Linux.

The main benefits that the cpu and cpuset subsystems provide are better utilization per processor core for highly CPU bound applications. They also allow for distributing load between cores that are otherwise idle at certain times of the day. In the context of multitenant environments, running many LXC containers, cpu and cpuset cgroups allow for creating different instance sizes and container flavors, for example exposing only a single core per container, with 40 percent scheduled work time.

As an example, let's assume we have two processes app1 and app2, and we would like app1 to use 60 percent of the CPU time and app2 only 40 percent. We start by mounting the cgroup VFS:

root@server:~# mkdir -p /cgroup/cpu
root@server:~# mount -t cgroup -o cpu cpu /cgroup/cpu
root@server:~# cat /proc/mounts | grep cpu
cpu /cgroup/cpu cgroup rw, relatime, cpu, release_agent=/run/cgmanager/agents/cgm-release-agent.cpu 0 0

Then we create two child hierarchies:

root@server:~# mkdir /cgroup/cpu/limit_60_percent
root@server:~# mkdir /cgroup/cpu/limit_40_percent

Also assign CPU shares for each, where app1 will get 60 percent and app2 will get 40 percent of the scheduled time:

root@server:~# echo 600 > /cgroup/cpu/limit_60_percent/cpu.shares
root@server:~# echo 400 > /cgroup/cpu/limit_40_percent/cpu.shares

Finally, we move the PIDs in the tasks files:

root@server:~# pidof app1 | while read PID; do echo $PID >> /cgroup/cpu/limit_60_percent/tasks; done
root@server:~# pidof app2 | while read PID; do echo $PID >> /cgroup/cpu/limit_40_percent/tasks; done
root@server:~#

The cpu subsystem contains the following control files:

root@server:~# ls -la /cgroup/cpu/limit_60_percent/
drwxr-xr-x 2 root root 0 Aug 25 15:13 .
drwxr-xr-x 4 root root 0 Aug 19 21:02 ..
-rw-r--r-- 1 root root 0 Aug 25 15:13 cgroup.clone_children
--w--w--w- 1 root root 0 Aug 25 15:13 cgroup.event_control
-rw-r--r-- 1 root root 0 Aug 25 15:13 cgroup.procs
-rw-r--r-- 1 root root 0 Aug 25 15:13 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Aug 25 15:13 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Aug 25 15:14 cpu.shares
-r--r--r-- 1 root root 0 Aug 25 15:13 cpu.stat
-rw-r--r-- 1 root root 0 Aug 25 15:13 notify_on_release
-rw-r--r-- 1 root root 0 Aug 25 15:13 tasks
root@server:~#

Here's a brief explanation of each:

File

Description

cpu.cfs_period_us

CPU resource reallocation in microseconds

cpu.cfs_quota_us

Run duration of tasks in microseconds during one cpu.cfs_perious_us period

cpu.shares

Relative share of CPU time available to the tasks

cpu.stat

Shows CPU time statistics

tasks

Attaches tasks to the cgroup

The cpu.stat file is of particular interest:

root@server:~# cat /cgroup/cpu/limit_60_percent/cpu.stat
nr_periods 0        # number of elapsed period intervals, as specified in
                # cpu.cfs_period_us
nr_throttled 0      # number of times a task was not scheduled to run
                # because of quota limit
throttled_time 0    # total time in nanoseconds for which tasks have been
                # throttled
root@server:~#

To demonstrate how the cpuset subsystem works, let's create cpuset hierarchies named app1, containing CPUs 0 and 1. The app2 cgroup will contain only CPU 1:

root@server:~# mkdir /cgroup/cpuset
root@server:~# mount -t cgroup -o cpuset cpuset /cgroup/cpuset
root@server:~# mkdir /cgroup/cpuset/app{1..2}
root@server:~# echo 0-1 > /cgroup/cpuset/app1/cpuset.cpus
root@server:~# echo 1 > /cgroup/cpuset/app2/cpuset.cpus
root@server:~# pidof app1 | while read PID; do echo $PID >> /cgroup/cpuset/app1/tasks limit_60_percent/tasks; done
root@server:~# pidof app2 | while read PID; do echo $PID >> /cgroup/cpuset/app2/tasks limit_40_percent/tasks; done
root@server:~#

To check if the app1 process is pinned to CPU 0 and 1, we can use:

root@server:~# taskset -c -p $(pidof app1)
pid 8052's current affinity list: 0,1
root@server:~# taskset -c -p $(pidof app2)
pid 8052's current affinity list: 1
root@server:~#

The cpuset app1 hierarchy contains the following files:

root@server:~# ls -la /cgroup/cpuset/app1/
drwxr-xr-x 2 root root 0 Aug 25 16:47 .
drwxr-xr-x 5 root root 0 Aug 19 21:02 ..
-rw-r--r-- 1 root root 0 Aug 25 16:47 cgroup.clone_children
--w--w--w- 1 root root 0 Aug 25 16:47 cgroup.event_control
-rw-r--r-- 1 root root 0 Aug 25 16:47 cgroup.procs
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.cpu_exclusive
-rw-r--r-- 1 root root 0 Aug 25 17:57 cpuset.cpus
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.mem_exclusive
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.mem_hardwall
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.memory_migrate
-r--r--r-- 1 root root 0 Aug 25 16:47 cpuset.memory_pressure
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.memory_spread_page
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.memory_spread_slab
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.mems
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.sched_load_balance
-rw-r--r-- 1 root root 0 Aug 25 16:47 cpuset.sched_relax_domain_level
-rw-r--r-- 1 root root 0 Aug 25 16:47 notify_on_release
-rw-r--r-- 1 root root 0 Aug 25 17:13 tasks
root@server:~#

A brief description of the control files is as follows:

File

Description

cpuset.cpu_exclusive

Checks if other cpuset hierarchies share the settings defined in the current group

cpuset.cpus

List of the physical numbers of the CPUs on which processes in that cpuset are allowed to execute

cpuset.mem_exclusive

Should the cpuset have exclusive use of its memory nodes

cpuset.mem_hardwall

Checks if each tasks' user allocation be kept separate

cpuset.memory_migrate

Checks if a page in memory should migrate to a new node if the values in cpuset.mems change

cpuset.memory_pressure

Contains running average of the memory pressure created by the processes

cpuset.memory_spread_page

Checks if filesystem buffers should spread evenly across the memory nodes

cpuset.memory_spread_slab

Checks if kernel slab caches for file I/O operations should spread evenly across the cpuset

cpuset.mems

Specifies the memory nodes that tasks in this cgroup are permitted to access

cpuset.sched_load_balance

Checks if the kernel balance should load across the CPUs in the cpuset by moving processes from overloaded CPUs to less utilized CPUs

cpuset.sched_relax_domain_level

Contains the width of the range of CPUs across which the kernel should attempt to balance loads

notify_on_release

Checks if the hierarchy should receive special handling after it is released and no process are using it

tasks

Attaches tasks to the cgroup

The cgroup freezer subsystem

The freezer subsystem can be used to suspend the current state of running tasks for the purposes of analyzing them, or to create a checkpoint that can be used to migrate the process to a different server. Another use case is when a process is negatively impacting the system and needs to be temporarily paused, without losing its current state data.

The next example shows how to suspend the execution of the top process, check its state, and then resume it.

First, mount the freezer subsystem and create the new hierarchy:

root@server:~# mkdir /cgroup/freezer
root@server:~# mount -t cgroup -o freezer freezer /cgroup/freezer
root@server:~# mkdir /cgroup/freezer/frozen_group
root@server:~# cat /proc/mounts | grep freezer
freezer /cgroup/freezer cgroup rw,relatime,freezer,release_agent=/run/cgmanager/agents/cgm-release-agent.freezer 0 0
root@server:~#

In a new terminal, start the top process and observe how it periodically refreshes. Back in the original terminal, add the PID of top to the frozen_group task file and observe its state:

root@server:~# echo 25731 > /cgroup/freezer/frozen_group/tasks
root@server:~# cat /cgroup/freezer/frozen_group/freezer.state
THAWED
root@server:~#

To freeze the process, echo the following:

root@server:~# echo FROZEN > /cgroup/freezer/frozen_group/freezer.state
root@server:~# cat /cgroup/freezer/frozen_group/freezer.state
FROZEN
root@server:~# cat /proc/25s731/status | grep -i state
State:      D (disk sleep)
root@server:~#

Notice how the top process output is not refreshing anymore, and upon inspection of its status file, you can see that it is now in the blocked state.

To resume it, execute the following:

root@server:~# echo THAWED > /cgroup/freezer/frozen_group/freezer.state
root@server:~# cat /proc/29328/status  | grep -i state
State:  S (sleeping)
root@server:~#

Inspecting the frozen_group hierarchy yields the following files:

root@server:~# ls -la /cgroup/freezer/frozen_group/
drwxr-xr-x 2 root root 0 Aug 25 20:50 .
drwxr-xr-x 4 root root 0 Aug 19 21:02 ..
-rw-r--r-- 1 root root 0 Aug 25 20:50 cgroup.clone_children
--w--w--w- 1 root root 0 Aug 25 20:50 cgroup.event_control
-rw-r--r-- 1 root root 0 Aug 25 20:50 cgroup.procs
-r--r--r-- 1 root root 0 Aug 25 20:50 freezer.parent_freezing
-r--r--r-- 1 root root 0 Aug 25 20:50 freezer.self_freezing
-rw-r--r-- 1 root root 0 Aug 25 21:00 freezer.state
-rw-r--r-- 1 root root 0 Aug 25 20:50 notify_on_release
-rw-r--r-- 1 root root 0 Aug 25 20:59 tasks
root@server:~#

The few files of interest are described in the following table:

File

Description

freezer.parent_freezing

Shows the parent-state. Shows 0 if none of the cgroup's ancestors is FROZEN; otherwise, 1.

freezer.self_freezing

Shows the self-state. Shows 0 if the self-state is THAWED; otherwise, 1.

freezer.state

Sets the self-state of the cgroup to either THAWED or FROZEN.

tasks

Attaches tasks to the cgroup.

Using userspace tools to manage cgroups and persist changes

Working with the cgroups subsystems by manipulating directories and files directly is a fast and convenient way to prototype and test changes, however, this comes with few drawbacks, namely the changes made will not persist a server restart and there's not much error reporting or handling.

To address this, there are packages that provide userspace  tools and daemons that are quite easy to use. Let's see a few examples.

To install the tools on Debian/Ubuntu, run the following:

root@server:~# apt-get install -y cgroup-bin cgroup-lite libcgroup1
root@server:~# service cgroup-lite start

On RHEL/CentOS, execute the following:

root@server:~# yum install libcgroup
root@server:~# service cgconfig start

To mount all subsystems, run the following:

root@server:~# cgroups-mount
root@server:~# cat /proc/mounts | grep cgroup
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory,release_agent=/run/cgmanager/agents/cgm-release-agent.memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices,release_agent=/run/cgmanager/agents/cgm-release-agent.devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer,release_agent=/run/cgmanager/agents/cgm-release-agent.freezer 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio,release_agent=/run/cgmanager/agents/cgm-release-agent.blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event,release_agent=/run/cgmanager/agents/cgm-release-agent.perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,relatime,hugetlb,release_agent=/run/cgmanager/agents/cgm-release-agent.hugetlb 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset,release_agent=/run/cgmanager/agents/cgm-release-agent.cpuset,clone_children 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu,release_agent=/run/cgmanager/agents/cgm-release-agent.cpu 0 0
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct,release_agent=/run/cgmanager/agents/cgm-release-agent.cpuacct 0 0

Notice from the preceding output the location of the cgroups - /sys/fs/cgroup. This is the default location on many Linux distributions and in most cases the various subsystems have already been mounted.

To verify what cgroup subsystems are in use, we can check with the following commands:

root@server:~# cat /proc/cgroups
#subsys_name  hierarchy  num_cgroups  enabled
cpuset  7  1  1
cpu  8  2  1
cpuacct  9  1  1
memory  10  2  1
devices  11  1  1
freezer  12  1  1
blkio  6  3  1
perf_event  13  1  1
hugetlb  14  1  1

Next, let's create a blkio hierarchy and add an already running process to it with cgclassify. This is similar to what we did earlier, by creating the directories and the files by hand:

root@server:~# cgcreate -g blkio:high_io
root@server:~# cgcreate -g blkio:low_io
root@server:~# cgclassify -g blkio:low_io $(pidof app1)
root@server:~# cat /sys/fs/cgroup/blkio/low_io/tasks
8052
root@server:~# cgset -r blkio.weight=1000 high_io
root@server:~# cgset -r blkio.weight=100 low_io
root@server:~# cat /sys/fs/cgroup/blkio/high_io/blkio.weight
1000
root@server:~#

Now that we have defined the high_io and low_io cgroups and added a process to them, let's generate a configuration file that can be used later to reapply the setup:

root@server:~# cgsnapshot -s -f /tmp/cgconfig_io.conf
cpuset = /sys/fs/cgroup/cpuset;
cpu = /sys/fs/cgroup/cpu;
cpuacct = /sys/fs/cgroup/cpuacct;
memory = /sys/fs/cgroup/memory;
devices = /sys/fs/cgroup/devices;
freezer = /sys/fs/cgroup/freezer;
blkio = /sys/fs/cgroup/blkio;
perf_event = /sys/fs/cgroup/perf_event;
hugetlb = /sys/fs/cgroup/hugetlb;
root@server:~# cat /tmp/cgconfig_io.conf
# Configuration file generated by cgsnapshot
mount {
    blkio = /sys/fs/cgroup/blkio;
}
group low_io {
    blkio {
        blkio.leaf_weight="500";
    blkio.leaf_weight_device="";
    blkio.weight="100";
    blkio.weight_device="";
    blkio.throttle.write_iops_device="";
    blkio.throttle.read_iops_device="";
    blkio.throttle.write_bps_device="";
    blkio.throttle.read_bps_device="";
    blkio.reset_stats="";
  }
}
group high_io {
blkio {
    blkio.leaf_weight="500";
    blkio.leaf_weight_device="";
    blkio.weight="1000";
    blkio.weight_device="";
    blkio.throttle.write_iops_device="";
    blkio.throttle.read_iops_device="";
    blkio.throttle.write_bps_device="";
    blkio.throttle.read_bps_device="";
    blkio.reset_stats="";
  }
}
root@server:~#

To start a new process in the high_io group, we can use the cgexec command:

root@server:~# cgexec -g blkio:high_io bash
root@server:~# echo $$
19654
root@server:~# cat /sys/fs/cgroup/blkio/high_io/tasks
19654
root@server:~#

In the preceding example, we started a new bash process in the high_io cgroup, as confirmed by looking at the tasks file.

To move an already running process to the memory subsystem, first we create the high_prio and low_prio groups and move the task with cgclassify:

root@server:~# cgcreate -g cpu,memory:high_prio
root@server:~# cgcreate -g cpu,memory:low_prio
root@server:~# cgclassify -g cpu,memory:high_prio 8052
root@server:~# cat /sys/fs/cgroup/memory/high_prio/tasks
8052
root@server:~# cat /sys/fs/cgroup/cpu/high_prio/tasks
8052
root@server:~#

To set the memory and CPU limits, we can use the cgset command. In contrast, remember that we used the echo command to manually move the PIDs and memory limits to the tasks and the memory.limit_in_bytes files:

root@server:~# cgset -r memory.limit_in_bytes=1G low_prio
root@server:~# cat /sys/fs/cgroup/memory/low_prio/memory.limit_in_bytes
1073741824
root@server:~# cgset -r cpu.shares=1000 high_prio
root@server:~# cat /sys/fs/cgroup/cpu/high_prio/cpu.shares
1000
root@server:~#

To see how the cgroup hierarchies look, we can use the lscgroup utility:

root@server:~# lscgroup
cpuset:/
cpu:/
cpu:/low_prio
cpu:/high_prio
cpuacct:/
memory:/
memory:/low_prio
memory:/high_prio
devices:/
freezer:/
blkio:/
blkio:/low_io
blkio:/high_io
perf_event:/
hugetlb:/
root@server:~#

The preceding output confirms the existence of the blkio, memory, and cpu hierarchies and their children.

Once finished, you can delete the hierarchies with cgdelete, which deletes the respective directories on the VFS:

root@server:~# cgdelete -g cpu,memory:high_prio
root@server:~# cgdelete -g cpu,memory:low_prio
root@server:~# lscgroup
cpuset:/
cpu:/
cpuacct:/
memory:/
devices:/
freezer:/
blkio:/
blkio:/low_io
blkio:/high_io
perf_event:/
hugetlb:/
root@server:~#

To completely clear the cgroups, we can use the cgclear utility, which will unmount the cgroup directories:

root@server:~# cgclear
root@server:~# lscgroup
cgroups can't be listed: Cgroup is not mounted
root@server:~#

Managing resources with systemd

With the increased adoption of systemd as an init system, new ways of manipulating cgroups were introduced. For example, if the cpu controller is enabled in the kernel, systemd will create a cgroup for each service by default. This behavior can be changed by adding or removing cgroup subsystems in the configuration file of systemd, usually found at /etc/systemd/system.conf.

If multiple services are running on the server, the CPU resources will be shared equally among them by default, because systemd assigns equal weights to each. To change this behavior for an application, we can edit its service file and define the CPU shares, allocated memory, and I/O.

The following example demonstrates how to change the CPU shares, memory, and I/O limits for the nginx process:

root@server:~# vim /etc/systemd/system/nginx.service
.include /usr/lib/systemd/system/httpd.service
[Service]
CPUShares=2000
MemoryLimit=1G
BlockIOWeight=100

To apply the changes first reload systemd, then nginx:

root@server:~#  systemctl daemon-reload
root@server:~#  systemctl restart httpd.service
root@server:~#  

This will create and update the necessary control files in /sys/fs/cgroup/systemd and apply the limits.

Summary

The advent of kernel namespaces and cgroups made it possible to isolate groups of processes in a self-confined lightweight virtualization package; we call them containers. In this chapter, we saw how containers provide the same features as other full-fledged hypervisor-based virtualization technologies such as KVM and Xen, without the overhead of running multiple kernels in the same operating system. LXC takes full advantage of Linux cgroups and namespaces to achieve this level of isolation and resource control.

With the foundation gained from this chapter, you'll be able to understand better what's going on under the hood, which will make it much easier to troubleshoot and support the full life cycle of Linux containers, as we'll do in the next chapters.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Get the most practical and up-to-date resource on LXC and take full advantage of what Linux containers can offer in the day-to-day operations of large-scale applications
  • Learn how to deploy and administer various workloads such as web applications inside LXC
  • Save your organization time and money by building robust and secure containers and by speeding the deployment process of your software

Description

In recent years, containers have gained wide adoption by businesses running a variety of application loads. This became possible largely due to the advent of kernel namespaces and better resource management with control groups (cgroups). Linux containers (LXC) are a direct implementation of those kernel features that provide operating system level virtualization without the overhead of a hypervisor layer. This book starts by introducing the foundational concepts behind the implementation of LXC, then moves into the practical aspects of installing and configuring LXC containers. Moving on, you will explore container networking, security, and backups. You will also learn how to deploy LXC with technologies like Open Stack and Vagrant. By the end of the book, you will have a solid grasp of how LXC is implemented and how to run production applications in a highly available and scalable way.

Who is this book for?

This book is for Linux engineers and software developers who are looking to deploy applications in a fast, secure, and scalable way for use in testing and production.

What you will learn

  • Deep dive into the foundations of Linux containers with kernel namespaces and cgroups
  • Install, configure, and administer Linux containers with LXC and libvirt
  • Begin writing applications using Python libvirt bindings
  • Take an in-depth look at container networking
  • Set up monitoring and security with LXC
  • Build and deploy a highly available application with LXC in the cloud
Estimated delivery fee Deliver to Brazil

Standard delivery 10 - 13 business days

R$63.95

Premium delivery 3 - 6 business days

R$203.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Feb 28, 2017
Length: 352 pages
Edition : 1st
Language : English
ISBN-13 : 9781785888946
Vendor :
Linux Foundation
Languages :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Brazil

Standard delivery 10 - 13 business days

R$63.95

Premium delivery 3 - 6 business days

R$203.95
(Includes tracking information)

Product Details

Publication date : Feb 28, 2017
Length: 352 pages
Edition : 1st
Language : English
ISBN-13 : 9781785888946
Vendor :
Linux Foundation
Languages :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
R$50 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
R$500 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just R$25 each
Feature tick icon Exclusive print discounts
R$800 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just R$25 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total R$ 986.97
Mastering KVM Virtualization
R$339.99
Containerization with LXC
R$306.99
KVM Virtualization Cookbook
R$339.99
Total R$ 986.97 Stars icon
Banner background image

Table of Contents

9 Chapters
1. Introduction to Linux Containers Chevron down icon Chevron up icon
2. Installing and Running LXC on Linux Systems Chevron down icon Chevron up icon
3. Command-Line Operations Using Native and Libvirt Tools Chevron down icon Chevron up icon
4. LXC Code Integration with Python Chevron down icon Chevron up icon
5. Networking in LXC with the Linux Bridge and Open vSwitch Chevron down icon Chevron up icon
6. Clustering and Horizontal Scaling with LXC Chevron down icon Chevron up icon
7. Monitoring and Backups in a Containerized World Chevron down icon Chevron up icon
8. Using LXC with OpenStack Chevron down icon Chevron up icon
A. LXC Alternatives to Docker and OpenVZ Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Full star icon 5
(2 Ratings)
5 star 100%
4 star 0%
3 star 0%
2 star 0%
1 star 0%
Mirko Longa Soto Sep 27, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
El autor me sorprende con un libro cargado de ejemplos, pasando desde los fundamentos hasta ejemplos concretos de poner en servicio sistemas completos de nodos redundantes con balanceadores de carga incluido y todo con LXC, un buen libro para tener en cuenta.
Amazon Verified review Amazon
Sean Embry May 15, 2017
Full star icon Full star icon Full star icon Full star icon Full star icon 5
If you need to know containerization, this is the go to source for you. Konstantin knows his stuff and with a clear and incisive writing style, makes sure you know it to.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela