Understanding system calls

An operating system is a piece of software designed to execute one or more applications simultaneously, while also providing the resources needed for those applications to execute. To accomplish this, the operating system must be capable of dividing hardware resources between all the applications executing on the system at the same time.

For example, most personal computers (PCs) have a single hard disk that stores all the files being used by the owner of the PC. On modern PCs, it's likely the user will want to execute several applications at once—for example, a web browser and an office suite.

Both of these applications will need exclusive access to the hard disk at various times while executing. In the case of the web browser, this might be to cache websites to disk, while in the case of the office suite, this might be to store documents.

It's the operating system's responsibility to manage the applications and their access to the hard disk, to ensure that both the web browser and the office suite are able to execute properly.

To accomplish this, operating systems provide an application programming interface (API) that applications can leverage to accomplish their tasks. Accessing the hard disk is an example of one of these tasks. The read() and write() functions are examples of APIs provided by POSIX-compliant operating systems for reading from and writing data to file descriptors.

Under the hood, these APIs make calls to the operating system using an application binary interface (ABI) called a system call. The act of making system calls to accomplish tasks provided by the operating system is called system programming, which is the main focus of this book.

The anatomy of a system call

For the purposes of this section, we will focus our examples on the Intel x86 architecture, although these examples apply to most other CPU architectures.

The original x86 architecture leveraged interrupts to provide system call ABIs. The APIs provided by the operating system would program specific registers on the CPU, and make a call to the operating system using an interrupt.

For example, using BIOS, an application could read data from a hard disk using int 0x13 with the following register layout:

AH = 2
AL: Sectors to read
CH: Cylinder
CL: Sector
DH: Head
DL: Drive
ES:BX: Buffer address

The application author would use the read() API command to read this data, while under the hood, read() would perform the system call using the preceding ABI. When int 0x13 executed, the application would be paused by the hardware, and the operating system (in this case, BIOS) would execute on behalf of the application to read data from the disk and return the result in the buffer provided by the application.

Once complete, BIOS would execute iret (interrupt return) to return to the application, which would then have the data read from disk waiting in its buffer to be used.

With this approach, the application doesn't need to know how to physically interface with the hard disk on that specific computer in order to read data; a task that is meant to be handled by the operating system and its device drivers.

The application doesn't have to worry about other applications that may be executing either. It can simply leverage the provided API (or ABI, depending on the operating system), and the rest of the gory details are handled by the operating system.

In other words, system calls provide a clean delineation between applications, to help the user accomplish specific tasks, and to help the operating system whose job it is to manage these applications and the hardware resources they require.

Interrupts are, however, slow. The hardware makes no assumptions about how the operating system is written, or how the applications the operating system is executing are written or organized. For this reason, interrupts must save the CPU state before the interrupt handler is executed, and restore this state when the iret command is executed, leading to poor performance.

As will be shown, applications make a lot of system calls when attempting to perform their job, and this poor performance became a bottleneck on x86 architectures (as well as other CPU architectures).

To solve this issue, modern versions of Intel x86 CPU provided fast system call instructions. These instructions were designed specifically to address the performance bottleneck of interrupt-driven system calls. However, they require coordination between the CPU, the operating system, and the applications executing on that operating system to reduce overhead.

Specifically, the operating system must structure the memory layout of itself and the applications it's running in a specific way, dictated by the CPU. By predefining the memory layout of the operating system and its associated applications, the CPU no longer needs to save and restore as much CPU state when performing a system call, reducing overhead. How this is accomplished is different depending on whether you're executing on an Intel or AMD x86 CPU.

The most important thing to understand with respect to how a system call is performed is that a system call is not cheap. Even with fast system call support, a system call has to perform a lot of work. In the case of reading data from a hard disk via the read() API, the CPU register state must be set up and a system call instruction must be executed. CPU control is handed off to the operating system to read data from the disk.

Since more than one application might be executing, and attempting to read data from the disk at the same time, the operating system might have to pause the application so that it can service another.

Once the operating system is ready to service the application, it must first figure out what data the application is attempting to read, which ultimately determines which physical device it needs to work with. In our example, this is a hard disk, but on a POSIX-compliant system it could be any type of block device.

Next, the operating system must leverage one of its device drivers to read data from this disk. This takes time, as the operating system has to physically program the hard disk to ask for data from a specific location, over a hardware bus that almost certainly is not executing at the same speed as the CPU itself.

Once the hard disk finally provides the operating system with the requested data, the operating system can provide this information back to the application and return control, restoring the CPU state to the application. All of this insanity is obscured by a single call to read().

For this reason, system calls should be executed sparingly, and only when absolutely needed, to prevent the poor performance of the resulting application.

It should be noted that this type of optimization requires a deep understanding of the APIs the application leverages, as higher-level APIs make their own system calls on the API's behalf. For example, allocating memory, as will be discussed later, is another type of system call.

For example, look at the difference between using an std::array{} or a std::vector{} command. std::vector{} supports resizing of the array being managed under the hood, which requires memory allocation. This can not only lead to memory fragmentation (a topic that will be discussed later on in this book), but also poor performance, as the memory allocation might have to ask the operating system for more system RAM.

Learning about different types of system calls

Almost every application that executes on a POSIX-compliant operating system must make a couple of system calls. Here, we outline some of the system call types that will be explored in this book.

Console input/output

If you have ever executed a command-line application, you willbe familiar with the concept of console-based input/output. This is especially true with respect to POSIX-compliant operating systems. When outputting to the console, you can either output to stdout (typically used for normal output) or stderr (typically used for outputting error messages).

Outputting to stdout and stderr is accomplished by an application performing a system that asks the operating system to deliver a character buffer to these output devices. (It should be noted that, in this book, we typically state that we are outputting to stdout, not printing to the console.)

The reason for this is that, on POSIX-compliant systems, your application doesn't actually know where it is sending the text to. The application leverages an API to output to stdout. This can be accomplished by:

Writing to a dedicated file handle (that is, stdout)
Using C APIs such as printf
Using C++ APIs such as std::cout
Forking an application that outputs to stdout for you (for example, by using echo)

Most of these examples, when all is said and done, make a system call to the operating system to transfer a character buffer to a device that manages stdout or stderr. In some cases, this causes the operating system to relay the resulting character buffer to the parent process (likely your shell), which will ultimately make another system call to display the character buffer on the screen.

However your operating system decides to handle this, a device driver exists in the operating system that manages the physical monitor used to display text, and the simple APIs the application calls to output text (for example, printf and std::cout) eventually provide this device driver with the requested character buffer.

Although, on most systems, the text being output to stdout is usually provided to your shell and eventually displayed on the screen, this doesn't have to be the case. Since the application is making a system call to output the character buffer, the operating system is free to forward this data to a serial device, log file, as input to another application, and so on.

This flexibility is one of the reasons POSIX-compliant operating systems are so powerful, and why learning how to properly make system calls is so important.

Memory allocation

Memory is another resource that an application must request using a system call. Most applications are given global and stack memory resources when the application is first executed, along with a small heap of memory that the application can use when calls to functions such as malloc() and free() are made.

If the application only uses the memory that it is initially given in this heap, no extra memory needs to be requested by the application. If, however, heap memory runs out, the application's malloc() or free() engine will have to ask the operating system (via a system call) for more memory.

To do this, the operating system will extend the end of the application by adding more physical memory to the application. The malloc() or free() engine is then able to make use of this additional memory, until more is needed.

On systems with limited RAM, when a request for additional memory is made, the operating system has to take memory from other applications that aren't currently executing. It does this by swapping these applications to disk, an operation that is expensive to perform.

For this reason, on resource-constrained systems, calls to malloc() or free() should not be made in time-critical code, as the time it takes to execute these functions can vary greatly.

We will go into further detail on memory management in Chapter 7, A Comprehensive Look at Memory Management.

File input/output

Reading and writing to a file is another common use case for most applications that requires making system calls.

It should be noted that on POSIX-compliant systems, reading and writing to a file descriptor doesn't always mean reading and writing to a file on a storage device. Instead, the system calls you make write to character or block devices. This could be a storage device, but could also be a console device, or even a virtual device such as /dev/random, which provides random data when read.

In Chapter 8, Learning to Program File Input/Output, we will provide more information about file input/output system programming.

Networking

Networking is another common use case that requires making system calls. On POSIX-compliant systems, we perform network-based system programming by working with POSIX sockets. Sockets provide an API for programming the Network Interface Controller (NIC), and support logic (for example, the TCP/IP stack) within the operating system.

Networking itself is an extremely complicated topic, deserving of its own book, but thankfully, the system calls needed to perform this type of programming are simple, with the majority of the gory details being handled by the operating system.

In Chapter 10, Programming POSIX Sockets Using C++, we will go into further detail on how to make these types of system calls using the socket API.

Time

Some readers might find it surprising to know that even performing simple tasks such as getting the current date and time require system calls to ask the operating system for this information. Even to this day, a dedicated chip (with a battery, in case of loss of power) is provided on the system to maintain the current date and time.

If this information is needed, a system call must be made to request it. When this happens, the operating system will ask the device driver responsible for managing the chip what date and time it is currently storing, and then this information will be returned to the application.

It should be noted that not all time interfaces require system calls. For example, most high-resolution timers, which are designed to compare a high-resolution number before and after an operation has taken place, do not need the operating system to perform this action. This is because these high-resolution timers usually exist directly in the CPU, and their values can be extracted using a simple instruction.

The downside to these types of timers is that their values in and of themselves are usually meaningless (that is, the difference between the values returned is what provides meaning, not the values themselves). Essentially, these timers are usually nothing more than a counter that increments each time the CPU ticks (that is, executes an instruction).

Since modern CPUs can dynamically change their frequency, the values these counters store depends on how long the CPU has executed since the previous power cycle, and at what frequency the CPU was set while it was executing.

There isn't even a guarantee that the value in one counter will be the same as the value read in another counter on another physical core, as each physical core is capable of changing its own frequency independently of other cores on multi-core CPUs.

The benefit of high-resolution timers is that they can be executed extremely quickly (as you are just executing an instruction that reads a counter in the CPU). The difference between two measured values can be used to carry out tasks such as measuring how long it takes to execute small functions—a task that usually doesn't work with standard timers, as they don't have enough granularity.

In Chapter 11, Time Interfaces in Unix, we will go over these details and even provide an example of how to do this yourself.

Threading and process creation

Executing multiple tasks simultaneously can be accomplished by asking the operating system to create additional threads (or even new processes). This is a common task in system programming, and there are numerous system calls to get the job done.

A process is a unit of execution that has a set of resources assigned to it (for example, memory, file descriptors, and so on.) Each application is made up of at least one process, but they can contain more than one (for example, a shell is an application that is specifically designed to run several child processes).

Each process is scheduled by the operating system to execute for a limited amount of time before the next process is given access to the CPU, and this cycle continues as needed.

Threads are like processes, but they share the same resources as other threads of the same process. Threads provide an application with an opportunity to create tasks that are capable of executing in parallel, without the need for inter-process communication methods. In Chapter 12, Learning to Program POSIX and C++ Threads, we will learn how to program threads using both POSIX and C++ APIs.

System call security risks

System calls are not without their security risks. Even on modern hardware, and using CPU architectures other than Intel, executing more than one process within an operating system with full isolation between processes is nearly impossible.

Although modern hardware and modern operating systems work hard to provide the best possible isolation and security, it should always be assumed that other, malicious processes executing alongside yours may be able to spy on what you're doing, including sensitive tasks such as decrypting user data.

This is another topic that deserves its own book, but here, we will briefly discuss two different, recent security vulnerabilities that affect system programming.

SYSRET

The fast system call interface provided by Intel and AMD was not without its issues. As stated previously, for fast system calls to work, the hardware, operating system, and applications must coordinate. This is to ensure that ABI information is handled properly, to allow the operating system to execute a system call without the need for the hardware to save the entire CPU state before execution begins.

The same applies when the system call is complete, and control must be handed back to the application. To accomplish this, the operating system must load the application's stack, and then execute the SYSRET instruction, which returns control to the application.

The problem with this approach is that a non-maskable interrupt (NMI) could fire between the operating system loading the application's stack and the execution of SYSRET. The result of this race condition is that an NMI (which is code that executes with root privileges) would be executed using the application's stack and not the kernel's stack, resulting in a possible security vulnerability or corruption.

Thankfully, there are ways for modern operating systems to prevent this type of attack, which most operating systems, such as Linux, can and do leverage.

Meltdown and Spectre

The Meltdown and Spectre attacks are a modern examples of just how complicated system calls are to implement. To support the fast execution of system calls, the kernel's memory is mapped into each application using a memory layout technical called the 3:1 split, which refers to the three-to-one ratio of application memory to kernel memory.

To prevent an application from reading/writing kernel memory, which may or may not contain highly-sensitive information such as encryption keys and passwords, modern CPU architectures provide a mechanism to lock down the kernel portion of this memory, such that only the kernel is capable of seeing it all. The application is only able to see its deprivileged portion of that memory.

To improve the performance of these modern CPUs, most architectures, including Intel, AMD, and ARM, incorporate a technology called speculative execution. For example, look at the following code:

if (x) {
    do_y();
}

do_z();

The CPU doesn't know whether x is true or false until it executes this instruction. If the CPU assumes that x is true, it can enhance performance by saving some CPU cycles. If x does, in fact, end up being true, the CPU saves cycles, whereas if x is actually false, the penalty is usually worth the risk, especially if the CPU can make an educated guess as to the likelihood of x being true instead of false (for example, if the CPU executed this statement in the past and x was true).

This type of optimization is called speculative execution. The CPU is executing code, even though it's possible the code may later turn out to be invalid and need to be undone.

Speculative execution attacks such as Meltdown and Spectre exploit this process to bypass the memory protections that protect the system call interface between an application and its kernel. This is done by convincing the CPU to speculatively execute an instruction that would typically cause a security violation (for example, attempting to read a password from kernel memory).

If the CPU speculatively executes this type of instruction, there will be a gap between the CPU loading the password into the CPU's cache, and the CPU figuring out that a security violation has occurred. If the CPU is interrupted during this gap (using what is called a transient instruction), the password will be left in the CPU's cache, even though the instruction never actually completed its execution.

To recover the password from the cache, attackers leverage additional attacks on the CPU called side-channel attacks, which are specifically designed to read the contents of a CPU's cache without performing a direct memory operation.

The end result is that an attacker is capable of setting up an elaborate set of conditions that will eventually allow them to recover sensitive information stored in the kernel, using nothing more than an unprivileged application (which could be a website you happened to click on while looking for cat videos).

If this seems complicated, that's because it is. These types of attacks are extremely sophisticated. The goal of these examples is to provide a brief overview of why system calls are not without their issues. Depending on the CPU and operating system you're executing on, you might have to take special care when handling sensitive information while system programming.