Hands-On Concurrency with Rust

Preliminaries – Machine Architecture and Getting Started with Rust

In this chapter, we'll work through the preliminaries of the book, making sure to cover the basics necessary to frame the remainder of the book. This book is about parallel programming in the Rust programing language. It's essential, then, to understand modern computing hardware at a high level. We'll need a sense of how a modern CPU operates, how memory buses influence the CPU's ability to perform work and what it means for a computer to be able to do multiple things at once. In addition, we'll discuss validating your Rust installation, and cover generating executables for the two CPU architectures that this book will be concerned with.

By the close of this chapter, we will have:

Discussed a high-level model of CPU operations
Discussed a high-level model of computer memory
Had a preliminary discussion of the Rust memory model
Investigated generating runnable Rust programs for x86 and ARM
Investigated debugging these programs

The machine

In this book, independent of the specific Rust techniques, we'll attempt to teach a kind of mechanical sympathy with the modern parallel computer. There are two model kinds of parallelism we'll touch on—concurrent memory operations and data parallelism. We'll spend most of our time in this book on concurrent memory operations, the kind of parallelism in which multiple CPUs contend to manipulate a shared, addressable memory. Data parallelism, where the CPU is able to operate with a single or multiple instructions on multiple words at a time concurrently, will be touched on, but the details are CPU specific, and the necessary intrinsics are only now becoming available in the base language as this book goes to press. Fortunately, Rust, as a systems language with modern library management, will easily allow us to pull in an appropriate library and emit the correct instructions, or we could inline the assembly ourselves.

Literature on the abstract construction of algorithms for parallel machines must choose a machine model to operate under. The parallel random access machine (PRAM) is common in the literature. In this book, we will focus on two concrete machine architectures:

These machines were chosen because they are common and because they each have specific properties that will be important when we get to Chapter 6, Atomics – the Primitives of Synchronization. Actual machines deviate from the the PRAM model in important ways. Most obviously, actual machines have a limited number of CPUs and a bounded amount of RAM. Memory locations are not uniformly accessible from each CPU; in fact, cache hierarchies have a significant impact on the performance of computer programs. None of this is to say that PRAM is an absurd simplification, nor is this true of any other model you'll find in literature. What should be understood is that as we work, we'll need to draw lines of abstraction out of necessity, where further detail does not improve our ability to solve problems well. We also have to understand how our abstractions, suited to our own work, relate to the abstractions of others so that we can learn and share. In this book, we will concern ourselves with empirical methods for understanding our machines, involving careful measurement, examination of assembly code, and experimentation with alternative implementations. This will be combined with an abstract model of our machines, more specific to today's machines than PRAM, but still, in the details, focusing on total cache layers, cache sizes, bus speeds, microcode versions, and so forth. The reader is encouraged to add more specificity should the need arise and should they feel so emboldened.

The CPU

The CPU is a device that interprets a stream of instructions, manipulating storage and other devices connected to it in the process. The simplest model of a CPU, and the one that is often introduced first in undergrad computer science, is that a CPU receives an instruction from some nebulous place, performs the interpretation, receives the next instruction, interprets that, and so forth. The CPU maintains an internal pace via an oscillator circuit, and all instructions take a defined number of oscillator pulses—or clock cycles—to execute. In some CPU models, every instruction will take the same number of clock cycles to execute, and in others the cycle count of instructions will vary. Some CPU instructions modify registers, have very low latency with specialized but exceedingly finite memory locations built into the CPU. Other CPU instructions modify the main memory, RAM. Other instructions move or copy information between registers and RAM or vice versa. RAM—whose read/write latency is much higher than that of registers but is much more plentiful—and other storage devices, such as SSDs, are connected to the CPU by specialized buses.

The exact nature of these buses, their bandwidth and transmission latency, varies between machine architectures. On some systems, every location in the RAM is addressable—meaning that it can be read or written to—in constant time from the available CPUs. In other systems, this is not the case—some RAM is CPU-local and some is CPU-remote. Some instructions control special hardware interrupts that cause memory ranges to be written to other bus-connected storage devices. Mostly, these devices are exceedingly slow, compared to the RAM, which is itself slow compared to the registers.

All of this is to explain that in the simplest model of a CPU, where instructions are executed serially, instructions may well end up stalling for many clock cycles while waiting for reads or writes to be executed. To that end, it's important to understand that almost all CPUs—and especially the CPUs we'll concern ourselves with in this book—perform out-of-order executions of their instructions. Just so long as a CPU can prove that two sequences of instructions access the memory distinctly—that is, they do not interfere with each other—then the CPU is free and will probably reorder instructions. This is one of the things that makes C's undefined behavior concerning uninitialized memory so interesting. Perhaps your program's future has already filled in the memory, or perhaps not. Out-of-order execution makes reasoning about a processor's behavior difficult, but its benefit is that CPUs can execute programs much faster by deferring a sequence of instructions that is stalled on some kind of memory access.

In the same spirit, most modern CPUs—and especially the CPUs we'll concern ourselves with in this book—perform branch prediction. Say that branches in our programs tend to branch the same way at execution time—for example, say we have a feature-flag test that is configured to be enabled for the lifetime of a program. CPUs that perform branch prediction will speculatively execute one side of a branch that tends to branch in a certain way while they wait on other stalled instruction pipelines' instructions. When the branch instruction sequence catches up to its branch test, and if the test goes the predicted way, there's already been a great deal of work done and the instruction sequence can skip well ahead. Unfortunately, if the branch was mispredicted, then all this prior work must be torn down and thrown away, and the correct branch will have to be computed, which is quite expensive. It's for this reason that you'll find that programmers who worry about the nitty-gritty performance characteristics of their programs will tend to fret about branches and try to remove them.

All of this is quite power-hungry, reordering instructions to avoid stalls or racing ahead to perform computations that may be thrown away. Power-hungry implies hot, which implies cooling, which implies more power expenditure. All of which is not necessarily great for the sustainability of technological civilization, depending on how the electricity for all this is generated and where the waste heat is dumped. To that end, many modern CPUs integrate some kind of power-scaling features. Some CPUs will lengthen the time between their clock pulses, meaning they execute fewer instructions in a certain span of time than they might otherwise have. Other CPUs race ahead as fast as they normally would and then shut themselves off for a spell, drawing minimal electricity and cooling in the meantime. The exact method by which to build and run a power-efficient CPU is well beyond the scope of this book. What's important to understand is that, as a result of all this, your program's execution speed will vary from run to run, all other things being equal, as the CPU decides whether or not it's time to save power. We'll see this in the next chapter when we manually set power-saving settings.

Memory and caches

The memory storage of a CPU, alluded to in the last section, is fairly minimal. It is limited to the handful of words in general-purpose registers, plus the special-purpose registers in some limited cases. Registers are very fast, owing to their construction and on-die location, but they are not suited for storage. Modern machines connect CPUs over a bus or buses to the main memory, a very large block of randomly addressable bytes. This random addressability is important as it means that, unlike other kinds of storage, the cost to retrieve the 0^th byte from RAM is not distinct from retrieving the 1,000,000,000^th byte. We programmers don't have to do any goofy trickery to ensure that our structures appear in the front of the RAM in order to be faster to retrieve or modify, whereas physical location in storage is a pressing concern for spinning hard disks and tape drives. Exactly how our CPUs interact with the memory varies between platforms, and the discussion that follows is heavily indebted to Mark Batty's description in his 2014 book, The C11 and C++11 Concurrency Model.

In a machine that exposes a sequentially consistent model of memory access, every memory load or store must be made in lockstep with one another, including in systems with multiple CPUs or multiple threads of execution per CPU. This limits important optimizations—consider the challenge of working around memory stalls in a sequentially consistent model—and so neither of the processors we'll be considering in this book offer this model. It will show up in literature by name because of the ease of reasoning about this model, and is worth knowing about, especially when studying atomics literature.

The x86 platform behaves as if it were sequentially consistent, excepting that every thread of execution maintains a FIFO buffer for writes prior to their being flushed to the main memory. In addition, there is a global lock for the coordination of atomic reads and writes between threads of execution. There's a lot to unpack here. Firstly, I use the words load and store interchangeably with read and write, as does most literature. There is also a variant distinction between plain load/store and atomic load/store. Atomic loads/stores are special in that their effects can be coordinated between threads of execution, allowing for coordination with varying degrees of guarantees. The x86 processor platform provides fence instructions that force the flush of write buffers, stalling other threads of execution attempting to access the written range of main memory until the flush is completed. This is the purpose of the global lock. Without atomicity, writes will be flushed willy-nilly. On x86 platforms, writes to an addressable location in the memory are coherent, meaning they are globally ordered, and reads will see values from that location in the order of the writes. Compared to ARM, the way this works on x86 is very simple—writes happen directly to main memory.

Let's look at an example. Taking inspiration from Batty's excellent dissertation, consider a setup where a parent thread sets two variables, x and y, to 0, then spawns two threads called, say, A and B. Thread A is responsible for setting x and then y to 1, whereas thread B is responsible for loading the value of x into a thread-local variable called thr_x and y into a thread-local variable called thr_y. This looks something like the following:

WRITE x := 0
WRITE y := 0
[A] WRITE x := 1
[A] WRITE y := 1
FLUSH A
[B] READ x
[B] WRITE thr_x := x
[B] READ y
[B] WRITE thr_y := y

In this specific example, thr_x == 1 and thr_y == 1. Had the flushes been ordered differently by the CPU, this outcome would have been different. For instance, look at the following:

WRITE x := 0
WRITE y := 0
[A] WRITE x := 1
[A] WRITE y := 1
[B] READ x
[B] WRITE thr_x := x
FLUSH A
[B] READ y
[B] WRITE thr_y := y

The consequence of this is that thr_x == 0 and thr_y == 1. Without any other coordination, the only other valid interleaving is thr_x == 0 and thr_y == 0. That is, as a result of the write buffer on x86, the write of x in thread A can never be reordered to occur after the write of y: thr_x == 1 and thr_y == 0. This kind of stinks, unless you enjoy this little program as a parlor trick. We want determinism out of our programs. To that end, x86 provides different fence and lock instructions that control how and when threads flush their local write buffers, and how and when threads may read byte ranges from the main memory. The exact interplay here is… complicated. We'll come back to it in great detail in Chapter 3, The Rust Memory Model – Ownership, References, and Manipulation and Chapter 6, Atomics – the Primitives of Synchronization. Suffice it to say that, for now, there's an SFENCE instruction available that forces a sequential consistency. We can employ this instruction as follows:

WRITE x := 0
WRITE y := 0
[A] WRITE x := 1
[A] WRITE y := 1
[A] SFENCE        [B] SFENCE
                  [B] READ x
                  [B] WRITE thr_x := x
                  [B] READ y
                  [B] WRITE thr_y := y

From this, we get thr_x == 1 and thr_y == 1. The ARM processor is a little different—1/0 is a valid interleaving. There is no global lock to the ARM, no fences, and no write buffer. In ARM, a string of instructions—called a propagation list, for reasons that will soon be clear—is maintained in an uncommitted state, meaning that the thread that originated the instructions will see them as in-flight, but they will not have been propagated to other threads. These uncommitted instructions may have been performed—resulting in side-effects in the memory—or not, allowing for speculative execution and the other performance tricks discussed in the previous section. Specifically, reads may be satisfied from a thread's local propagation list, but not writes. Branch instructions cause the propagation list that led to the branch instruction to be committed, potentially out of the order from that specified by the programmer. The memory subsystem keeps track of which propagation list has been sent to which thread, meaning that it is possible for a thread's private loads and stores to be out of order and for committed loads and stores to appear out of order between threads. Coherency is maintained on ARM with more active participation from the memory subsystem.

Whereas on x86, the main memory is something that has actions done to it, on ARM, the memory subsystem responds to requests and may invalidate previous, uncommitted requests. A write request involves a read-response event, by which it is uniquely identified, and a read request must reference a write.

This sets up a data-dependency chain. Responses to reads may be invalidated at a later time, but not writes. Each location receives a coherence-commitment ordering, which records a global order of writes, built up per-thread as propagation lists are propagated to threads.

This is very complicated. The end result is that writes to a thread's view of the memory may be done out of programmer order, writes committed to main memory may also be done out of order, and reads can be requested out of programmer order. Therefore, the following code is perfectly valid, owing to the lack of branches in our example:

WRITE x := 0
WRITE y := 0
[A] WRITE x := 1
[B] READ x
[A] WRITE y := 1
[B] WRITE thr_y := y
[B] WRITE thr_x := x
[B] READ y

ARM provides instructions for control of dependencies, called barriers. There are three types of dependency in ARM. The address dependency means a load is used to compute the address for access to the memory. The control dependency means that the program flow that leads to memory access depends on a load. We're already familiar with the data dependency.

While there is a lot of main memory available, it's not particularly fast. Well, let's clarify that. Fast here is a relative term. A photon in a vacuum will go about 30.5 centimeters in 1 nanosecond, meaning that in ideal circumstances, I, on the west coast of the United States, ought to be able to send a message to the east coast of the United States and receive a response in about 80 milliseconds. Of course, I'm ignoring request-processing time, the realities of the internet, and other factors; we're just dealing with ballpark figures, here. Consider that 80 milliseconds is 80,000,000 nanoseconds. A read access from a CPU to the main memory is around 100 nanoseconds, give or take your specific computer architecture, the details of your chips, and other factors. All this is to clarify that, when we say, not particularly fast, we're working outside normal human time scales.

It is difficult, sometimes, to keep from misjudging things as fast enough. Say, we have a 4 GHz processor. How many clock cycles do we get per nanosecond? Turns out, it's 4. Say, we have an instruction that needs to access main memory—which, remember, takes 100 nanoseconds—and happens to be able to do its work in exactly 4 cycles, or one nanosecond. That instruction will then be stalled for 99 nanoseconds while it waits, meaning we're potentially losing out on 99 instructions that could have been executed by the CPU. The CPU will make up some of that loss with its optimization tricks. These only go so far, unless our computation is very, very lucky.

In an effort to avoid the performance impact of the main memory on computation, processor manufacturers introduced caching between the processor and the main memory. Many machines these days have three layers of data cache, as well as an instruction cache, called dCACHE and iCACHE, which are tools we'll be using later. You will see dCACHE often consists of three layers these days, each layer being successively larger than the last, but also slower for cost or power concerns. The lowest, smallest layer is called L1, the next L2, and so forth. CPUs read into caches from the main memory in working blocks—or simply blocks—that are the size of the cache being read into. Memory access will preferentially be done on the L1 cache, then L2, then L3, and so forth, with time penalties at each level for missing and block reads. Cache hits, however, are significantly faster than going directly to the main memory. L1 dCACHE references are 0.5 nanoseconds, or fast enough that our hypothetical 4-cycle instruction is 8 times slower than the memory access it requires. L2 dCACHE references are 7 nanoseconds, still a fair sight better than the main memory. Of course, the exact numbers will vary from system to system, and we'll do quite a bit in the next chapter to measure them directly. Cache coherency—maintaining a high ratio of hits to misses—is a significant component of building fast software on modern machines. We'll never get away from the CPU cache in this book.

Memory model

The details of a processor's handling of memory is both complicated and very specific to that processor. Programming languages—and Rust is no exception here—invent a memory model to paper over the details of all supported processors while, ideally, leaving the programmer enough freedom to exploit the specifics of each processor. Systems languages also tend to allow absolute freedom in the form of escape hatches from the language memory model, which is exactly what Rust has in terms of unsafe.

With regard to its memory model, Rust is very much inspired by C++. The atomic orderings exposed in Rust are those of LLVM's, which are those of C++11. This is a fine thing—any literature to do with either C++ or LLVM will be immediately applicable to Rust. Memory order is a complex topic, and it's often quite helpful to lean on material written for C++ when learning. This is especially important when studying up on lock-free/wait-free structures—which we'll see later in this book—as literature on those topics often deals with it in terms of C++. Literature written with C11 in mind is also suitable, if maybe a little less straightforward to translate.

Now, this is not to say that the C++ programmer will be immediately comfortable in concurrent Rust. The digs will be familiar, but not quite right. This is because Rust's memory model also includes a notion of reference consumption. In Rust, a thing in memory must be used zero or one times, but no more. This, incidentally, is an application of a version of linear-type theory called affine typing, if you'd like to read up more on the subject. Now, the consequence of restricting memory access in this way is that Rust is able to guarantee at compile-time safe memory access—threads cannot reference the same location in memory at the same time without coordination; out-of-order access in the same thread are not allowed, and so forth. Rust code is memory safe without relying on a garbage collector. In this book's estimation, memory safety is a good win, even though the restrictions that are introduced do complicate implementing certain kinds of structures that are more straightforward to build in C++ or similar languages.

This is a topic we'll cover in much greater detail in Chapter 3, The Rust Memory Model – Ownership, References and Manipulation.

Getting set up

Now that we have a handle on the machines, this book will deal with what we need to get synchronized on our Rust compiler. There are three channels of Rust to choose from—stable, beta, and nightly. Rust operates on a six-week release cycle. Every day the nightly channel is rolled over, containing all the new patches that have landed on the master branch since the day before. The nightly channel is special, compared to beta and stable, in that it is the only version of the compiler where nightly features are able to be compiled. Rust is very serious about backward compatibility. Proposed changes to the language are baked in nightly, debated by the community, and lived with for some time before they're promoted out of nightly only status and into the language proper. The beta channel is rolled over from the current nightly channel every six weeks. The stable channel is rolled over from beta at the same time, every six weeks.

This means that a new stable version of the compiler is at most only ever six weeks old. Which channel you choose to work with is up to you and your organization. Most teams I'm aware of work with nightly and ship stable, as many important tools in the ecosystem—such as clippy and rustfmt—are only available with nightly features, but the stable channel offers, well, a stable development target. You'll find that many libraries in the crate ecosystem work to stay on stable for this reason.

Unless otherwise noted, we'll focus on the stable channel in this book. We'll need two targets installed—one for our x86 machine and the other for our ARMv7. Your operating system may package Rust for you—kudos!—but the community tends to recommend the use of rustup, especially when managing multiple, version-pinned targets. If you're unfamiliar with rustup, you can find a persuasive explanation of and instructions for its use at rustup.rs. If you're installing a Rust target for use on a machine on which rustup is run, it will do the necessary triplet detections. Assuming that your machine has an x86 chip in it and is running Linux, then the following two commands will have equivalent results:

> rustup install stable

> rustup target add x86_64-unknown-linux-gnu

Both will instruct rustup to track the target x86_64-unknown-linux-gnu and install the stable channel version of Rust. If you're running OS X or Windows, a slightly different triplet will be installed by the first variant, but it's the chip that really matters. The second target, we'll need to be more precise with:

> rustup target add stable-armv7-unknown-linux-gnueabihf

Now you have the ability to generate x86 binaries for your development machine, for x86 (which is probably the same thing), and for an ARMv7 running Linux, the readily available RaspberryPi 3. If you intend to generate executables for the ARMv7—which is recommended, if you have or can obtain the chip—then you'll also need to install an appropriate cross-compiler to link. On a Debian-based development system, you can run the following:

> apt-get install gcc-arm-linux-gnueabihf

Instructions will vary by operating system, and, honestly, this can be the trickiest part of getting a cross-compiling project set up. Don't despair. If push comes to shove, you could always compile on host. Compilation will just be pokey. The final setup step before we get into the interesting part is to tell cargo how to link our ARMv7 target. Now, please be aware that cross-compilation is an active area of work in the Rust community at the time of writing. The following configuration file fiddling may have changed a little between this book being published and your reading of it. Apologies. The Rust community documentation will surely help patch up some of the differences. Anyway, add the following—or similar, depending on your operating system—to ~/.cargo/config:

[target.armv7-unknown-linux-gnueabihf]
linker = "arm-linux-gnueabihf-gcc"

If ~/.cargo/config doesn't exist, go ahead and create it with the preceding contents.

The interesting part

Let's create a default cargo project and confirm that we can emit the appropriate assembly for our machines. Doing this will be one of the important pillars of this book. Now, choose a directory on disk to place the default project and navigate there. This example will use ~/projects, but the exact path doesn't matter. Then, generate the default project:

~projects > cargo new --bin hello_world
     Created binary (application) `hello_world` project

Go ahead and reward yourself with some x86 assembler:

hello_world > RUSTFLAGS="--emit asm" cargo build --target=x86_64-unknown-linux-gnu
   Compiling hello_world v0.1.0 (file:///home/blt/projects/hello_world)
       Finished dev [unoptimized + debuginfo] target(s) in 0.33 secs
hello_world > file target/x86_64-unknown-linux-gnu/debug/deps/hello_world-6000abe15b385411.s
target/x86_64-unknown-linux-gnu/debug/deps/hello_world-6000abe15b385411.s: assembler source, ASCII text

Please be aware that if your compilation of a Rust binary on OS X x86 will not run on Linux x86, and vice versa. This is due to the differences in the interfaces of the operating systems themselves. You're better off compiling on your x86 Linux machine or your x86 OS X machine and running the binaries there. That's the approach I take with the material presented in this book.

That said, reward yourself with some ARMv7 assembler:

hello_world > RUSTFLAGS="--emit asm" cargo build --target=armv7-unknown-linux-gnueabihf
   Compiling hello_world v0.1.0 (file:///home/blt/projects/hello_world)
       Finished dev [unoptimized + debuginfo] target(s) in 0.45 secs
hello_world > file target/armv7-unknown-linux-gnueabihf/debug/deps/hello_world-6000abe15b385411.s
target/armv7-unknown-linux-gnueabihf/debug/deps/hello_world-6000abe15b385411.s: assembler source, ASCII text

Of course, if you want to build release versions, you need only to give cargo the --release flag:

hello_world > RUSTFLAGS="--emit asm" cargo build --target=armv7-unknown-linux-gnueabihf --release
Compiling hello_world v0.1.0 (file:///home/blt/projects/hello_world)
Finished release [optimized] target(s) in 0.45 secs
hello_world > wc -l target/armv7-unknown-linux-gnueabihf/
debug/   release/
hello_world > wc -l target/armv7-unknown-linux-gnueabihf/*/deps/hello_world*.s
 1448 target/armv7-unknown-linux-gnueabihf/debug/deps/hello_world-6000abe15b385411.s
  101 target/armv7-unknown-linux-gnueabihf/release/deps/hello_world-dd65a12bd347f015.s
 1549 total

It's interesting—and instructive!—to compare the differences in the generated code. Notably, the release compilation will strip debugging entirely. Speaking of which, let's talk debuggers.

Debugging Rust programs

Depending on your language background, the debugging situation in Rust may be very familiar and comfortable, or it might strike you as a touch bare bones. Rust relies on the commonly used debugging tools that other programming languages have to hand—gdb or lldb. Both will work, though historically, lldb has had some issues, and it's only since about mid-2016 that either tool has supported unmangled Rust. Let's try gdb on hello_world from the previous section:

hello_world > gdb target/x86_64-unknown-linux-gnu/debug/hello_world
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from target/x86_64-unknown-linux-gnu/debug/hello_world...done.
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /home/blt/projects/hello_world/target/x86_64-unknown-linux-gnu/debug/hello_world.
Use `info auto-load python-scripts [REGEXP]' to list them.
(gdb) run
Starting program: /home/blt/projects/hello_world/target/x86_64-unknown-linux-gnu/debug/hello_world
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Hello, world!
[Inferior 1 (process 15973) exited normally]

Let's also try lldb:

hello_world > lldb --version
lldb version 3.9.1 ( revision )
hello_world > lldb target/x86_64-unknown-linux-gnu/debug/hello_world
(lldb) target create "target/x86_64-unknown-linux-gnu/debug/hello_world"
Current executable set to 'target/x86_64-unknown-linux-gnu/debug/hello_world' (x86_64).
(lldb) process launch
Process 16000 launched: '/home/blt/projects/hello_world/target/x86_64-unknown-linux-gnu/debug/hello_world' (x86_64)
Hello, world!
Process 16000 exited with status = 0 (0x00000000)

Either debugger is viable, and you're warmly encouraged to choose the one that suits your debugging style. This book will lean toward the use of lldb because of vague authorial preference.

The other suite of tooling you'll commonly see in Rust development—and elsewhere in this book—is valgrind. Rust being memory safe, you might wonder when valgrind would find use. Well, whenever you use unsafe. The unsafe keyword in Rust is fairly uncommon in day-to-day code, but does appear when squeezing out extra percentage points from hot code paths now and again. Note that unsafe blocks will absolutely appear in this book. If we run valgrind on hello_world, we'll get no leaks, as expected:

hello_world > valgrind --tool=memcheck target/x86_64-unknown-linux-gnu/debug/hello_world
==16462== Memcheck, a memory error detector
==16462== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==16462== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==16462== Command: target/x86_64-unknown-linux-gnu/debug/hello_world
==16462==
Hello, world!
==16462==
==16462== HEAP SUMMARY:
==16462==     in use at exit: 0 bytes in 0 blocks
==16462==   total heap usage: 7 allocs, 7 frees, 2,032 bytes allocated
==16462==
==16462== All heap blocks were freed -- no leaks are possible
==16462==
==16462== For counts of detected and suppressed errors, rerun with: -v
==16462== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Profiling memory use in Rust programs is an important day-to-day task when writing performance-critical projects. For this, we use Massif, the heap profiler:

hello_world > valgrind --tool=massif target/x86_64-unknown-linux-gnu/debug/hello_world
==16471== Massif, a heap profiler
==16471== Copyright (C) 2003-2015, and GNU GPL'd, by Nicholas Nethercote
==16471== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==16471== Command: target/x86_64-unknown-linux-gnu/debug/hello_world
==16471==
Hello, world!
==16471==

Profiling the cache is also an important routine task. For this, we use cachegrind, the cache and branch-prediction profiler:

hello_world > valgrind --tool=cachegrind target/x86_64-unknown-linux-gnu/debug/hello_world
==16495== Cachegrind, a cache and branch-prediction profiler
==16495== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==16495== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==16495== Command: target/x86_64-unknown-linux-gnu/debug/hello_world
==16495==
--16495-- warning: L3 cache found, using its data for the LL simulation.
Hello, world!
==16495==
==16495== I   refs:      533,954
==16495== I1  misses:      2,064
==16495== LLi misses:      1,907
==16495== I1  miss rate:    0.39%
==16495== LLi miss rate:    0.36%
==16495==
==16495== D   refs:      190,313  (131,906 rd   + 58,407 wr)
==16495== D1  misses:      4,665  (  3,209 rd   +  1,456 wr)
==16495== LLd misses:      3,480  (  2,104 rd   +  1,376 wr)
==16495== D1  miss rate:     2.5% (    2.4%     +    2.5%  )
==16495== LLd miss rate:     1.8% (    1.6%     +    2.4%  )
==16495==
==16495== LL refs:         6,729  (  5,273 rd   +  1,456 wr)
==16495== LL misses:       5,387  (  4,011 rd   +  1,376 wr)
==16495== LL miss rate:      0.7% (    0.6%     +    2.4%  )

Each of these will be used throughout the book and on much more interesting projects than hello_world. But hello_world is the first cross-compilation achieved in this text, and that's no small thing.