Low-level concurrency
Concurrency cannot be achieved without explicit hardware support. We discussed SMT and multi-core processors in the previous chapters. Recall that every processor core has its own L1 cache, and several cores share the L2 cache. The shared L2 cache gives the processor cores a fast mechanism to coordinate their cache access, eliminating the comparatively expensive memory access. Additionally, a processor buffers its writes to memory in what is known as a dirty write-buffer. This lets the processor issue a batch of memory update requests, reorder the instructions, and determine the final value to write to memory, a technique known as write absorption.
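The effect of write buffering can sometimes be observed from ordinary code. The following is a minimal sketch in Java of the classic "store buffering" test, not a definitive benchmark: two threads each write one variable and then read the other. The class name StoreBufferDemo and the method countBothZero are hypothetical names chosen for this illustration. With plain fields, both threads may read 0, because each write can still be sitting in a buffer (or be reordered by the JIT compiler) when the other thread reads. Whether that outcome actually shows up depends on the hardware, the JVM, and timing, so it may well be rare or absent on a given machine. With volatile fields, the Java memory model forbids it.

```java
import java.util.concurrent.CyclicBarrier;

public class StoreBufferDemo {
    // Plain fields: writes may be buffered or reordered (illustrative names).
    static int x, y;
    // Volatile fields: accesses are sequentially consistent under the Java memory model.
    static volatile int vx, vy;
    // Values observed by thread A (r1) and thread B (r2).
    static int r1, r2;

    // Runs the test `rounds` times and returns how often both threads read 0.
    static int countBothZero(boolean useVolatile, int rounds) throws Exception {
        int bothZero = 0;
        for (int i = 0; i < rounds; i++) {
            x = 0; y = 0; vx = 0; vy = 0; r1 = -1; r2 = -1;
            CyclicBarrier start = new CyclicBarrier(2);
            Thread a = new Thread(() -> {
                await(start); // line both threads up before the race
                if (useVolatile) { vx = 1; r1 = vy; } else { x = 1; r1 = y; }
            });
            Thread b = new Thread(() -> {
                await(start);
                if (useVolatile) { vy = 1; r2 = vx; } else { y = 1; r2 = x; }
            });
            a.start(); b.start();
            a.join(); b.join(); // join makes r1 and r2 safely visible here
            if (r1 == 0 && r2 == 0) bothZero++;
        }
        return bothZero;
    }

    static void await(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("plain, both read 0:    " + countBothZero(false, 2000));
        System.out.println("volatile, both read 0: " + countBothZero(true, 2000));
    }
}
```

The volatile count is always 0. The plain count is not predictable: a nonzero value shows buffering or reordering at work, while a zero value only means that it was not observed in that run.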
Hardware memory barrier (fence) instructions
Memory access reordering is great for the performance of a sequential (single-threaded) program, but it is hazardous for concurrent programs, where the order of memory access in one thread may disrupt the expectations in another thread. The processor needs the means of synchronizing...