Performance begins with the CPU but does not end there
In the previous chapter, we studied CPU resources and how to use them for optimal performance. In particular, we observed that CPUs can do quite a lot of computation in parallel (instruction-level parallelism). We demonstrated this with several benchmarks, which showed that the CPU can execute many operations per cycle without any performance penalty: adding and subtracting two numbers, for example, takes just as much time as only adding them.
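As a reminder of what such a benchmark might look like, here is a minimal sketch rather than the exact code from the previous chapter. The function names `add_only`, `add_and_subtract`, and `seconds` are hypothetical, and the element type `long` is an assumption made for illustration. The two loops read the same inputs; the second one does an extra, independent subtraction per iteration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration; the names are not from the text.

// One operation per iteration: add the two inputs and accumulate.
long add_only(const std::vector<long>& p1, const std::vector<long>& p2) {
    long a1 = 0;
    for (std::size_t i = 0; i < p1.size(); ++i) {
        a1 += p1[i] + p2[i];
    }
    return a1;
}

// Two independent operations per iteration on the same inputs.
// The subtraction does not depend on the addition, so the CPU is
// free to execute both in the same cycle.
long add_and_subtract(const std::vector<long>& p1,
                      const std::vector<long>& p2,
                      long& diff) {
    long a1 = 0, a2 = 0;
    for (std::size_t i = 0; i < p1.size(); ++i) {
        a1 += p1[i] + p2[i];
        a2 += p1[i] - p2[i];
    }
    diff = a2;  // returned through the reference so the work is not discarded
    return a1;
}

// Wall-clock time, in seconds, of a single call to f.
template <typename F>
double seconds(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

To compare the two, one could fill two equally sized vectors, wrap each call in `seconds`, and use the returned sums so the compiler cannot eliminate the loops. The measured times depend on the compiler, the optimization level, and the hardware, so treat any single run as indicative only; a proper benchmarking harness gives more reliable numbers.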
You might have noticed, however, that these benchmarks and examples have one rather unusual property. Consider the following example:
for (size_t i = 0; i < N; ++i) {
    a1 += p1[i] + p2[i];
    a2 += p1[i] * p2[i];
    ...