You're reading from The Art of Writing Efficient Programs An advanced programmer's guide to efficient hardware utilization and compiler optimizations using C++ examples

Product type Paperback

Published in Oct 2021

Publisher Packt

ISBN-13 9781800208117

Length 464 pages

Edition 1st Edition

Languages

C++

Tools

Cmake

Concepts

High Performance Programming

Author (1):

Fedor G. Pikus

View More author details

Table of Contents (18) Chapters

Preface

1. Section 1 – Performance Fundamentals

2. Chapter 1: Introduction to Performance and Concurrency FREE CHAPTER

3. Chapter 2: Performance Measurements

4. Chapter 3: CPU Architecture, Resources, and Performance

5. Chapter 4: Memory Architecture and Performance

6. Chapter 5: Threads, Memory, and Concurrency

7. Section 2 – Advanced Concurrency

8. Chapter 6: Concurrency and Performance

9. Chapter 7: Data Structures for Concurrency

10. Chapter 8: Concurrency in C++

11. Section 3 – Designing and Coding High-Performance Programs

12. Chapter 9: High-Performance C++

13. Chapter 10: Compiler Optimizations in C++

14. Chapter 11: Undefined Behavior and Performance

15. Chapter 12: Design for Performance

16. Assessments

17. Other Books You May Enjoy

Data dependencies and pipelining

Our analysis of the CPU capabilities so far has shown that the processor can execute multiple operations at once as long as the operands are already in the registers: we can evaluate a fairly complex expression that depends on just two values in exactly as much time as it takes to add these values. The depends on just two values qualifier is, unfortunately, a very serious restriction. We now consider a more realistic code example, and we don't have to make many changes to our code:

for (size_t i = 0; i < N; ++i) {
     a1 += (p1[i] + p2[i])*(p1[i] - p2[i]);
}

Recall that the old code had the same loop with a simpler body: a1 += (p1[i] + p2[i]);. Also, p1[i] is just an alias for the vector element v1[i], same for p2 and v2. Why is this code more complex? We have already seen that the processor can do addition, subtraction, and multiplication in a single cycle, and the expression still depends on just two values...