Why data sharing is expensive
As we have just seen, concurrent (simultaneous) access to shared data is a real performance killer. Intuitively, this makes sense: to avoid a data race, only one thread can operate on the shared data at any given time. We can enforce this with a mutex or, if one is available, with an atomic operation. Either way, when one thread is, say, incrementing the shared variable, all other threads have to wait. Our measurements in the last section confirm this.
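To make the two variants concrete, here is a minimal sketch of a counter incremented from several threads, first as an atomic variable and then as a plain variable guarded by a mutex. The names (atomic_count, guarded_count, increment_atomic, increment_guarded), the thread count, and the iteration count are illustrative assumptions, not the actual benchmark code from the last section:

    #include <atomic>
    #include <cstddef>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::atomic<unsigned long> atomic_count{0};  // shared atomic counter
    unsigned long guarded_count = 0;             // shared plain counter
    std::mutex count_mutex;                      // guards guarded_count

    void increment_atomic(std::size_t n) {
        for (std::size_t i = 0; i != n; ++i)
            // Relaxed ordering is enough for a simple counter.
            atomic_count.fetch_add(1, std::memory_order_relaxed);
    }

    void increment_guarded(std::size_t n) {
        for (std::size_t i = 0; i != n; ++i) {
            std::lock_guard<std::mutex> lock(count_mutex);
            ++guarded_count;
        }
    }

    int main() {
        constexpr std::size_t n = 1'000'000;
        std::vector<std::thread> threads;
        for (int t = 0; t != 4; ++t)
            threads.emplace_back(increment_atomic, n);  // or increment_guarded
        for (auto& t : threads) t.join();
    }

With the mutex, the serialization is explicit in the lock; with the atomic increment, it is pushed down into the hardware's read-modify-write, but either way the variable is updated by one thread at a time.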
However, before taking any action based on these observations and experiments, it is critically important to understand precisely what we measured and what can be concluded from it with certainty.
It is easy to describe what was observed: incrementing a shared variable from multiple threads at the same time does not scale at all and, in fact, is slower than using just one thread. This is true for both atomic shared variables and non-atomic variables guarded by a mutex. We have not tried to measure unguarded...