System Design Guide for Software Professionals: Build scalable solutions – from fundamental concepts to cracking top tech company interviews

Product type Paperback

Published in Aug 2024

Publisher Packt

ISBN-13 9781805124993

Length 384 pages

Edition 1st Edition

Concepts

Application Development

Authors (2):

Dhirendra Sinha

Tejas Chopra

Fault tolerance

Fault tolerance in a distributed system means that the system continues to function correctly despite component failures or network problems. It requires designing and implementing the system so that it detects and recovers from faults automatically, without human intervention.

To achieve fault tolerance, distributed systems employ various techniques, such as redundancy, replication, and error detection and recovery mechanisms. Redundancy involves duplicating system components or data to ensure that if one fails, another can take its place without disrupting the overall system. Replication involves creating multiple copies of data or services in different locations so that if one location fails, others can still provide the required service.
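As a rough illustration of how replication lets another location take over when one fails, here is a minimal in-memory sketch. The names `Replica`, `ReplicaUnavailable`, and `read_with_failover`, the region labels, and the sample data are all hypothetical and chosen for illustration only; a real system would also have to keep the copies in sync, which this sketch does not attempt.

```python
class ReplicaUnavailable(Exception):
    """Raised when a replica cannot serve a request (hypothetical error type)."""


class Replica:
    """One copy of the data, held in one location (illustrative only)."""

    def __init__(self, name, data, healthy=True):
        self.name = name
        self.data = dict(data)
        self.healthy = healthy

    def read(self, key):
        # An unhealthy replica simulates a failed location.
        if not self.healthy:
            raise ReplicaUnavailable(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in order and return (value, name) from the first that answers."""
    failed = []
    for replica in replicas:
        try:
            return replica.read(key), replica.name
        except ReplicaUnavailable:
            failed.append(replica.name)
    raise ReplicaUnavailable(f"all replicas failed: {failed}")


# Two copies of the same record in two assumed locations; the first one is down.
replicas = [
    Replica("us-east", {"user:1": "Ada"}, healthy=False),
    Replica("eu-west", {"user:1": "Ada"}),
]
value, served_by = read_with_failover(replicas, "user:1")
print(value, served_by)
```

The caller never sees the failure of the first replica; it only fails if every copy is unavailable, which is the property that redundancy is meant to buy.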

Error detection and recovery mechanisms involve constantly monitoring the system for errors or failures and taking appropriate actions to restore its normal functioning. For example, if a node fails to respond, the...
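One common way to monitor for unresponsive nodes is a heartbeat timeout, sketched below. This is an assumed technique chosen for illustration, not necessarily the mechanism the book goes on to describe; `HeartbeatMonitor`, the node names, and the 5-second timeout are hypothetical. The sketch only flags suspected nodes and leaves the recovery action out.

```python
class HeartbeatMonitor:
    """Flags nodes whose last heartbeat is older than a timeout (illustrative only)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> timestamp of its latest heartbeat

    def record_heartbeat(self, node, now):
        # Timestamps are passed in explicitly so the example is deterministic.
        self.last_seen[node] = now

    def suspected_failures(self, now):
        """Return the sorted names of nodes that have been silent for too long."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("node-a", now=0.0)
monitor.record_heartbeat("node-b", now=0.0)
monitor.record_heartbeat("node-a", now=4.0)  # node-a keeps reporting, node-b goes quiet

suspects = monitor.suspected_failures(now=7.0)
print(suspects)  # ['node-b']
```

In a running service the `now` argument would typically come from a monotonic clock such as `time.monotonic()`, and the timeout would be tuned to trade off detection speed against false alarms caused by slow networks.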