Scalability is the ability of a system to adapt itself to handle a growing number of incoming requests successfully by increasing the resources available to the system. Scalability is measured by the total number of requests your application can process and respond to successfully. How do you know your application has reached its threshold of the maximum capacity limit? When it is busy processing current requests in the pipeline and can no longer take any incoming requests and process them successfully. Also, your application may not perform as expected, resulting in performance issues, and some requests will start to fail by timing out. At this stage, we must scale our application for business continuity. Let's look at the options available.
Vertical scaling or scaling up
Vertical scaling or scaling up means adding more resources to individual application servers and increasing the hardware capacity. Users send requests and the application processes the requests, reads/writes to the database, and sends responses back to the users. If the user base grows and the number of incoming requests becomes high, the application server will be overloaded, resulting in longer processing times and latency in responding to users. In this case, we can scale up the application server hardware to a higher hardware capacity, as shown in the following diagram.
Figure 1.3 – Vertical scaling (scaling up)
Horizontal scaling or scaling out
Horizontal scaling or scaling out means adding more processing servers/machines to a system. Let's say my application is running on one server and can process up to 1,000 requests per minute. I could scale out by adding 4 more servers and could process 4,000 more requests per minute, as shown in the following screenshot.
Figure 1.4 – Horizontal scaling (scaling out)
Tip
Having a single server is always a bottleneck beyond a certain load, no matter how many CPU cores and memory you have. That's when horizontal scaling or scaling out may help.
Load balancers
Load balancers help in increasing scalability by distributing incoming traffic to healthy servers within a region when the amount of simultaneous traffic increases. Load balancers have health probe monitors to monitor a given port on each of the servers to check the health, and if they're found to be unhealthy, the server is disabled from the load balancer and incoming traffic. When the next health probe test passes, the server is added back to the load balancer.
Caching
Caching is one of the key system design patterns that help in scaling any application, along with improving response times. Any application typically involves reading and writing data from and to a data store, which is usually a relational database such as SQL Server or a NoSQL database such as Cosmos DB. However, reading data from the database for every request is not efficient, especially when data is not changing, because databases usually persist data to disk and it's a costly operation to load the data from disk and send it back to the browser client (or device in the case of mobile/desktop applications) or user. This is where caching comes into play. Cache stores can be used as a primary source for retrieving data and falling back to the original data store only when data is not available in the cache, thus giving a faster response to the consuming application. While doing this, we also need to ensure that the cached data is expired/refreshed as and when data in the original data store is updated.
Distributed caching
As we know, in a distributed system, the data store is split across multiple servers; similarly, distributed caching is an extension of traditional caching in which cached data is stored in more than one server in a network. Before we get into distributed caching, here's a quick recap of the CAP theorem:
- C: Stands for consistency, meaning the data is consistent across all the nodes and has the same copy of data
- A: Stands for availability, meaning the system is available, and failure of one node doesn't cause the system to go down
- P: Stands for partition tolerant, meaning the system doesn't go down even if the communication between nodes goes down
As per the CAP theorem, any distributed system can only achieve two of the preceding principles, and as distributed systems must be partition-tolerant (P), we can only achieve either the consistency (C) of data or the high availability (A) of data.
So, distributed caching is a cache strategy in which data is stored in multiple servers/nodes/shards outside the application server. Since data is distributed across multiple servers, if one server goes down, another server can be used as a backup to retrieve data. For example, if our system wanted to cache countries, states, and cities, and if there were three caching servers in a distributed caching system, hypothetically there would be a possibility that one of the cache servers would cache countries, another one would cache states, and one would cache cities (of course, in a real-time application, data is split in a much more complex way). Also, each server would additionally act as a backup for one or more entities. So, on a high level, one type of distributed cache system looks as shown:
Figure 1.5 – Distributed caching high-level representation
As you can see, while reading data, it is read from the primary server, and if the primary server is not available, the caching system will fall back to the secondary server. Similarly, for writes, write operations are not complete until data is written to the primary as well as the secondary server, and until this operation is completed, read operations can be blocked, hence compromising the availability of the system. Another strategy for writes could be background synchronization, which will result in the eventual consistency of data, hence compromising the consistency of data until synchronization is completed. Going back to the CAP theorem, most distributed caching systems fall under the category of CP or AP.
Sharding
Sharding can improve scalability when storing and accessing large data from data stores. This is achieved by splitting a single data store into multiple horizontal partitions or shards. As the data is split across a cluster of databases, the system will be able to store a large amount of data and at the same time, the system can handle additional requests. We can continue to scale the system out by adding further shards.
Here are a few important considerations:
- Keep shards balanced for even load distribution. Periodically rebalance shards as data is updated and removed from each shard.
- Avoid queries that retrieve data from multiple shards as they are not efficient and cause a performance bottleneck. You can use parallel tasks to fetch data from different shards for better efficiency but it adds complexity.
- Creating a large number of smaller shards is better for load balancing than a small number of large shards.
In the next section, we will see how to design applications for high availability.