Using sharding and data distribution
In this section, you will learn about basic scalability techniques, such as database sharding. Sharding is widely used in high-end systems and offers a simple and reliable way to scale out a setup. In recent years, it has become the standard way to scale up professional systems.
Understanding the purpose of sharding
What happens if your setup grows beyond the capacity of a single machine in a single-master setup? What if you want to run so many transactions that one server is simply not able to keep up with them? Let's assume you have millions of users and tens of thousands among them want to perform a certain task at the same time.
Clearly, at some point, you cannot buy servers that are big enough to handle an infinite load anymore. It is simply impossible to run a Facebook- or Google-like application on a single server. At some point, you have to come up with a scalability strategy that serves your needs. This is when sharding comes into play.
The idea of sharding is simple: What if you could split data in a way that it can reside on different nodes?
Designing a sharded system
To demonstrate the basic concept of sharding, let's assume the following scenario: we want to store information about millions of users. Each user has a unique user ID. Let's further assume that we have only two servers. In this case, we can store even user IDs on server 1 and odd user IDs on server 2.
The following diagram shows how this can be done:
As you can see in our diagram, we have nicely distributed the data. Once this is done, we can send a query to the system, as follows:
SELECT * FROM t_user WHERE id = 4;
The client can easily figure out where to find the data by inspecting the filter in our query. In our example, the query will be sent to the first node because we are dealing with an even number.
As we have distributed the data based on a key (in this case, the user ID), we can search for any person easily if we know the key. In large systems, referring to users through a key is a common practice, and therefore, this approach is suitable. By using this simple approach, we have also easily doubled the number of machines in our setup.
When designing a system, we can easily come up with an arbitrary number of servers; all we have to do is to invent a nice and clever partitioning function to distribute the data inside our server farm. If we want to split the data between 10 servers (not a problem), how about using user ID % 10
as a partitioning function? If you are interested in sharding, consider checking out shard_manager
, which is available on PGXN.
When you are trying to break up data and store it on different hosts, always make sure that you are using a proper partitioning function. It can be very beneficial to split data in such a way that each host has more or less the same amount of data.
Splitting users alphabetically might not be a good idea. The reason for that is that not all the letters are equally likely. We cannot simply assume that the letters from A to M occur as often as the letters from N to Z. This can be a major issue if you want to distribute a dataset to a thousand servers instead of just a handful of machines. As stated before, it is essential to have a proper partitioning function, that produces evenly distributed results.
Tip
In many cases, a hash function will provide you with nicely and evenly distributed data. This can especially be useful when working with character fields (such as names, e-mail addresses, and so on).
Querying different fields
In the previous section, you saw how we can easily query a person using their key. Let's take this a little further and see what happens if the following query is used:
SELECT * FROM t_test WHERE name = 'Max';
Remember that we have distributed data using the ID. In our query, however, we are searching for the name. The application will have no idea which partition to use because there is no rule telling us what is where.
As a logical consequence, the application has to ask every partition for the name
parameter. This might be acceptable if looking for the name was a real corner case; however, we cannot rely on this fact. Requiring to ask many servers instead of one is clearly a serious deoptimization, and therefore, not acceptable.
We have two ways to approach the problem: coming up with a cleverer partitioning function, or storing the data redundantly.
Coming up with a cleverer partitioning function would surely be the best option, but it is rarely possible if you want to query different fields.
This leaves us with the second option, which is storing data redundantly. Storing a set of data twice, or even more often, is not too uncommon, and it's actually a good way to approach the problem. The following diagram shows how this can be done:
As you can see, we have two clusters in this scenario. When a query comes in, the system has to decide which data can be found on which node. For cases where the name is queried, we have (for the sake of simplicity) simply split the data into half alphabetically. In the first cluster, however, our data is still split by user ID.
Pros and cons of sharding
One important thing to understand is that sharding is not a simple one-way street. If someone decides to use sharding, it is essential to be aware of the upsides as well as the downsides of the technology. As always, there is no Holy Grail that magically solves all the problems of mankind out of the box without them having to think about it.
Each practical use case is different, and there is no replacement for common sense and deep thinking.
First, let's take a look at the pros of sharding:
- It has the ability to scale a system beyond one server
- It is a straightforward approach
- It is widely supported by various frameworks
- It can be combined with various other replication approaches
- It works nicely with PostgreSQL (for example, using PL/Proxy)
Light and shade tend to go together, and therefore sharding also has its downsides:
- Adding servers on the fly can be far from simple (depending on the type of partitioning function)
- Your flexibility might be seriously reduced
- Not all types of queries will be as efficient as they would be on a single server
- There is an increase in overall complexity of the setup (such as failover, resyncing, maintenance and so on)
- Backups need more planning
- You might face redundancy and additional storage requirements
- Application developers need to be aware of sharding to make sure that efficient queries are written
In Chapter 13, Scaling with PL/Proxy, we will discuss how you can efficiently use sharding along with PostgreSQL, and how to set up PL/Proxy for maximum performance and scalability.
Choosing between sharding and redundancy
Learning how to shard a table is only the first step to designing a scalable system's architecture. In the example we showed in the previous section, we had only one table, which could be distributed easily using a key. But what if we have more than one table? Let's assume we have two tables:
- A table called
t_user
for storing the users in our system - A table called
t_language
for storing the languages supported by our system
We might be able to partition the t_user
table nicely and split it in such a way that it can reside on a reasonable number of servers. But what about the t_language
table? Our system might support as many as 10 languages.
It can make perfect sense to shard and distribute hundreds of millions of users, but splitting up 10 languages? This is clearly useless. In addition to all this, we might need our language table on all the nodes so that we can perform joins.
One solution to the problem is simple: you need a full copy of the language table on all the nodes. This will not cause a storage-consumption-related problem because the table is so small. Of course, there are many different ways to attack the problem.
Tip
Make sure that only large tables are sharded. In the case of small tables, full replicas of the tables might just make much more sense.
Again, every case has to be thought over thoroughly.
Increasing and decreasing the size of a cluster
So far, we have always considered the size of a sharded setup to be constant. We have designed sharding in a way that allowed us to utilize a fixed number of partitions inside our cluster. This limitation might not reflect in everyday requirements. How can you really tell for certain how many nodes will be needed at the time a setup is designed? People might have a rough idea of the hardware requirements, but actually knowing how much load to expect is more art than science.
Tip
To reflect this, you have to design a system in such a way that it can be resized easily.
A commonly made mistake is that people tend to increase the size of their setup in unnecessarily small steps. Somebody might want to move from five to maybe six or seven machines. This can be tricky. Let's assume for a second that we have split data using user id % 5
as the partitioning function. What if we wanted to move to user id % 6
? This is not so easy; the problem is that we have to rebalance the data inside our cluster to reflect the new rules.
Remember that we have introduced sharding (that is, partitioning) because we have so much data and so much load that one server cannot handle the requests anymore. Now, if we come up with a strategy that requires rebalancing of data, we are already on the wrong track. You definitely don't want to rebalance 20 TB of data just to add two or three servers to your existing system.
Practically, it is a lot easier to simply double the number of partitions. Doubling your partitions does not require rebalancing of data because you can simply follow the strategy outlined here:
- Create a replica of each partition
- Delete half of the data on each partition
If your partitioning function was user id % 5
before, it should be user id % 10
afterwards. The advantage of doubling is that data cannot move between partitions. When it comes to doubling, users might argue that the size of your cluster might increase too rapidly. This is true, but if you are running out of capacity, adding 10 percent storage to your resources won't fix the problem of scalability anyway.
Instead of just doubling your cluster (which is fine for most cases), you can also give more thought to writing a more sophisticated partitioning function that leaves the old data in place but handles the more recent data more intelligently. Having time-dependent partitioning functions might cause issues of its own, but it might be worth investigating this path.
Tip
Some NoSQL systems use range partitioning to spread out data. Range partitioning means that each server has a fixed slice of data for a given time frame. This can be beneficial if you want to perform time series analysis or something similar. However, it can be counterproductive if you want to make sure that data is split evenly.
If you expect your cluster to grow, we recommend starting with more partitions than those initially necessary, and packing more than just one partition on a single server. Later on, it will be easy to move single partitions to additional hardware joining the cluster setup. Some cloud services are able to do that, but those aspects are not covered in this book.
To shrink your cluster again, you can simply apply the opposite strategy and move more than just one partition to a single server. This leaves the door for a future increase of servers wide open, and can be done fairly easily.
Consistent hashing is another approach to distributing data. This technique is widely used in NoSQL systems and allows us to extend clusters in a more flexible way. However, the same technique can be used for PostgreSQL, of course.
Let's assume we have three servers (A
, B
, and C
). What happens in consistent hashing is that an array of, say, 1,000 slots can be created. Then each server name is hashed a couple of times and entered in the array. The result might look like this:
43 B, 153 A, 190 C, 340 A, 450 C, 650 B, 890 A, 930 C, 980 B
Each value shows up multiple times. In the case of insertion, we take the input key and calculate a value. Let's assume hash (some input value) equals 58
. The result will go to server A. Why? There is no entry for 58
, so the system moves forward in the list, and the first valid entry is 153
, which points to A. If the hash value returned 900
, the data would end up on C. Again, there is no entry for 900
so the system has to move forward in the list until something is found.
If a new server is added, new values for the server will be added to the array (D might be on 32
, 560
, 940
, or so). The system has to rebalance some data, but of course, not all of the data. It is a major advantage over a simple hashing mechanism, such as a simple key % server_number
function. Reducing the amount of data to be resharded is highly important.
The main advantage of consistent hashing is that it scales a lot better than simple approaches.
Combining sharding and replication
Once data has been broken down into useful chunks, which can be handled by one server or a partition, we have to think about how to make the entire setup more reliable and fail-safe.
The more servers you have in your setup, the more likely it will be that one of those servers will be down or not available for some other reason.
Tip
Always avoid single points of failure when designing a highly scalable system.
In order to ensure maximum throughput and availability, we can again turn to redundancy. The design approach can be summed up in a simple formula, which should always be in the back of a system architect's mind:
"One is none and two is one."
One server is never enough to provide us with High Availability. Every system needs a backup system that can take over in the event of a serious emergency. By just splitting a set of data, we definitely do not improve availability. This is because we have more servers, which can fail at this point. To fix this problem, we can add replicas to each of our partitions (shards), just as is shown in the following diagram:
Each partition is a separate PostgreSQL database instance, and each of those instances can have its own replica (or replicas). Essentially, it is the same concept as you will find in a RAID 1+0 setup on the hardware side.
Keep in mind that you can choose from the entire arsenal of features and options discussed in this book (for example, synchronous and asynchronous replication). All the strategies outlined in this book can be combined flexibly. A single technique is usually not enough, so feel free to combine various technologies in different ways to achieve your goals.
Various sharding solutions
In recent years, sharding has emerged as an industry-standard solution to many scalability-related problems. Thus, many programming languages, frameworks, and products are already providing out-of-the-box support for sharding.
When implementing sharding, you can basically choose between two strategies:
- Rely on some framework or middleware
- Rely on PostgreSQL's means to solve the problem
In the following chapters, we will discuss both options briefly. This little overview is not meant to be a comprehensive guide, but rather an overview to get you started with sharding.
PostgreSQL-based sharding
PostgreSQL cannot shard data out of the box, but it has all the interfaces and means required to allow sharding through add-ons. One of these add-ons, which is widely used, is called PL/Proxy. It has been around for many years, and offers superior transparency as well as good scalability. It was originally developed by Skype to scale up their infrastructure.
The idea behind PL/Proxy is basically to use a local virtual table to hide an array of servers making up the table.
PL/Proxy will be discussed in depth in Chapter 13, Scaling with PL/Proxy.