The term scalability means the ability of a software system to grow as the business using it grows. PostgreSQL provides some features that help you build a scalable solution but, strictly speaking, PostgreSQL itself is not scalable. It can effectively utilize the resources of a single machine in the following ways:
It uses multiple CPU cores to execute a single query faster with the parallel query feature
When configured properly, it can use all available memory for caching
The size of the database is not limited: PostgreSQL can utilize multiple hard disks when multiple tablespaces are created, and with partitioning the hard disks can be accessed simultaneously, which makes data processing faster, as sketched below
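For instance, a minimal sketch of spreading a partitioned table over two disks could look like this (the mount points and the sales table are hypothetical, and the directories must already exist and be writable by the PostgreSQL server):

```sql
-- Tablespaces placed on different physical disks (the paths are made up)
CREATE TABLESPACE disk1 LOCATION '/mnt/disk1/pgdata';
CREATE TABLESPACE disk2 LOCATION '/mnt/disk2/pgdata';

-- A partitioned table whose partitions are stored on different disks
CREATE TABLE sales (
    sale_id bigint  NOT NULL,
    sold_at date    NOT NULL,
    amount  numeric
) PARTITION BY RANGE (sold_at);

CREATE TABLE sales_2018 PARTITION OF sales
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01') TABLESPACE disk1;
CREATE TABLE sales_2019 PARTITION OF sales
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01') TABLESPACE disk2;
```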
However, when it comes to spreading a database solution across multiple machines, it can be quite problematic because a standard PostgreSQL server can only run on a single machine. In this article, we will look at different scaling scenarios and their implementation in PostgreSQL.
The requirement for a system to be scalable means that a system that supports a business now should also be able to support the same business, with the same quality of service, as it grows.
This article is an excerpt taken from the book 'Learning PostgreSQL 11 - Third Edition' written by Andrey Volkov and Salahadin Juba. The book explores the concepts of relational databases and their core principles. You’ll get to grips with using data warehousing in analytical solutions and reports and scaling the database for high availability and performance.
Let's say a database can store 1 GB of data and effectively process 100 queries per second. What if with the development of the business, the amount of data being processed grows 100 times? Will it be able to support 10,000 queries per second and process 100 GB of data? Maybe not now, and not in the same installation. However, a scalable solution should be ready to be expanded to be able to handle the load as soon as it is needed.
In scenarios where better performance is required, it is quite common to set up more servers that handle additional load and to copy the same data to them from a master server. In scenarios where high availability is required, it is also typical to continuously copy the data to a standby server so that it can take over in case the master server crashes.
Scalable PostgreSQL solution
Replication can be used in many scaling scenarios. Its primary purpose is to create and maintain a backup database in case of system failure. This is especially true for physical replication. However, replication can also be used to improve the performance of a solution based on PostgreSQL. Sometimes, third-party tools can be used to implement complex scaling scenarios.
Scaling for heavy querying
Imagine there's a system that's supposed to handle a lot of read requests. For example, there could be an application that implements an HTTP API endpoint that supports the auto-completion functionality on a website. Each time a user enters a character in a web form, the system searches in the database for objects whose name starts with the string the user has entered. The number of queries can be very big because of the large number of users, and also because several requests are processed for every user session. To handle large numbers of requests, the database should be able to utilize multiple CPU cores. In case the number of simultaneous requests is really large, the number of cores required to process them can be greater than a single machine could have.
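For illustration only, each keystroke could translate into a prefix search like the following; the products table and its name column are made up, and the text_pattern_ops index is what allows such LIKE queries to use an index scan:

```sql
-- Hypothetical auto-completion query; 'postg' stands for whatever the user has typed so far
SELECT name
FROM products
WHERE name LIKE 'postg%'
ORDER BY name
LIMIT 10;

-- A B-tree index with text_pattern_ops supports prefix matching with LIKE
CREATE INDEX products_name_prefix_idx ON products (name text_pattern_ops);
```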
The same applies to a system that is supposed to handle multiple heavy queries at the same time. There don't need to be many queries, but when the queries themselves are big, using as many CPU cores as possible offers a performance benefit, especially when parallel query execution is used.
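As a quick sketch, the degree of parallelism for a heavy query can be influenced per session with standard PostgreSQL parameters; the sales table here is a hypothetical one:

```sql
-- Allow up to four parallel workers for a single query in this session
SET max_parallel_workers_per_gather = 4;

-- The plan should show a Gather node with parallel workers underneath
EXPLAIN
SELECT sold_at, sum(amount)
FROM sales
GROUP BY sold_at;
```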
In such scenarios, where one database cannot handle the load, it's possible to set up several databases, configure replication from the master database to all of them so that each works as a hot standby, and then let the application query different databases for different requests. The application itself can be smart and query a different database each time, but that requires a special implementation of the data-access component of the application.
Another option is to use a tool called Pgpool-II, which can work as a load balancer in front of several PostgreSQL databases. The tool exposes a SQL interface, and applications can connect to it as if it were a real PostgreSQL server. Pgpool-II then redirects each query to the database that is executing the fewest queries at that moment; in other words, it performs load-balancing.
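As a quick check that load balancing is in place, a client can connect to Pgpool-II just as it would to PostgreSQL and use Pgpool-II's own SHOW command; this sketch assumes a standard Pgpool-II installation:

```sql
-- Run over a normal client connection to Pgpool-II (not directly to PostgreSQL)
SHOW POOL_NODES;  -- lists the backend nodes, their status, weights, and how queries are distributed
```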
Yet another option is to scale the application together with the databases so that one instance of the application connects to one instance of the database. In that case, the users of the application should connect to one of the many application instances. This can be achieved with HTTP load-balancing.
Data sharding
When the problem is not the number of concurrent queries, but the size of the database and the speed of a single query, a different approach can be implemented. The data can be separated into several servers, which will be queried in parallel, and then the result of the queries will be consolidated outside of those databases. This is called data sharding.
PostgreSQL provides a way to implement sharding based on table partitioning, where the partitions are located on different servers and another server, the master, uses them as foreign tables. When a query is performed on a parent table defined on the master server, PostgreSQL can recognize, from the WHERE clause and the definitions of the partitions, which partitions contain the requested data and query only those partitions. Depending on the query, joins, grouping, and aggregation can sometimes be performed on the remote servers. PostgreSQL can query different partitions in parallel, which effectively utilizes the resources of several machines. With all this, it's possible to build a solution where applications connect to a single database that physically executes their queries on different database servers, depending on the data being queried.
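A minimal sketch of this approach on PostgreSQL 11, assuming the postgres_fdw extension and two shard servers whose host names, database name, and credentials are made up (and where the remote measurements tables already exist), could look like this:

```sql
-- On the master server: declare the shard servers and how to log in to them
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'appdata');
CREATE SERVER shard2 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard2.example.com', dbname 'appdata');

CREATE USER MAPPING FOR CURRENT_USER SERVER shard1 OPTIONS (user 'app', password 'secret');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard2 OPTIONS (user 'app', password 'secret');

-- The parent table is partitioned by a sharding key
CREATE TABLE measurements (
    sensor_id   int         NOT NULL,
    measured_at timestamptz NOT NULL,
    value       numeric
) PARTITION BY HASH (sensor_id);

-- Each partition is a foreign table pointing to a table on a different server
CREATE FOREIGN TABLE measurements_p0
    PARTITION OF measurements FOR VALUES WITH (MODULUS 2, REMAINDER 0)
    SERVER shard1 OPTIONS (table_name 'measurements_p0');
CREATE FOREIGN TABLE measurements_p1
    PARTITION OF measurements FOR VALUES WITH (MODULUS 2, REMAINDER 1)
    SERVER shard2 OPTIONS (table_name 'measurements_p1');

-- Queries against the parent table are routed to the relevant shards
SELECT avg(value) FROM measurements WHERE sensor_id = 42;
```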
It's also possible to build sharding algorithms into the applications that use PostgreSQL. In short, applications would be expected to know what data is located in which database, write it only there, and read it only from there. This would add a lot of complexity to the applications.
Another option is to use one of the PostgreSQL-based sharding solutions available on the market, whether commercial or open source. They have their own pros and cons, but a common problem is that they are based on older releases of PostgreSQL and don't use the most recent features (sometimes providing their own features instead).
One of the most popular sharding solutions is Postgres-XL, which implements a shared-nothing architecture using multiple servers running PostgreSQL. The system has several components:
Multiple data nodes: Store the data
A single global transaction monitor (GTM): Manages the cluster, provides global transaction consistency
Multiple coordinator nodes: Support user connections, build query-execution plans, and interact with the GTM and the data nodes
Postgres-XL implements the same API as PostgreSQL, therefore the applications don't need to treat the server in any special way. It is ACID-compliant, meaning it supports transactions and integrity constraints. The COPY command is also supported.
The main benefits of using Postgres-XL are as follows:
It can scale to support more reading operations by adding more data nodes
It can scale to support more writing operations by adding more coordinator nodes
The current release of Postgres-XL (at the time of writing) is based on PostgreSQL 10, which is relatively new
The main downside of Postgres-XL is that it does not provide any high-availability features out of the box. When more servers are added to a cluster, the probability of the failure of any of them increases. That's why you should take care with backups or implement replication of the data nodes themselves.
Postgres-XL is open source, but commercial support is available.
Another solution worth mentioning is Greenplum. It's positioned as an implementation of a massively parallel-processing database, specifically designed for data warehouses. It has the following components:
Master node: Manages user connections, builds query execution plans, manages transactions
Data nodes: Store the data and perform queries
Greenplum also implements the PostgreSQL API, and applications can connect to a Greenplum database without any changes. It supports transactions, but support for integrity constraints is limited. The COPY command is supported.
The main benefits of Greenplum are as follows:
It can scale to support more reading operations by adding more data nodes.
It supports column-oriented table organization, which can be useful for data-warehousing solutions.
Data compression is supported.
High-availability features are supported out of the box. It's possible (and recommended) to add a secondary master that would take over in case the primary master crashes. It's also possible to add mirrors to the data nodes to prevent data loss.
The drawbacks are as follows:
It doesn't scale to support more writing operations. Everything goes through the single master node and adding more data nodes does not make writing faster. However, it's possible to import data from files directly on the data nodes.
It uses PostgreSQL 8.4 at its core. Greenplum has a lot of improvements and new features added to the base PostgreSQL code, but it's still based on a very old release; however, the system is being actively developed.
Greenplum doesn't support foreign keys, and support for unique constraints is limited.
There are commercial and open source editions of Greenplum.
Scaling for a large number of connections
Yet another use case related to scalability is when the number of database connections is large. When a single database is used in an environment with a lot of microservices, each with its own connection pool, even if they don't perform too many queries, it's possible that hundreds or even thousands of connections are open to the database. Each connection consumes server resources, so just the requirement to handle a large number of connections can already be a problem, even before any queries are performed.
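The current number of open connections and the configured limit can be checked with standard PostgreSQL facilities, for example:

```sql
-- How many connections are currently open, grouped by database
SELECT datname, count(*) AS connections
FROM pg_stat_activity
GROUP BY datname;

-- The hard limit configured for the server
SHOW max_connections;
```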
If applications don't use connection pooling and instead open connections only when they need to query the database and close them afterwards, another problem can occur. Establishing a database connection takes time: not too much, but when the number of operations is large, the total overhead becomes significant.
There is a tool named PgBouncer that implements connection-pooling functionality. It can accept connections from many applications as if it were a PostgreSQL server and then open a limited number of connections towards the database, reusing the same database connections for multiple application connections. The process of establishing a connection from an application to PgBouncer is much faster than connecting to a real database because PgBouncer doesn't need to initialize a database backend process for the session.
PgBouncer can create multiple connection pools that work in one of three modes:
Session mode: A connection to a PostgreSQL server is used for the lifetime of a client connection to PgBouncer. Such a setup could be used to speed up the connection process on the application side. This is the default mode.
Transaction mode: A connection to PostgreSQL is used for a single transaction that a client performs. This can be used to reduce the number of connections on the PostgreSQL side when only a few transactions are performed simultaneously.
Statement mode: A database connection is used for a single statement. Then it is returned to the pool and a different connection is used for the next statement. This mode is similar to the transaction mode, though more aggressive. Note that multi-statement transactions are not possible when statement mode is used.
Different pools can be set up to work in different modes.
It's possible to let PgBouncer connect to multiple PostgreSQL servers, thus working as a reverse proxy.
A typical way of using PgBouncer is as follows.
PgBouncer establishes several connections to the database. When an application connects to PgBouncer and starts a transaction, PgBouncer assigns an existing database connection to that application, forwards all SQL commands to the database, and delivers the results back. When the transaction is finished, PgBouncer will dissociate the connections, but not close them. If another application starts a transaction, the same database connection could be used. Such a setup requires configuring PgBouncer to work in transaction mode.
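PgBouncer also exposes an administration console as a virtual database named pgbouncer; assuming a default installation listening on port 6432 and an administrative user configured in admin_users, pool usage can be inspected like this:

```sql
-- Connect first, for example: psql -p 6432 -U pgbouncer pgbouncer (details depend on your setup)
SHOW POOLS;    -- per-pool counts of client and server connections
SHOW CLIENTS;  -- currently connected client sessions
```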
PostgreSQL provides several ways to implement replication that would maintain a copy of the data from a database on another server or servers. This can be used as a backup or a standby solution that would take over in case the main server crashes. Replication can also be used to improve the performance of a software system by making it possible to distribute the load on several database servers.
In this article, we discussed the problem of building scalable solutions based on PostgreSQL utilizing the resources of several servers. We looked at scaling for heavy querying, data sharding, and scaling for a large number of connections. If you enjoyed reading this article and want to explore other topics, be sure to check out the book 'Learning PostgreSQL 11 - Third Edition'.