Managing HA in PostgreSQL is very important to ensure that database clusters maintain exceptional uptime and strong operational performance so that your data is always available to the application.
Master-slave might be the most basic and easiest HA architecture for developers. It is based on one master database with one or more standby servers. These standby databases will remain synchronized (or almost synchronized) with the master, depending on whether the replication is synchronous or asynchronous.
It is important to understand that PostgreSQL does not have a native mechanism to control failovers, that is, when the master fails automatically, the standby server becomes the new master with downtime as close to 0 as possible. To perform this procedure, there are third-party tools such as repmgr, pgpool-II, or patroni, to name a few.
These tools are placed in a layer above PostgreSQL, and they control the health status of the master; when a problem occurs, these tools fire a series of actions to promote the standby server as the new master.
There are several ways to classify a standby database:
- By the nature of the replication:
- Physical standbys: Disk blocks are copied.
- Logical standbys: The streaming of the data changes.
- By the synchronicity of the transactions:
- Asynchronous: There is a possibility of data loss.
- Synchronous: There is no possibility of data loss; the commits in the master wait for the response of the standby.
- By usage:
- Hot standbys: Support read-only connections; the hot standbys are configured for synchronous-commit mode, so their master server must wait for the hot standbys to confirm that they have consolidated the transaction log (when a commit statement is performed on the master, the progress of the commit will only be completed after all of the hot standbys have finished consolidating the transaction log).
- Async standbys: These kinds of servers are configured by asynchronous-commit mode; therefore, the master server will not wait for the async standbys to consolidate the transaction log (when a commit statement is performed on the master, the progress of the commit will not wait for these async standbys to consolidate).
PostgreSQL uses a stream of WAL records to synchronize the standby databases. They can be synchronous or asynchronous, and the entire database server is replicated.
However, a master-slave setup is not enough to effectively ensure HA, as we also need to handle failures. To handle failures, we need to be able to detect them. Once we know there is a failure, for example, errors on the master, or the master is not responding, then we can select a slave and failover mechanism to it with the smallest amount of delay possible. It is important that this process is as efficient as possible, in order to restore full functionality so that the applications can start functioning again. PostgreSQL itself does not include an automatic failover mechanism, so it will require some custom script or third-party tools for this automation.
After a failover happens, the applications need to be notified accordingly so that they can start using the new master. Additionally, we need to evaluate the state of our architecture after a failover because we can run into a situation where we only have the new master running (for instance, we had a master and only one slave before the issue). In that case, we will need to somehow add a slave so as to recreate the master-slave setup we originally had for HA.