Performing backups should be a regular task, and every administrator is supposed to keep an eye on this vital topic. Fortunately, PostgreSQL provides an easy means to create backups. In this tutorial, you will learn how to back up data by performing some simple dumps and also recover it using PostgreSQL 10.
This article is an excerpt taken from, 'Mastering PostgreSQL 10' written by Hans-Jürgen Schönig. This book highlights the newly introduced features in PostgreSQL 10, and shows you how to build better PostgreSQL applications.
If you are running a PostgreSQL setup, there are two major methods to perform backups: logical dumps (extracting a textual representation of the data using pg_dump) and transaction log shipping.
The idea behind transaction log shipping is to archive binary changes made to the database. Most people claim that transaction log shipping is the only real way to do backups. However, in my opinion, this is not necessarily true.
Many people rely on pg_dump to simply extract a textual representation of the data. pg_dump is also the oldest method of creating a backup and has been around since the very early days of the project (transaction log shipping was added much later). Every PostgreSQL administrator will become familiar with pg_dump sooner or later, so it is important to know how it really works and what it does.
The first thing we want to do is to create a simple textual dump:
[hs@linuxpc ~]$ pg_dump test > /tmp/dump.sql
This is the most simplistic backup you can imagine. pg_dump logs into the local database instance, connects to a database called test, and starts to extract all the data, which will be sent to stdout and redirected to the file. The beauty is that standard output gives you all the flexibility of a Unix system. You can easily compress the data using a pipe or do whatever you want.
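For example, if you want a compressed dump right away, you can pipe the output through gzip (a minimal sketch; the target file name is arbitrary):

[hs@linuxpc ~]$ pg_dump test | gzip > /tmp/dump.sql.gz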
In some cases, you might want to run pg_dump as a different user. All PostgreSQL client programs support a consistent set of command-line parameters to pass user information. If you just want to set the user, use the -U flag:
[hs@linuxpc ~]$ pg_dump -U whatever_powerful_user test > /tmp/dump.sql
The following set of parameters can be found in all PostgreSQL client programs:
...
Connection options:
  -d, --dbname=DBNAME      database to dump
  -h, --host=HOSTNAME      database server host or socket directory
  -p, --port=PORT          database server port number
  -U, --username=NAME      connect as specified database user
  -w, --no-password        never prompt for password
  -W, --password           force password prompt (should happen automatically)
  --role=ROLENAME          do SET ROLE before dump
...
Just pass the information you want to pg_dump, and if you have enough permissions, PostgreSQL will fetch the data. The important thing here is to see how the program really works. Basically, pg_dump connects to the database and opens a large repeatable read transaction that simply reads all the data.
Remember, repeatable read ensures that PostgreSQL creates a consistent snapshot of the data, which does not change throughout the transaction. In other words, a dump is always consistent—no foreign keys will be violated. The output is a snapshot of the data as it was when the dump started. Consistency is a key factor here. It also implies that changes made to the data while the dump is running won't make it into the backup anymore.
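To see what repeatable read means in practice, here is a minimal sketch you could run in psql (t_test is just a placeholder table, not something pg_dump requires):

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM t_test;   -- the snapshot is taken at the first statement
SELECT count(*) FROM t_test;   -- same result, even if other sessions changed the data in between
COMMIT;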
Also, note that the backup is by default in a textual format. This means that you can safely extract data from, say, Solaris, and move it to some other CPU architecture. In the case of binary copies, that is clearly not possible, as the on-disk format depends on your CPU architecture.
If you take a close look at the connection parameters shown in the previous section, you will notice that there is no way to pass a password to pg_dump. You can enforce a password prompt, but you cannot pass the parameter to pg_dump using a command-line option. The reason for that is simple: the password might show up in the process table and be visible to other people. Therefore, this is not supported. The question now is: if pg_hba.conf on the server enforces a password, how can the client program provide it?
There are various means of doing that: environment variables, .pgpass files, and service files. In this section, you will learn about all three methods.
One way to pass all kinds of parameters is to use environment variables. If the information is not explicitly passed to pg_dump, it will look for the missing information in predefined environment variables. A list of all potential settings can be found here: https://www.postgresql.org/docs/10/static/libpq-envars.html.
The environment variables most commonly needed for backups are PGHOST, PGPORT, PGUSER, PGPASSWORD, and PGDATABASE.
The advantage of these environment variables is that the password won't show up in the process table. However, there is more. Consider the following example:
psql -U ... -h ... -p ... -d ...
Suppose you are a system administrator: do you really want to type a long line like that a couple of times every day? If you are working with the very same host again and again, just set those environment variables and connect with plain SQL:
[hs@linuxpc ~]$ export PGHOST=localhost
[hs@linuxpc ~]$ export PGUSER=hs
[hs@linuxpc ~]$ export PGPASSWORD=abc
[hs@linuxpc ~]$ export PGPORT=5432
[hs@linuxpc ~]$ export PGDATABASE=test
[hs@linuxpc ~]$ psql
psql (10.1)
Type "help" for help.
As you can see, there are no command-line parameters anymore. Just type psql and you are in.
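The same applies to pg_dump: with the variables above set, a backup of the test database needs no connection parameters at all (a sketch; the output path is arbitrary):

[hs@linuxpc ~]$ pg_dump > /tmp/dump.sql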
A very common way to store login information is via the use of .pgpass files. The idea is simple: put a file called .pgpass into your home directory and put your login information there. The format is simple:
hostname:port:database:username:password
An example would be:
192.168.0.45:5432:mydb:xy:abc
PostgreSQL offers some nice additional functionality: most fields can contain *. Here is an example:
*:*:*:xy:abc
This means that on every host, on every port, for every database, the user called xy will use abc as the password. To make PostgreSQL use the .pgpass file, make sure that the right file permissions are in place:
chmod 0600 ~/.pgpass
.pgpass can also be used on a Windows system. In this case, the file can be found at %APPDATA%\postgresql\pgpass.conf.
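Once the sample entry shown earlier is in place and the file permissions are correct, a dump of that remote database will no longer prompt for a password (a sketch reusing the example host, database, and user from above):

[hs@linuxpc ~]$ pg_dump -h 192.168.0.45 -p 5432 -U xy mydb > /tmp/mydb.sql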
However, there is not just the .pgpass file. You can also make use of service files. Here is how it works. If you want to connect to the very same servers over and over again, you can create a .pg_service.conf file. It will hold all the connection information you need.
Here is an example of a .pg_service.conf file:
Mac:~ hs$ cat .pg_service.conf
# a sample service
[hansservice]
host=localhost
port=5432
dbname=test
user=hs
password=abc

[paulservice]
host=192.168.0.45
port=5432
dbname=xyz
user=paul
password=cde
To connect to one of the services, just set the environment and connect:
iMac:~ hs$ export PGSERVICE=hansservice
A connection can now be established without passing parameters to psql:
iMac:~ hs$ psql
psql (10.1)
Type "help" for help.

test=#
Alternatively, you can use:
psql service=hansservice
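The same connection string syntax also works for pg_dump, so a service entry can drive a backup as well (a sketch reusing the hansservice example; the output path is arbitrary):

iMac:~ hs$ pg_dump "service=hansservice" > /tmp/dump.sql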
Up to now, you have seen how to dump an entire database. However, this might not be what you want. In many cases, you might just want to extract a subset of tables or schemas. pg_dump can do that and provides a number of switches for this purpose, most importantly -t (--table) to dump individual tables and -n (--schema) to dump entire schemas, along with -T and -N to exclude tables and schemas, as shown in the example below.
Partial dumps can be very useful to speed things up considerably.
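As a sketch of what such a partial dump can look like, the following commands dump a single table with -t and an entire schema with -n (t_test and public are simply the example objects used elsewhere in this chapter):

[hs@linuxpc ~]$ pg_dump -t t_test test > /tmp/t_test.sql
[hs@linuxpc ~]$ pg_dump -n public test > /tmp/public.sql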
So far, you have seen that pg_dump can be used to create text files. The problem is that a text file can only be replayed completely. If you have saved an entire database, you can only replay the entire thing. In many cases, this is not what you want. Therefore, PostgreSQL has additional formats that also offer more functionality.
At this point, four formats are supported:
-F, --format=c|d|t|p output file format (custom, directory, tar, plain text (default))
You have already seen plain, which is just normal text. On top of that, you can use a custom format. The idea behind a custom format is to have a compressed dump, including a table of contents. Here are two ways to create a custom format dump:
[hs@linuxpc ~]$ pg_dump -Fc test > /tmp/dump.fc
[hs@linuxpc ~]$ pg_dump -Fc test -f /tmp/dump.fc
In addition to the table of contents, the compressed dump has one more advantage: it is a lot smaller. The rule of thumb is that a custom format dump is around 90% smaller than the database instance you are about to back up. Of course, this highly depends on the number of indexes and all that, but for many database applications, this rough estimation will hold true.
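If you want to check the ratio for your own setup, you can compare the size of the dump file with what PostgreSQL reports for the database (a sketch; note that pg_database_size measures a single database, not the whole instance):

[hs@linuxpc ~]$ ls -lh /tmp/dump.fc
[hs@linuxpc ~]$ psql test -c "SELECT pg_size_pretty(pg_database_size('test'));"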
Once you have created the backup, you can inspect the backup file:
[hs@linuxpc ~]$ pg_restore --list /tmp/dump.fc
;
; Archive created at 2017-11-04 15:44:56 CET
;     dbname: test
;     TOC Entries: 18
;     Compression: -1
;     Dump Version: 1.12-0
;     Format: CUSTOM
;     Integer: 4 bytes
;     Offset: 8 bytes
;     Dumped from database version: 10.1
;     Dumped by pg_dump version: 10.1
;
; Selected TOC Entries:
;
3103; 1262 16384 DATABASE - test hs
3; 2615 2200 SCHEMA - public hs
3104; 0 0 COMMENT - SCHEMA public hs
1; 3079 13350 EXTENSION - plpgsql
3105; 0 0 COMMENT - EXTENSION plpgsql
187; 1259 16391 TABLE public t_test hs
...
pg_restore --list will return the table of contents of the backup.
Using a custom format is a good idea as the backup will shrink in size. However, there is more: the -Fd option will create a backup in the directory format. Instead of a single file, you will now get a directory containing a couple of files:
[hs@linuxpc ~]$ mkdir /tmp/backup
[hs@linuxpc ~]$ pg_dump -Fd test -f /tmp/backup/
[hs@linuxpc ~]$ cd /tmp/backup/
[hs@linuxpc backup]$ ls -lh
total 86M
-rw-rw-r--. 1 hs hs  85M Jan  4 15:54 3095.dat.gz
-rw-rw-r--. 1 hs hs  107 Jan  4 15:54 3096.dat.gz
-rw-rw-r--. 1 hs hs 740K Jan  4 15:54 3097.dat.gz
-rw-rw-r--. 1 hs hs   39 Jan  4 15:54 3098.dat.gz
-rw-rw-r--. 1 hs hs 4.3K Jan  4 15:54 toc.dat
One advantage of the directory format is that you can use more than one core to perform the backup. In the case of a plain or custom format, only one process will be used by pg_dump. The directory format changes that rule. The following example shows how you can tell pg_dump to use four cores (jobs):
[hs@linuxpc backup]$ rm -rf *
[hs@linuxpc backup]$ pg_dump -Fd test -f /tmp/backup/ -j 4
Note that the more objects you have in your database, the more potential speedup there will be.
To summarize, you learned about creating backups in general. If you've enjoyed reading this post, do check out 'Mastering PostgreSQL 10' to learn how to replay backups and handle global data in PostgreSQL 10. You will also learn how to use PostgreSQL's onboard tools to replicate instances.