Performing backups should be a regular task, and every administrator is supposed to keep an eye on this vital topic. Fortunately, PostgreSQL provides an easy means to create backups. In this tutorial, you will learn how to back up data by performing some simple dumps and also recover it using PostgreSQL 10.
This article is an excerpt taken from, 'Mastering PostgreSQL 10' written by Hans-Jürgen Schönig. This book highlights the newly introduced features in PostgreSQL 10, and shows you how to build better PostgreSQL applications.
If you are running a PostgreSQL setup, there are two major methods to perform backups: logical dumps (extracting a textual representation of the data using pg_dump) and transaction log shipping.
The idea behind transaction log shipping is to archive binary changes made to the database. Most people claim that transaction log shipping is the only real way to do backups. However, in my opinion, this is not necessarily true.
Many people rely on pg_dump to simply extract a textual representation of the data. pg_dump is also the oldest method of creating a backup and has been around since the very early days of the project (transaction log shipping was added much later). Every PostgreSQL administrator will become familiar with pg_dump sooner or later, so it is important to know how it really works and what it does.
The first thing we want to do is to create a simple textual dump:
[hs@linuxpc ~]$ pg_dump test > /tmp/dump.sql
This is the most simplistic backup you can imagine. pg_dump logs into the local database instance, connects to a database called test, and starts to extract all the data, which will be sent to stdout and redirected to the file. The beauty is that standard output gives you all the flexibility of a Unix system. You can easily compress the data using a pipe or do whatever you want.
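For example, if you want a compressed dump right away, you can pipe the output through gzip (a minimal sketch; the target file name is arbitrary):

[hs@linuxpc ~]$ pg_dump test | gzip > /tmp/dump.sql.gz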
In some cases, you might want to run pg_dump as a different user. All PostgreSQL client programs support a consistent set of command-line parameters to pass user information. If you just want to set the user, use the -U flag:
[hs@linuxpc ~]$ pg_dump -U whatever_powerful_user test > /tmp/dump.sql
The following set of parameters can be found in all PostgreSQL client programs:
...
Connection options:
  -d, --dbname=DBNAME      database to dump
  -h, --host=HOSTNAME      database server host or socket directory
  -p, --port=PORT          database server port number
  -U, --username=NAME      connect as specified database user
  -w, --no-password        never prompt for password
  -W, --password           force password prompt (should happen automatically)
  --role=ROLENAME          do SET ROLE before dump
...
Just pass the information you want to pg_dump, and if you have enough permissions, PostgreSQL will fetch the data. The important thing here is to see how the program really works. Basically, pg_dump connects to the database and opens a large repeatable read transaction that simply reads all the data.
Remember, repeatable read ensures that PostgreSQL creates a consistent snapshot of the data, which does not change throughout the transaction. In other words, a dump is always consistent—no foreign keys will be violated. The output is a snapshot of the data as it was when the dump started. Consistency is a key factor here. It also implies that changes made to the data while the dump is running won't make it into the backup anymore.
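To see what repeatable read means in practice, here is a minimal sketch you could run in psql (t_test is just a placeholder table, not something pg_dump requires):

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM t_test;   -- the snapshot is taken at the first statement
SELECT count(*) FROM t_test;   -- same result, even if other sessions changed the data in between
COMMIT;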
Also, note that the backup is by default in a textual format. This means that you can safely extract data from, say, Solaris, and move it to some other CPU architecture. In the case of binary copies, that is clearly not possible, as the on-disk format depends on your CPU architecture.
If you take a close look at the connection parameters shown in the previous section, you will notice that there is no way to pass a password to pg_dump. You can enforce a password prompt, but you cannot pass the parameter to pg_dump using a command-line option. The reason for that is simple: the password might show up in the process table and be visible to other people. Therefore, this is not supported. The question now is: if pg_hba.conf on the server enforces a password, how can the client program provide it?
There are various means of doing that: environment variables, .pgpass files, and service files. In this section, you will learn about all three methods.
One way to pass all kinds of parameters is to use environment variables. If the information is not explicitly passed to pg_dump, it will look for the missing information in predefined environment variables. A list of all potential settings can be found here: https://www.postgresql.org/docs/10/static/libpq-envars.html.
The environment variables most commonly needed for backups are PGHOST, PGPORT, PGUSER, PGPASSWORD, and PGDATABASE.
The advantage of these environment variables is that the password won't show up in the process table. However, there is more. Consider the following example:
psql -U ... -h ... -p ... -d ...
Suppose you are a system administrator: do you really want to type a long line like that a couple of times every day? If you are working with the very same host again and again, just set those environment variables and connect with plain SQL:
[hs@linuxpc ~]$ export PGHOST=localhost
[hs@linuxpc ~]$ export PGUSER=hs
[hs@linuxpc ~]$ export PGPASSWORD=abc
[hs@linuxpc ~]$ export PGPORT=5432
[hs@linuxpc ~]$ export PGDATABASE=test
[hs@linuxpc ~]$ psql
psql (10.1)
Type "help" for help.
As you can see, there are no command-line parameters anymore. Just type psql and you are in.
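The same applies to pg_dump: with the variables above set, a backup of the test database needs no connection parameters at all (a sketch; the output path is arbitrary):

[hs@linuxpc ~]$ pg_dump > /tmp/dump.sql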
A very common way to store login information is via the use of .pgpass files. The idea is simple: put a file called .pgpass into your home directory and put your login information there. The format is simple:
hostname:port:database:username:password
An example would be:
192.168.0.45:5432:mydb:xy:abc
PostgreSQL offers some nice additional functionality: most fields can contain *. Here is an example:
*:*:*:xy:abc
This means that on every host, on every port, for every database, the user called xy will use abc as the password. To make PostgreSQL use the .pgpass file, make sure that the right file permissions are in place:
chmod 0600 ~/.pgpass
.pgpass can also be used on a Windows system. In this case, the file can be found at %APPDATA%\postgresql\pgpass.conf.
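Once the sample entry shown earlier is in place and the file permissions are correct, a dump of that remote database will no longer prompt for a password (a sketch reusing the example host, database, and user from above):

[hs@linuxpc ~]$ pg_dump -h 192.168.0.45 -p 5432 -U xy mydb > /tmp/mydb.sql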
However, there is not just the .pgpass file. You can also make use of service files. Here is how it works. If you want to connect to the very same servers over and over again, you can create a .pg_service.conf file. It will hold all the connection information you need.
Here is an example of a .pg_service.conf file:
Mac:~ hs$ cat .pg_service.conf
# a sample service
[hansservice]
host=localhost
port=5432
dbname=test
user=hs
password=abc

[paulservice]
host=192.168.0.45
port=5432
dbname=xyz
user=paul
password=cde
To connect to one of the services, just set the environment and connect:
iMac:~ hs$ export PGSERVICE=hansservice
A connection can now be established without passing parameters to psql:
iMac:~ hs$ psql
psql (10.1)
Type "help" for help.

test=#
Alternatively, you can use:
psql service=hansservice
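The same connection string syntax also works for pg_dump, so a service entry can drive a backup as well (a sketch reusing the hansservice example; the output path is arbitrary):

iMac:~ hs$ pg_dump "service=hansservice" > /tmp/dump.sql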
Up to now, you have seen how to dump an entire database. However, this might not be what you want. In many cases, you might just want to extract a subset of tables or schemas. pg_dump can do that and provides a number of switches for this purpose, most importantly -t (--table) to dump individual tables and -n (--schema) to dump entire schemas, along with -T and -N to exclude tables and schemas, as shown in the example below.
Partial dumps can be very useful to speed things up considerably.
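As a sketch of what such a partial dump can look like, the following commands dump a single table with -t and an entire schema with -n (t_test and public are simply the example objects used elsewhere in this chapter):

[hs@linuxpc ~]$ pg_dump -t t_test test > /tmp/t_test.sql
[hs@linuxpc ~]$ pg_dump -n public test > /tmp/public.sql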
So far, you have seen that pg_dump can be used to create text files. The problem is that a text file can only be replayed completely. If you have saved an entire database, you can only replay the entire thing. In many cases, this is not what you want. Therefore, PostgreSQL has additional formats that also offer more functionality.
At this point, four formats are supported:
-F, --format=c|d|t|p output file format (custom, directory, tar, plain text (default))
You have already seen plain, which is just normal text. On top of that, you can use a custom format. The idea behind a custom format is to have a compressed dump, including a table of contents. Here are two ways to create a custom format dump:
[hs@linuxpc ~]$ pg_dump -Fc test > /tmp/dump.fc
[hs@linuxpc ~]$ pg_dump -Fc test -f /tmp/dump.fc
In addition to the table of contents, the compressed dump has one more advantage: it is a lot smaller. The rule of thumb is that a custom format dump is around 90% smaller than the database instance you are about to back up. Of course, this highly depends on the number of indexes and all that, but for many database applications, this rough estimation will hold true.
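If you want to check the ratio for your own setup, you can compare the size of the dump file with what PostgreSQL reports for the database (a sketch; note that pg_database_size measures a single database, not the whole instance):

[hs@linuxpc ~]$ ls -lh /tmp/dump.fc
[hs@linuxpc ~]$ psql test -c "SELECT pg_size_pretty(pg_database_size('test'));"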
Once you have created the backup, you can inspect the backup file:
[hs@linuxpc ~]$ pg_restore --list /tmp/dump.fc
;
; Archive created at 2017-11-04 15:44:56 CET
;     dbname: test
;     TOC Entries: 18
;     Compression: -1
;     Dump Version: 1.12-0
;     Format: CUSTOM
;     Integer: 4 bytes
;     Offset: 8 bytes
;     Dumped from database version: 10.1
;     Dumped by pg_dump version: 10.1
;
; Selected TOC Entries:
;
3103; 1262 16384 DATABASE - test hs
3; 2615 2200 SCHEMA - public hs
3104; 0 0 COMMENT - SCHEMA public hs
1; 3079 13350 EXTENSION - plpgsql
3105; 0 0 COMMENT - EXTENSION plpgsql
187; 1259 16391 TABLE public t_test hs
...
pg_restore --list will return the table of contents of the backup.
Using a custom format is a good idea as the backup will shrink in size. However, there is more: the -Fd option will create a backup in the directory format. Instead of a single file, you will now get a directory containing a couple of files:
[hs@linuxpc ~]$ mkdir /tmp/backup
[hs@linuxpc ~]$ pg_dump -Fd test -f /tmp/backup/
[hs@linuxpc ~]$ cd /tmp/backup/
[hs@linuxpc backup]$ ls -lh
total 86M
-rw-rw-r--. 1 hs hs  85M Jan  4 15:54 3095.dat.gz
-rw-rw-r--. 1 hs hs  107 Jan  4 15:54 3096.dat.gz
-rw-rw-r--. 1 hs hs 740K Jan  4 15:54 3097.dat.gz
-rw-rw-r--. 1 hs hs   39 Jan  4 15:54 3098.dat.gz
-rw-rw-r--. 1 hs hs 4.3K Jan  4 15:54 toc.dat
One advantage of the directory format is that you can use more than one core to perform the backup. In the case of a plain or custom format, only one process will be used by pg_dump. The directory format changes that rule. The following example shows how you can tell pg_dump to use four cores (jobs):
[hs@linuxpc backup]$ rm -rf *
[hs@linuxpc backup]$ pg_dump -Fd test -f /tmp/backup/ -j 4
Note that the more objects you have in your database, the more potential speedup there will be.
To summarize, you learned about creating backups in general. If you've enjoyed reading this post, do check out 'Mastering PostgreSQL 10' to learn how to replay backups and handle global data in PostgreSQL 10. You will also learn how to use PostgreSQL's onboard tools to replicate instances.