Stepping into TAR backups (Intermediate)
For a very long time, the Unix tar command was one of the only methods for obtaining a full binary backup of a PostgreSQL database. This is still the case for more advanced installations that make use of filesystem snapshots, extensively utilize tablespaces, or otherwise outgrow the included management tools. For these advanced scenarios and more, tar is indispensable for circumventing or augmenting the provided tools.
Getting ready
Please refer to the Getting a basic export (Simple) recipe for preparing a sample database.
For the purposes of this example, we will assume that the database directory is /db, and the archived files will go to /archive. Based on this, we need to modify the postgresql.conf file to archive transaction logs during the backup. Change the following configuration variables:
archive_mode = on
archive_command = 'test -f /archive/%f || cp %p /archive/%f'
After PostgreSQL is restarted, the database will be ready for a tar backup.
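Before attempting a backup, it is worth confirming that archiving actually works. As a quick sanity check, we can force a transaction log switch and look for the archived file; pg_switch_xlog is the function name used by PostgreSQL versions of this vintage:
$> psql -c "SELECT pg_switch_xlog();" postgres
$> ls /archive
If nothing appears in /archive, revisit the archive_command setting before continuing.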
How to do it...
Creating a tar backup is done with the following three basic steps, plus a fourth set of commands that is considered good practice:
First, tell PostgreSQL to enter backup mode:
$> psql -c "SELECT pg_start_backup('label');" postgres
Next, we produce the actual backup with tar, instead of pg_dump or pg_basebackup:
$> tar -c -z -f backup.tar.gz /db
Finally, we tell the database to end backup mode:
$> psql -c "SELECT pg_stop_backup();" postgres
We also need the transaction logs archived during the backup. Type these commands as shown:
$> recent=$(ls -r /archive/*.backup | head -1)
$> bstart=$(grep 'START WAL' $recent | sed 's/.* //; s/)//;')
$> echo $bstart > /tmp/MANIFEST
$> echo $recent >> /tmp/MANIFEST
$> find /archive -newer /archive/$bstart \
     ! -newer $recent >> /tmp/MANIFEST
$> sed 's%/.*/%%' /tmp/MANIFEST | sort | uniq \
     > /archive/MANIFEST
$> tar -C /archive -cz -f archive.tar.gz \
     --files-from=/archive/MANIFEST
Obviously, much of this can (and should) be scripted, as in the sketch below. These commands were designed for a standard Linux system. If you are using BSD or another variant, you may need to adapt them before trying this yourself.
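As a minimal sketch of such a script, the following bash wraps the three backup-mode steps from this recipe; the paths and the label are assumptions carried over from our example, not fixed names:
#!/bin/bash
# Sketch only: enter backup mode, archive /db, then always
# leave backup mode, even if tar failed partway through.
psql -c "SELECT pg_start_backup('label');" postgres
tar -c -z -f backup.tar.gz /db
status=$?
psql -c "SELECT pg_stop_backup();" postgres
exit $status
Note that tar may warn about files changing as it reads them; during a backup of a live database, this is expected.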
How it works...
The tar command for creating the backup itself is fairly simple: create (-c) a gzip-compressed (-z) file named backup.tar.gz from the contents of /db, wherever our database lives. Of course, these data files are likely to be changing while they're being backed up, because the process itself can take minutes or hours depending on the size of the database.
Because of this, we call pg_start_backup to start the backup process. To begin with, it performs a checkpoint, committing pending writes to the database files. Afterwards, the database continues normal operation, but also keeps track of which transaction log files are produced during the backup. This is important for future restores.
Next, we invoke pg_stop_backup to complete the backup. This command not only finishes the backup, but also creates a file with a .backup extension that identifies the first and last archive logs necessary to restore the database to full working order. We need the first, the last, and every transaction log in between to restore, which is what the last set of commands is for.
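For reference, the .backup file is plain text. Its contents resemble the following (the exact fields vary somewhat by version, and these values are purely illustrative); the START WAL LOCATION line is what our grep and sed commands parse:
START WAL LOCATION: 0/9000028 (file 000000010000000000000009)
STOP WAL LOCATION: 0/90000F8 (file 000000010000000000000009)
CHECKPOINT LOCATION: 0/9000060
START TIME: 2013-04-01 00:00:01 EST
LABEL: label
STOP TIME: 2013-04-01 00:05:30 EST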
Knowing that the most recent .backup file archived by the database contains this information, we parse it using various Unix commands to identify every file between the first marked archive log and the end of the backup itself; no required file is newer than the .backup file. All of these files are needed to fully restore the database, and the process itself is fairly complicated.
We highly recommend implementing a more robust and tested version of the outlined steps, or using a preexisting backup library or third-party tool; OmniPITR, for example, is often recommended. Our quick and dirty method works, but it should be fairly obvious why pg_basebackup automates and abstracts away most of the complexity in our example. We gain flexibility here, not ease of use.
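For comparison, when a database has no user-created tablespaces, something like the following single pg_basebackup invocation can replace the whole sequence, assuming a version that supports streaming tar output to standard output:
$> pg_basebackup -D - -F t | gzip > backup.tar.gz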
There's more...
Now we should discuss exactly what kind of flexibility we may gain.
Parallel compression
Compressing files is very CPU intensive. Parallel compression utilities such as pigz and pbzip2 are very handy here, and tar works very well with external programs. We can alter the archival command for the /db directory from the previous commands with the -I flag (short for --use-compress-program) to choose our own compression program, as shown in the following command:
$> tar -c -I pigz -f backup.tar.gz /db
Alternatively, since pigz can take parameters for choosing the number of threads, or because certain versions of tar don't support the -I flag, we can pipe the output of tar to pigz instead by using the following command:
$> tar -c /db | pigz -p 4 > backup.tar.gz
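The -p flag simply sets the number of compression threads. On Linux, a variant that matches the machine's core count automatically might look like this:
$> tar -c /db | pigz -p $(nproc) > backup.tar.gz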
Unlike pg_basebackup, these tar commands work with complex databases that make extensive use of tablespaces. Each tablespace can be handled separately and compressed in parallel, drastically reducing compression time.
Some may argue that pg_basebackup does support tablespaces, and it does create .tar.gz files for every user-created tablespace in the database along with base.tar.gz. However, the tar output format of this tool will not stream to standard output if there are user-created tablespaces. This means that our trick of capturing the stream with pigz would not work in such advanced systems. Hence, we used tar in this example.
Making a tar backup standby-ready
With PostgreSQL, a database in standby or streaming mode does not produce its own transaction logs while recovery is in progress; it applies changes taken from some other source of archived transaction logs. This means that backing these files up is often excessive. Remember, we mentioned that pg_basebackup omits them by default for similar reasons. Thankfully, tar can also exclude them, or any other path. Again, we will modify the /db backup command as follows:
$> tar -c -z -f backup.tar.gz --exclude=pg_xlog /db
Now, if the backup.tar.gz file is extracted, it can only be used for standby or streaming replication.
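One caveat when extracting such an archive: PostgreSQL expects the pg_xlog directory (and its archive_status subdirectory) to exist, so they must be recreated by hand. As a sketch, assuming the archive is restored back to /db and the server runs as the postgres user:
$> tar -x -z -f backup.tar.gz -C /
$> mkdir -p /db/pg_xlog/archive_status
$> chown -R postgres:postgres /db/pg_xlog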
Backing up tablespaces
We keep talking about tablespaces, but how do we handle them? PostgreSQL tablespaces appear as symbolic links in the pg_tblspc subdirectory of the database instance. To back these up separately, we first tell tar to ignore them using the following command:
$> tar -c -z -f backup.tar.gz --exclude=pg_tblspc /db
Now we can back up all of the tablespaces that live in that directory. Something like the following bash snippet could do the job nicely:
for f in $(find /db/pg_tblspc -type l); do
    # each link in pg_tblspc is named after a tablespace OID
    d=$(readlink -f $f)
    n=$(basename $f)
    tar -czf /backup/$n.tar.gz -C $d .
done
Each tablespace will be compressed into a separate file named after its database object identifier (OID). Incidentally, this is exactly what pg_basebackup would do. But we can alter any of these commands as much as we desire. For instance, to introduce parallel compression, we can use the following shell code:
for f in $(find /db/pg_tblspc -type l); do
    d=$(readlink -f $f)
    n=$(basename $f)
    tar -cf - -C $d . | pigz -p 8 > /backup/$n.tar.gz
done
Now we can do something pg_basebackup can't: use parallel compression on all tablespaces in the database instead of just the default compression. That is just one example of what we can modify within the backup process itself. It is a bit more work, but scripts can automate most or all of these extra steps.
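Restoring these archives is simply the reverse. As a sketch, assuming /backup holds only the tablespace archives and the pg_tblspc symbolic links were already restored with the base backup, each file can be unpacked back through its link:
for f in /backup/*.tar.gz; do
    # recover the tablespace OID from the archive name
    n=$(basename $f .tar.gz)
    d=$(readlink -f /db/pg_tblspc/$n)
    tar -x -z -f $f -C $d
done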
Backing up archived transaction logs
Since our database is in archive mode, it may be producing archived transaction logs at a prodigious rate. If we want to perform point-in-time recovery (PITR) or run certain types of database standby, it is generally a good idea to preserve these logs on tape as well. For example, if executed at midnight, the following snippet of bash would compress the previous day's archived logs:
find /archive/ -type f -name '0000*' -mtime +0 \
     -printf '%f\n' > /archive/MANIFEST
tar -C /archive/ --files-from=/archive/MANIFEST -c -I pigz \
    -f /backup/axlog.$(date -d '-1 day' +"%Y-%m-%d").tar.gz
find /archive/ -type f -name '0000*' -mtime +0 -delete
Now these backup files can be stored elsewhere for PITR restores. For very important databases, being able to use old backup files and transaction logs means that the database can be restored to any previous time since its creation. Always be sure to have access to old archive logs if this is a desired option.
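As a sketch of how these files are eventually consumed, a PITR restore in PostgreSQL versions of this era is driven by a recovery.conf file placed in the restored data directory; the target time below is purely an illustration:
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2013-04-01 12:00:00'
Once the server replays transaction logs up to the requested time, it exits recovery and becomes available for normal use.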