Performing actions on many tables
As a database administrator, you will often need to apply multiple commands as part of the same overall task. This task could be one of the following:
- Performing many different actions on multiple tables
- Performing the same action on multiple tables
- Performing the same action on multiple tables in parallel
- Performing different actions, one on each table, in parallel
The first is a general case where you need to make a set of coordinated changes. The solution is to write a script, as we’ve already discussed. We can also call this static scripting because you write the script manually and then execute it.
The second type of task can be achieved very simply with dynamic scripts, where we write a script that writes another script. This technique is the main topic of this recipe.
Performing actions in parallel sounds cool, and it would be useful if it were easy. In some ways, it is, but trying to run multiple tasks concurrently and trap and understand all the errors is much harder. And if you’re thinking it won’t matter if you don’t check for errors, think again. If you run tasks in parallel, then you cannot run them inside the same transaction, so you need error handling in case one part fails.
Don’t worry! Running in parallel is usually not as bad as it may seem after reading the previous paragraph, and we’ll explain it after looking at a few basic examples.
Getting ready
Let’s create a basic schema to run some examples on:
postgres=# create schema test;
CREATE SCHEMA
postgres=# create table test.a (col1 INTEGER);
CREATE TABLE
postgres=# create table test.b (col1 INTEGER);
CREATE TABLE
postgres=# create table test.c (col1 INTEGER);
CREATE TABLE
How to do it…
Our task is to run a SQL statement of the following form, with X as the table name, against each of our three test tables:
ALTER TABLE X
ADD COLUMN last_update_timestamp TIMESTAMP WITH TIME ZONE DEFAULT current_timestamp;
The steps are as follows:
- Our starting point is a script that lists the tables that we want to perform tasks against – something like the following:
postgres=# SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE n.nspname = 'test' AND c.relkind = 'r';
- This displays the list of tables that we will act upon (so that you can check it):
 nspname | relname
---------+---------
 test    | a
 test    | b
 test    | c
(3 rows)
- We can then use the preceding SQL to generate the text for a SQL script, substituting the schema name and table name in the SQL text:
postgres=# SELECT format('ALTER TABLE %I.%I ADD COLUMN last_update_timestamp TIMESTAMP WITH TIME ZONE DEFAULT current_timestamp;', n.nspname, c.relname)
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE n.nspname = 'test' AND c.relkind = 'r';
- Finally, we can run the script and watch the results (success!):
postgres=# \gexec
ALTER TABLE
ALTER TABLE
ALTER TABLE
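If you want to double-check the outcome, describing one of the tables with psql's \d meta-command should now show the new last_update_timestamp column:

postgres=# \d test.a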
How it works…
Overall, this is just an example of dynamic scripting, and it has been used by DBAs for many decades, even before PostgreSQL was born.
The \gexec command executes each row of the preceding query's result as a SQL statement, so be very careful to test your query before you run it in production.
The format function takes a template string as its first argument and replaces each occurrence of %I with the corresponding additional argument (in our case, the values of n.nspname and c.relname). %I treats the value as a SQL identifier, adding double quotes as appropriate. This is extremely important if some joker or attacker creates a table like this:
postgres=# create table test."; DROP TABLE customer;" (col1 INTEGER);
If the script used just %s rather than %I, then it would generate the following SQL, which would result in you dropping the customer table if it exists. So, for security purposes, you should use %I:
ALTER TABLE test.a ADD COLUMN last_update_timestamp TIMESTAMP WITH TIME ZONE DEFAULT current_timestamp;
ALTER TABLE test.; DROP TABLE customer; ADD COLUMN last_update_timestamp TIMESTAMP WITH TIME ZONE DEFAULT current_timestamp;
ALTER TABLE test.b ADD COLUMN last_update_timestamp TIMESTAMP WITH TIME ZONE DEFAULT current_timestamp;
ALTER TABLE test.c ADD COLUMN last_update_timestamp TIMESTAMP WITH TIME ZONE DEFAULT current_timestamp;
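For comparison, with %I the malicious table name is safely double-quoted, so the generated statement merely alters that oddly named table instead of dropping anything:

ALTER TABLE test."; DROP TABLE customer;" ADD COLUMN last_update_timestamp TIMESTAMP WITH TIME ZONE DEFAULT current_timestamp;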
Dynamic scripting can also be called a quick and dirty approach. The previous scripts didn’t filter out views and other objects in the test schema, so you’ll need to add that yourself, or not, as required.
There is another way of doing this as well:
DO $$
DECLARE t record;
BEGIN
  FOR t IN SELECT c.*, n.nspname
             FROM pg_class c JOIN pg_namespace n
               ON c.relnamespace = n.oid
            WHERE n.nspname = 'test'
              AND c.relkind = 'r' /* ; not needed */
  LOOP
    EXECUTE format(
      'ALTER TABLE %I.%I
       ADD COLUMN last_update_timestamp
       TIMESTAMP WITH TIME ZONE'
      , t.nspname, t.relname);
  END LOOP;
END $$;
I prefer not to use this method because it executes the SQL directly, without letting you review it beforehand or keep the script afterward.
The preceding syntax with DO is called an anonymous code block because it’s like a function without a name.
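If you do want some visibility when using a DO block, one option (a sketch, not the only way) is to build each statement into a variable, print it with RAISE NOTICE, and only then execute it:

DO $$
DECLARE
  t record;
  stmt text;
BEGIN
  FOR t IN SELECT c.relname, n.nspname
             FROM pg_class c JOIN pg_namespace n
               ON c.relnamespace = n.oid
            WHERE n.nspname = 'test'
              AND c.relkind = 'r'
  LOOP
    stmt := format('ALTER TABLE %I.%I ADD COLUMN last_update_timestamp TIMESTAMP WITH TIME ZONE',
                   t.nspname, t.relname);
    RAISE NOTICE '%', stmt;  -- show each statement as it is run
    EXECUTE stmt;
  END LOOP;
END $$;

This doesn’t fully address the objection – the statements still run immediately – but at least the server log and your psql session record exactly what was executed.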
There’s more…
Earlier, I said I’d explain how to run multiple tasks in parallel. A few practical approaches are possible, each deserving a bit of discussion.
Making tasks run in parallel can be thought of as subdividing the main task into 2, 4, 8, or more subscripts, rather than one large script.
First, you should note that error-checking gets harder as you spawn more parallel tasks, whereas performance improves the most for the first few subdivisions. Also, intensive tasks are often constrained by CPU, RAM, or I/O resources. This means that splitting the main task into two to four parallel subtasks is usually the sweet spot; going much beyond that isn’t practical without some kind of tool to help us manage them.
There are two approaches here, depending on the two types of tasks:
- A task consists of many smaller tasks, all roughly of the same size.
- A task consists of many smaller tasks, and the execution times vary according to the size and complexity of the database object.
If we have lots of smaller tasks, then we can simply run our scripts multiple times using a simple round-robin split so that each subscript runs a part of all the subtasks. Here is how to do it: each row in pg_class has a system column called oid, whose value is a 32-bit number allocated from an internal counter on table creation. Therefore, about half of the tables will have even values of oid, and we can achieve an even split by adding the following clauses:
- Script 1: Add WHERE c.oid % 2 = 0
- Script 2: Add WHERE c.oid % 2 = 1
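Applied to the script-generating query from earlier, script 1 would look something like this (script 2 is identical except that the final condition compares to 1; the explicit cast of oid to bigint is a precaution, since oid has few operators of its own):

SELECT format('ALTER TABLE %I.%I ADD COLUMN last_update_timestamp TIMESTAMP WITH TIME ZONE DEFAULT current_timestamp;', n.nspname, c.relname)
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE n.nspname = 'test'
  AND c.relkind = 'r'
  AND c.oid::bigint % 2 = 0
\gexec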
Here, we added a column to many tables. In the previous example, we added the column with no specified default, so the new column will have a NULL value; as a result, ALTER TABLE will run very quickly, even on large tables. If we change the ALTER TABLE statement to specify a default, then we should choose a non-volatile expression for the default value; otherwise, PostgreSQL will need to rewrite the entire table, so the runtime will vary according to the table’s size (approximately, and also according to the number and type of its indexes).
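To make the distinction concrete (assuming PostgreSQL 11 or later, and hypothetical column names): the first statement below avoids a rewrite because its default expression is not volatile, while the second forces the whole table to be rewritten:

ALTER TABLE test.a
  ADD COLUMN added_at  TIMESTAMP WITH TIME ZONE DEFAULT current_timestamp;  -- stable: no rewrite
ALTER TABLE test.a
  ADD COLUMN added_at2 TIMESTAMP WITH TIME ZONE DEFAULT clock_timestamp();  -- volatile: full rewrite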
Now that our subtasks’ runtimes vary according to table size, we need to be more careful when splitting them so that we end up with multiple scripts that will run for about the same time.
If we already know that we have just a few big tables, it’s easy to split them manually into scripts.
If the database contains many large tables, then we can sort SQL statements by table size and then distribute them using round-robin distribution into multiple subscripts that will have approximately the same runtime. The following is an example of this technique, which assumes you have multiple large tables in a schema called test.
First, create a table with all the SQL you would like to run:
CREATE TABLE run_sql AS
SELECT format('ALTER TABLE %I.%I ADD COLUMN
         last_update_timestamp TIMESTAMP WITH TIME ZONE
         DEFAULT now();', n.nspname, c.relname) AS sql,
       row_number() OVER (ORDER BY pg_relation_size(c.oid)) AS row_number
FROM pg_class c
JOIN pg_namespace n
  ON c.relnamespace = n.oid
WHERE n.nspname = 'test'
  AND c.relkind = 'r';
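Before splitting, it’s worth reviewing what was generated; something like this lists the statements from smallest to largest table:

SELECT row_number, sql
FROM run_sql
ORDER BY row_number;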
Then, create a file called exec-script.sql and place the following code in it:
SELECT sql FROM run_sql
WHERE row_number % 2 = :i
ORDER BY row_number DESC
\gexec
Then, we run the script twice, as follows:
$ psql -v i=0 -f exec-script.sql &
$ psql -v i=1 -f exec-script.sql &
Note how we used psql parameters – via the -v command-line option – to select different rows using the same script.
Also, note how we used the row_number() window function to sort the data by size. Then, we split the data into pieces using the following line:
WHERE row_number % N = :i
Here, N is the total number of scripts we’re producing, and :i ranges between 0 and N minus 1 (we are using modulo arithmetic to distribute the subtasks).