First, we will learn the old method of partitioning data.
Before digging deeper into the advantages of partitioning, I want to show how partitions can be created. The entire thing starts with a parent table:
test=# CREATE TABLE t_data (id serial, t date, payload text);
CREATE TABLE
In this example, the parent table has three columns. The date column will be used for partitioning, but more on that a bit later.
Now that the parent table is in place, the child tables can be created. This is how it works:
test=# CREATE TABLE t_data_2016 () INHERITS (t_data);
CREATE TABLE
test=# \d t_data_2016
                         Table "public.t_data_2016"
 Column  |  Type   |                      Modifiers
---------+---------+-----------------------------------------------------
 id      | integer | not null default nextval('t_data_id_seq'::regclass)
 t       | date    |
 payload | text    |
Inherits: t_data
The table is called t_data_2016 and inherits from t_data. () means that no extra columns are added to the child table. As you can see, inheritance means that all columns from the parents are available in the child table. Also note that the id column will inherit the sequence from the parent so that all children can share the very same numbering.
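The shared sequence is easy to observe if you try this in a scratch copy of the structure; both children draw their default id from t_data_id_seq (a sketch, not part of the running example):

```sql
-- Both child tables inherit the same default, nextval('t_data_id_seq'),
-- so ids never collide across children.
INSERT INTO t_data_2016 (t, payload) VALUES ('2016-03-01', 'first row');
INSERT INTO t_data_2015 (t, payload) VALUES ('2015-03-01', 'second row');
-- The two rows receive consecutive ids from the shared sequence,
-- even though they are stored in different child tables.
```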
Let's create more tables:
test=# CREATE TABLE t_data_2015 () INHERITS (t_data);
CREATE TABLE
test=# CREATE TABLE t_data_2014 () INHERITS (t_data);
CREATE TABLE
So far, all the tables are identical and just inherit from the parent. However, there is more: child tables can actually have more columns than parents. Adding more fields is simple:
test=# CREATE TABLE t_data_2013 (special text) INHERITS (t_data);
CREATE TABLE
In this case, a special column has been added. It has no impact on the parent, but just enriches the children and makes them capable of holding more data.
After creating a handful of tables, a row can be added:
test=# INSERT INTO t_data_2015 (t, payload) VALUES ('2015-05-04', 'some data');
INSERT 0 1
The most important thing now is that the parent table can be used to find all the data in the child tables:
test=# SELECT * FROM t_data;
id | t | payload
----+------------+-----------
1 | 2015-05-04 | some data
(1 row)
Querying the parent allows you to gain access to everything below the parent in a simple and efficient manner.
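Conversely, if you are only interested in the parent table itself, PostgreSQL provides the ONLY keyword to exclude all children from the scan:

```sql
-- ONLY restricts the query to the parent table, skipping inherited children.
-- In our example the parent holds no rows of its own, so this returns nothing.
SELECT * FROM ONLY t_data;
```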
To understand how PostgreSQL does partitioning, it makes sense to take a look at the plan:
test=# EXPLAIN SELECT * FROM t_data;
                           QUERY PLAN
-----------------------------------------------------------------
 Append  (cost=0.00..84.10 rows=4411 width=40)
   ->  Seq Scan on t_data  (cost=0.00..0.00 rows=1 width=40)
   ->  Seq Scan on t_data_2016  (cost=0.00..22.00 rows=1200 width=40)
   ->  Seq Scan on t_data_2015  (cost=0.00..22.00 rows=1200 width=40)
   ->  Seq Scan on t_data_2014  (cost=0.00..22.00 rows=1200 width=40)
   ->  Seq Scan on t_data_2013  (cost=0.00..18.10 rows=810 width=40)
(6 rows)
Actually, the process is quite simple. PostgreSQL will simply unify all tables and show us all the content from all the tables inside and below the partition we are looking at. Note that all tables are independent and are just connected logically through the system catalog.
What happens if filters are applied?
test=# EXPLAIN SELECT * FROM t_data WHERE t = '2016-01-04';
                           QUERY PLAN
-----------------------------------------------------------------
 Append  (cost=0.00..95.12 rows=23 width=40)
   ->  Seq Scan on t_data  (cost=0.00..0.00 rows=1 width=40)
         Filter: (t = '2016-01-04'::date)
   ->  Seq Scan on t_data_2016  (cost=0.00..25.00 rows=6 width=40)
         Filter: (t = '2016-01-04'::date)
   ->  Seq Scan on t_data_2015  (cost=0.00..25.00 rows=6 width=40)
         Filter: (t = '2016-01-04'::date)
   ->  Seq Scan on t_data_2014  (cost=0.00..25.00 rows=6 width=40)
         Filter: (t = '2016-01-04'::date)
   ->  Seq Scan on t_data_2013  (cost=0.00..20.12 rows=4 width=40)
         Filter: (t = '2016-01-04'::date)
(11 rows)
PostgreSQL will apply the filter to all the partitions in the structure. It does not know that the table name is somehow related to the content of the tables. To the database, names are just names and have nothing to do with what you are looking for. This makes sense, of course, as there is no mathematical justification for doing anything else.
The point now is: how can we teach the database that the 2016 table only contains 2016 data, the 2015 table only contains 2015 data, and so on? Table constraints are here to do exactly that. They teach PostgreSQL about the content of those tables and therefore allow the planner to make smarter decisions than before. The feature is called constraint exclusion and helps dramatically to speed up queries in many cases.
The following listing shows how table constraints can be created:
test=# ALTER TABLE t_data_2013
           ADD CHECK (t < '2014-01-01');
ALTER TABLE
test=# ALTER TABLE t_data_2014
           ADD CHECK (t >= '2014-01-01' AND t < '2015-01-01');
ALTER TABLE
test=# ALTER TABLE t_data_2015
           ADD CHECK (t >= '2015-01-01' AND t < '2016-01-01');
ALTER TABLE
test=# ALTER TABLE t_data_2016
           ADD CHECK (t >= '2016-01-01' AND t < '2017-01-01');
ALTER TABLE
For each table, a CHECK constraint can be added.
PostgreSQL will only create the constraint if all the data in those tables is perfectly correct and if every single row satisfies the constraint. In contrast to MySQL, constraints in PostgreSQL are taken seriously and honored under any circumstances.
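To illustrate this, here is a sketch of what happens in a scratch database if a child table already contains a row violating the intended constraint (the exact error text varies by version):

```sql
-- A misfiled row that actually belongs in 2015 ...
INSERT INTO t_data_2014 (t, payload) VALUES ('2015-07-01', 'misfiled row');

-- ... makes PostgreSQL refuse to create the constraint:
ALTER TABLE t_data_2014
    ADD CHECK (t >= '2014-01-01' AND t < '2015-01-01');
-- ERROR:  check constraint "t_data_2014_t_check" is violated by some row
```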
In PostgreSQL, those constraints can overlap; this is not forbidden and can make sense in some cases. However, it is usually better to have non-overlapping constraints, because they allow PostgreSQL to exclude more tables from a query.
Here is what happens after adding those table constraints:
test=# EXPLAIN SELECT * FROM t_data WHERE t = '2016-01-04';
                           QUERY PLAN
-----------------------------------------------------------------
 Append  (cost=0.00..25.00 rows=7 width=40)
   ->  Seq Scan on t_data  (cost=0.00..0.00 rows=1 width=40)
         Filter: (t = '2016-01-04'::date)
   ->  Seq Scan on t_data_2016  (cost=0.00..25.00 rows=6 width=40)
         Filter: (t = '2016-01-04'::date)
(5 rows)
The planner will be able to remove many of the tables from the query and only keep those which potentially contain the data. The query can greatly benefit from a shorter and more efficient plan. In particular, if those tables are really large, removing them can boost speed considerably.
Once in a while, data structures have to be modified. The ALTER TABLE clause is here to do exactly that. The question is: how can partitioned tables be modified?
Basically, all you have to do is tackle the parent table and add or remove columns. PostgreSQL will automatically propagate those changes through to the child tables and ensure that changes are made to all the relations, as follows:
test=# ALTER TABLE t_data ADD COLUMN x int;
ALTER TABLE
test=# \d t_data_2016
                         Table "public.t_data_2016"
 Column  |  Type   |                      Modifiers
---------+---------+-----------------------------------------------------
 id      | integer | not null default nextval('t_data_id_seq'::regclass)
 t       | date    |
 payload | text    |
 x       | integer |
Check constraints:
    "t_data_2016_t_check" CHECK (t >= '2016-01-01'::date AND t < '2017-01-01'::date)
Inherits: t_data
As you can see, the column is added to the parent and automatically added to the child table here.
Note that this works for columns (adding, dropping, renaming, and so on). Indexes are a totally different story. In an inherited structure, every table has to be indexed separately. If you add an index to the parent table, it will only be present on the parent; it won't be deployed on the child tables. Indexing all those columns in all those tables is your task, and PostgreSQL is not going to make those decisions for you. Of course, this can be seen as a feature or as a limitation. On the upside, you could say that PostgreSQL gives you all the flexibility to index things separately and therefore potentially more efficiently. However, people might also argue that deploying all those indexes one by one is a lot more work.
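To index the t column across the entire structure, therefore, every child has to be handled explicitly. A minimal sketch (the index names are of course up to you):

```sql
-- An index on the parent is not propagated to inherited children,
-- so each child table gets its own index.
CREATE INDEX idx_t_data_2016_t ON t_data_2016 (t);
CREATE INDEX idx_t_data_2015_t ON t_data_2015 (t);
CREATE INDEX idx_t_data_2014_t ON t_data_2014 (t);
CREATE INDEX idx_t_data_2013_t ON t_data_2013 (t);
```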
Suppose you have an inherited structure. Data is partitioned by date and you want to provide the most recent years to the end user. At some point, you might want to remove some data from the scope of the user without actually touching it; you might want to move it into some sort of archive.
PostgreSQL provides a simple means to achieve exactly that. First, a new parent can be created:
test=# CREATE TABLE t_history (LIKE t_data);
CREATE TABLE
The LIKE keyword allows you to create a table which has exactly the same layout as the t_data table. If you have forgotten which columns the t_data table actually has, this might come in handy as it saves you a lot of work. It is also possible to include indexes, constraints, and defaults.
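Had we wanted those extras as well, the INCLUDING options could have been spelled out, for instance:

```sql
-- Copy the layout plus column defaults, CHECK constraints, and indexes:
CREATE TABLE t_history
    (LIKE t_data INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES);
```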
Then, the table can be moved away from the old parent table and put below the new one. Here is how it works:
test=# ALTER TABLE t_data_2013 NO INHERIT t_data;
ALTER TABLE
test=# ALTER TABLE t_data_2013 INHERIT t_history;
ALTER TABLE
The entire process can, of course, be done in a single transaction to ensure that the operation stays atomic.
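Wrapped in a transaction, the re-parenting shown above looks like this:

```sql
BEGIN;
ALTER TABLE t_data_2013 NO INHERIT t_data;
ALTER TABLE t_data_2013 INHERIT t_history;
COMMIT;
-- Other sessions see both steps at once, or not at all.
```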
One advantage of partitioned tables is the ability to clean data up quickly. Suppose that we want to delete an entire year. If data is partitioned accordingly, a simple DROP TABLE clause can do the job:
test=# DROP TABLE t_data_2014;
DROP TABLE
As you can see, dropping a child table is easy. But what about the parent table? There are dependent objects, and therefore PostgreSQL naturally errors out to make sure that nothing unexpected happens:
test=# DROP TABLE t_data;
ERROR:  cannot drop table t_data because other objects depend on it
DETAIL:  default for table t_data_2013 column id depends on sequence t_data_id_seq
table t_data_2016 depends on table t_data
table t_data_2015 depends on table t_data
HINT:  Use DROP ... CASCADE to drop the dependent objects too.
The DROP TABLE clause warns us that there are dependent objects and refuses to drop the table. The CASCADE clause is needed to force PostgreSQL to actually remove those objects, along with the parent table:
test=# DROP TABLE t_data CASCADE;
NOTICE:  drop cascades to 3 other objects
DETAIL:  drop cascades to default for table t_data_2013 column id
drop cascades to table t_data_2016
drop cascades to table t_data_2015
DROP TABLE
For many years, the PostgreSQL community has been working on built-in partitioning. Finally, PostgreSQL 10.0 offers the first implementation of in-core partitioning, which will be covered in this chapter. For now, the partitioning functionality is still pretty basic. However, a lot of infrastructure for future improvements is already in place.
To show you how partitioning works, I have compiled a simple example featuring range partitioning:
CREATE TABLE data (
    payload integer
) PARTITION BY RANGE (payload);

CREATE TABLE negatives PARTITION OF data
    FOR VALUES FROM (MINVALUE) TO (0);
CREATE TABLE positives PARTITION OF data
    FOR VALUES FROM (0) TO (MAXVALUE);
In this example, one partition will hold all negative values while the other one will take care of positive values. While creating the parent table, you can simply specify which way you want to partition data.
In PostgreSQL 10.0, there is range partitioning and list partitioning. Support for hash partitioning and the like might be available as soon as PostgreSQL 11.0.
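List partitioning follows the same pattern; instead of a range, every partition is assigned an explicit set of values. A small sketch (the table and region codes are made up for illustration):

```sql
CREATE TABLE sales (
    region text,
    amount numeric
) PARTITION BY LIST (region);

CREATE TABLE sales_europe PARTITION OF sales
    FOR VALUES IN ('DE', 'AT', 'CH');
CREATE TABLE sales_usa PARTITION OF sales
    FOR VALUES IN ('US');
```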
Once the parent table has been created, it is already time to create the partitions. To do that, the PARTITION OF clause has been added. At this point, there are still some limitations. The most important one is that a tuple (= a row) cannot move from one partition to the other, for example:
UPDATE data SET payload = -10 WHERE payload = 5;
If there were rows satisfying this condition, PostgreSQL would simply error out and refuse to change the value. However, in a well-designed application it is a bad idea to change the partitioning key anyway. Also, keep in mind that you have to think about indexing each partition.
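As with inheritance, indexes are created per partition in PostgreSQL 10.0; CREATE INDEX on the partitioned parent itself is not supported until version 11.0. A sketch for our example:

```sql
-- Each partition is indexed individually:
CREATE INDEX idx_negatives_payload ON negatives (payload);
CREATE INDEX idx_positives_payload ON positives (payload);
```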
We have learned both the old way of partitioning data and the new partitioning features introduced in PostgreSQL 10.0.
[box type="note" align="" class="" width=""]You read an excerpt from the book Mastering PostgreSQL 10, written by Hans-Jürgen Schönig. To learn about query optimization, stored procedures, and other techniques in PostgreSQL 10.0, you may check out the book Mastering PostgreSQL 10.[/box]