PostgreSQL's implementation of MVCC differs from that of relational databases such as Oracle and MySQL. MVCC stands for Multi-Version Concurrency Control. As the name suggests, it keeps multiple versions of a row so that transactions see consistent data while they run, and so that readers and writers do not block each other.
To understand it better, consider a hypothetical situation where transaction A started at 9:00 a.m. to get a count of all the records in a table: foo.bar (with 10,000,020 records). As it is a very large table, let's say the count takes 20 minutes to complete. Another transaction, B, started at 9:10 a.m. to delete 20 records from the same table. When transaction A completes at 9:20 a.m., it should still see the same records as it did at 9:00 a.m., that is, 10,000,020 records, regardless of transaction B, which deleted some records at 9:10 a.m. Although the exact behavior depends on the isolation level, the query still gets a consistent view of the data as it was when the query started. How does it work? What is happening internally? We will discuss these things in this recipe.
Throughout this book, you shall see the words records or tuples (of a table). A record in PostgreSQL is mostly referred to as a tuple. Also, PostgreSQL may be referred to as Postgres or PG in many places. They are one and the same.
Getting ready
Oracle and MySQL-like databases have separate UNDO storage that stores the past images required for consistency. If an existing record of a table is modified (updated or deleted), the past image is copied to a separate location. This way, if there is an existing transaction that started before the record got modified, it can still access the record as it was before it got modified. However, this UNDO is maintained in a separate location, not within the same table.
In PostgreSQL, the equivalent of UNDO is maintained within the table itself. What this means is that the tuple before modification and the modified tuple are both stored in the same table.
How to do it...
In the following steps, we shall understand how PostgreSQL implements MVCC by explaining some of the system columns in detail. We shall also consider a simple example where we create a table with two columns, insert some records, and see the transaction IDs assigned to these records. We shall then query system columns such as xmin and xmax and understand how multiple versions of rows are maintained within the same table. This exercise will not only help you understand MVCC but will also show you some of the common queries that are useful in your daily admin life:
- Create a schema and a table with two columns and insert some records into it:
postgres=# CREATE SCHEMA foo;
CREATE SCHEMA
postgres=# CREATE TABLE foo.bar (id int, name varchar(5));
CREATE TABLE
postgres=# INSERT INTO foo.bar VALUES (generate_series(1,5),'avi');
INSERT 0 5
- Query the pg_attribute table to see the system columns that got added to the table along with the two columns id and name:
postgres=# SELECT attname, format_type (atttypid,atttypmod)
FROM pg_attribute
WHERE attrelid = 'foo.bar'::regclass::oid
ORDER BY attnum;
attname | format_type
----------+----------------------
tableoid | oid
cmax | cid
xmax | xid
cmin | cid
xmin | xid
ctid | tid
id | integer
name | character varying(5)
(8 rows)
- We shall then select all the columns from the table using the select * from table command and understand that we don't see any data related to the system column:
postgres=# SELECT * FROM foo.bar LIMIT 1;
id | name
----+------
1 | avi
(1 row)
- Now, to select the values of a system column exclusively, we shall include the system column name in the select command and see what it stores:
postgres=# select xmin,* from foo.bar limit 1;
xmin | id | name
-------+----+------
11705 | 1 | avi
(1 row)
- Let's query the pg_class table to see the oid of the table created in step 1:
postgres=# SELECT oid, relname FROM pg_class WHERE relname = 'bar';
oid | relname
-------+---------
31239 | bar
(1 row)
- If we have two tables with the same name, bar, but in different schemas, they do not share the same oid, as seen in the following example. In this example, we shall create another table in a different schema than the one created in step 1 and see that the oid is different for both:
postgres=# CREATE TABLE public.bar (id int, name varchar(5));
CREATE TABLE
postgres=# SELECT oid, relname FROM pg_class WHERE relname = 'bar' and relkind = 'r';
oid | relname
-------+---------
31242 | bar
31239 | bar
(2 rows)
- To properly identify the table that belongs to a specific schema, we could join pg_namespace with pg_class as seen in the following log:
postgres=# SELECT pc.oid, pn.nspname, pc.relname
FROM pg_class pc
JOIN pg_namespace pn ON pc.relnamespace = pn.oid
WHERE pn.nspname = 'foo'
AND pc.relname = 'bar';
oid | nspname | relname
-------+---------+---------
31239 | foo | bar
(1 row)
- We could also use regclass to identify the oid of a fully qualified table. A fully qualified table is a table specified along with its schemaname (schemaname.tablename):
postgres=# select 'foo.bar'::regclass::oid;
oid
-------
31239
(1 row)
- In this step, we will see how the system column tableoid can be seen from the table for each record and understand that it is the same as the oid of the table:
postgres=# select tableoid, id, name from foo.bar limit 1;
tableoid | id | name
----------+----+------
31239 | 1 | avi
(1 row)
- Every transaction in PostgreSQL has a unique transaction ID. In this step, we shall see how a transaction ID remains the same within a transaction block and changes for a new transaction:
postgres=# BEGIN;
BEGIN
postgres=# select txid_current();
txid_current
--------------
11902
(1 row)
postgres=# select txid_current();
txid_current
--------------
11902
(1 row)
postgres=# END;
COMMIT
postgres=# select txid_current();
txid_current
--------------
11903
(1 row)
- By querying xmin explicitly, we can see the transaction ID that inserted each record. Notice the xmin values of all the records in the following log (the sixth record has a different xmin because it was inserted by a later transaction):
postgres=# select xmin,* from foo.bar;
xmin | id | name
-------+----+------
11705 | 1 | avi
11705 | 2 | avi
11705 | 3 | avi
11705 | 4 | avi
11705 | 5 | avi
11905 | 6 | avi
(6 rows)
- We could also find the xmax of each record by explicitly selecting it. If xmax is set to 0, it was never deleted and is visible:
postgres=# select xmin, xmax, * from foo.bar ;
xmin | xmax | id | name
-------+------+----+------
11705 | 0 | 1 | avi
11705 | 0 | 2 | avi
11705 | 0 | 3 | avi
11705 | 0 | 4 | avi
11705 | 0 | 5 | avi
11905 | 0 | 6 | avi
11907 | 0 | 7 | avi
(7 rows)
- If we perform a delete operation to delete a record, subsequent select queries cannot see the deleted record anymore:
postgres=# BEGIN;
BEGIN
postgres=# DELETE FROM foo.bar WHERE id = 7;
DELETE 1
postgres=# COMMIT;
COMMIT
postgres=# select xmin, xmax, * from foo.bar ;
xmin | xmax | id | name
-------+------+----+------
11705 | 0 | 1 | avi
11705 | 0 | 2 | avi
11705 | 0 | 3 | avi
11705 | 0 | 4 | avi
11705 | 0 | 5 | avi
11905 | 0 | 6 | avi
(6 rows)
- Now, let's use two terminals in parallel. In one terminal, we shall delete a record and then observe the xmin and xmax values of the record being deleted from another terminal, before committing delete:
- Terminal 1: Running delete but not committing it. Note the transaction ID that performed delete:
postgres=# BEGIN;
BEGIN
postgres=# select txid_current();
txid_current
--------------
11911
(1 row)
postgres=# DELETE FROM foo.bar WHERE id = 6;
DELETE 1
- Terminal 2: We can see the xmax value changed to the transaction ID that executed delete in terminal 1:
postgres=# select xmin, xmax, * from foo.bar ;
xmin | xmax | id | name
-------+-------+----+------
11705 | 0 | 1 | avi
11705 | 0 | 2 | avi
11705 | 0 | 3 | avi
11705 | 0 | 4 | avi
11705 | 0 | 5 | avi
11905 | 11911 | 6 | avi
(6 rows)
- Roll back the delete and now see the xmax value:
- Terminal 1: Let's issue rollback instead of commit so that the record is not deleted:
postgres=# BEGIN;
BEGIN
postgres=# select txid_current();
txid_current
--------------
11911
(1 row)
postgres=# DELETE FROM foo.bar WHERE id = 6;
DELETE 1
postgres=# ROLLBACK;
ROLLBACK
- Terminal 2: We can see that xmax still holds the same value, but internally a hint bit (reported as xmax_rolled_back in the pageinspect query later in this recipe) will be set to mark the deleting transaction as rolled back:
$ psql -d postgres -c "select xmin, xmax, id, name from foo.bar"
xmin | xmax | id | name
-------+-------+----+------
11705 | 0 | 1 | avi
11705 | 0 | 2 | avi
11705 | 0 | 3 | avi
11705 | 0 | 4 | avi
11705 | 0 | 5 | avi
11905 | 11911 | 6 | avi
(6 rows)
- We could query the location of each tuple by querying the system column, ctid:
postgres=# select xmin, xmax, ctid, * from foo.bar ;
xmin | xmax | ctid | id | name
-------+-------+-------+----+------
11705 | 0 | (0,1) | 1 | avi
11705 | 0 | (0,2) | 2 | avi
11705 | 0 | (0,3) | 3 | avi
11705 | 0 | (0,4) | 4 | avi
11705 | 0 | (0,5) | 5 | avi
11905 | 11911 | (0,6) | 6 | avi
(6 rows)
How it works...
In order to understand how MVCC works in PostgreSQL, it is important to understand some of the system columns of a table in PostgreSQL. The preceding example contains a demonstration of the hidden columns of a table in PostgreSQL along with the changes to their values when their corresponding records are modified.
If you observe Step 1, it is visible that a table with the name foo.bar has been created with just two columns. However, when you see the output in Step 2, it is interesting to see that it is not just two columns but there are some additional columns that are automatically created by PostgreSQL.
From the output, it is clear that there are six additional columns beyond the ones we specified in the CREATE TABLE statement. To understand how these columns make a significant difference to the way MVCC is implemented in PostgreSQL, let's learn about these system columns in detail.
Though these columns are considered to be hidden, it doesn't mean that the values in the columns are a mystery to an admin. These columns are considered hidden because they are excluded from the output of select * from table, as seen in the output of Step 3.
In order to see what values are stored in these hidden columns, they need to be named explicitly in the SELECT statement, as seen in Step 4. In this example, we see the difference between selecting all the columns of a table and selecting a system column explicitly along with the actual columns.
tableoid
Now, before learning about tableoid, it is important to understand what an OID is. An OID in PostgreSQL stands for an Object Identifier. When a table is created in PostgreSQL, a new record with the table name is inserted into the system table pg_class, and that record refers to the schema of the table, which is recorded in pg_namespace. OIDs are used by PostgreSQL internally as a primary key for such system tables. In order to find the oid of the table foo.bar that was created earlier, the easiest way is to query the pg_class system table as seen in Step 5.
But, what if there is more than one table with the same name but in two different schemas? In PostgreSQL, it is possible to have more than one schema in a single database. For example, if we observe the output in Step 6, it is visible that a table with the same name as the table created in Step 1 was created in a different schema as well.
Thus, in order to find the oid of the table that corresponds to the appropriate schema, pg_class can be joined with the system table pg_namespace (which contains the schema name and the oid of the schema). For every relation in pg_class, the oid of its schema is also inserted. To see that in action, the log in Step 7 contains simple SQL to identify the oid of a specific table that belongs to a specific schema.
There is another easy way to find the OID of a table, using regclass. Substitute foo.bar with the schema name and table name as seen in Step 8.
Now to understand tableoid in a simple way, it is nothing but a column that contains the oid of the table, which is the same as the oid visible in the pg_class table. See Step 9, which illustrates how we can select the tableoid along with the other columns of a table.
xmin
xmin is one of the important columns that a PostgreSQL admin should be fully aware of, as many day-to-day admin tasks depend on understanding it well. To understand xmin better, let's learn about transaction IDs in PostgreSQL. We are not going to discuss problems with transaction IDs in this chapter; this is just an introduction. For now, let's remember that a transaction ID is a unique identifier assigned to a transaction.
A transaction ID in PostgreSQL is a 32-bit unsigned integer. It is cyclic, which means that it counts up to about 4.29 billion (4,294,967,295) and then wraps around and starts over (the first few values are reserved, so normal transaction IDs start at 3). The function txid_current() shows the ID of the current transaction (as a 64-bit value that also accounts for the number of wraparounds). If we observe the output in Step 10 carefully, we see that the transaction ID stayed the same within the entire transaction (between BEGIN and END) but was incremented for the next transaction.
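The cyclic nature of transaction IDs can be sketched with a little modular arithmetic. The following is only an illustrative model, not PostgreSQL code: the helper names (split_txid, next_xid, and xid_precedes) are made up for this example, and it assumes that the values 0, 1, and 2 are reserved and that the 64-bit value returned by txid_current() carries the wraparound count in its upper 32 bits.

```python
from typing import Tuple

XID_BITS = 32
XID_MODULUS = 1 << XID_BITS   # 4,294,967,296 distinct 32-bit values
FIRST_NORMAL_XID = 3          # assumption: 0, 1 and 2 are reserved values


def split_txid(txid: int) -> Tuple[int, int]:
    """Split a 64-bit txid_current() value into (wraparound count, 32-bit xid)."""
    return txid >> XID_BITS, txid & (XID_MODULUS - 1)


def next_xid(xid: int) -> int:
    """Advance a 32-bit xid, skipping the reserved values after a wraparound."""
    nxt = (xid + 1) % XID_MODULUS
    return nxt if nxt >= FIRST_NORMAL_XID else FIRST_NORMAL_XID


def xid_precedes(a: int, b: int) -> bool:
    """True if xid a is older than xid b when compared modulo 2^32."""
    diff = (a - b) % XID_MODULUS
    # A difference in the upper half of the range means "a is behind b".
    return diff >= XID_MODULUS // 2


print(split_txid(11902))              # (0, 11902)
print(next_xid(11902))                # 11903
print(next_xid(4294967295))           # 3
print(xid_precedes(4294967290, 5))    # True: 5 was assigned after the wraparound
```

The last call shows why a plain numeric comparison is not enough once the counter has wrapped around: a numerically small ID can belong to a newer transaction.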
Now that we understand the transaction ID, xmin is nothing but the transaction ID that inserted that tuple. For example, in the output of Step 11, we can see that the first five records were inserted by a transaction with the ID 11705 and the last record was inserted by a transaction with the ID 11905.
This difference in xmin is essential in determining what tuples are visible to a transaction. For example, an SQL statement in a transaction that started before 11905 cannot see the records inserted by later transactions.
xmax
The xmax value makes a significant difference when there are tuples that are being deleted or updated. Before we start to learn about xmax, see the log in Step 12, which shows the xmax value of the records in the foo.bar table.
In the log, we see that the value of xmax is 0. The value of xmax is 0 for a row that has never been deleted and for which no delete has been attempted. There are two scenarios that could happen when you consider deleting a record:
- A delete command was issued by a transaction and it was committed.
- A delete command was issued by a transaction but it has not been committed (either it is still in progress or it was rolled back).
In the first scenario, it is quite understandable that when a delete was issued and committed, the record was no longer visible, as seen in Step 13. So, there is no point in discussing the xmax value for that record.
But, what about the second scenario, where the delete has not been committed yet? To demonstrate that, I have issued a delete in one terminal and looked at the xmax value in another terminal, as seen in Step 14. If you look at the terminal 2 log carefully, the xmax value has been updated with the transaction ID that issued the delete. When the delete is committed, as discussed earlier, the record is no longer visible to future selects.
As seen in Step 15, if I issue a ROLLBACK instead of COMMIT, the record stays visible and its xmax value remains the transaction ID that issued the delete.
Now that we understand xmin and xmax: when a transaction runs SELECT on a table, the tuples visible to it are, in simplified form, those with (xmin <= txid_current()) and (xmax = 0 OR txid_current() < xmax). The actual check also takes into account whether the inserting and deleting transactions were committed:
select * from foo.bar where id = 2 ;
Conceptually, the preceding SQL issued by a transaction applies the following logic (this is pseudocode to illustrate the idea, not a query to run as is):
select * from foo.bar where id = 2 and xmin <= txid_current() and (xmax = 0 or txid_current() < xmax);
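To make the rule above concrete, here is a minimal sketch in Python. It is a toy model under stated assumptions, not how PostgreSQL actually implements visibility: TupleVersion, is_visible, and the uncommitted set are hypothetical names introduced for this example, and the model ignores snapshots, transaction ID wraparound, and a transaction's own changes. The tuple versions are the ones found in page 0 of foo.bar in this recipe.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class TupleVersion:
    xmin: int     # transaction ID that inserted this version
    xmax: int     # transaction ID that deleted it, or 0 if none
    row_id: int   # value of the "id" column


def is_visible(t: TupleVersion, current_xid: int,
               uncommitted: FrozenSet[int] = frozenset()) -> bool:
    """Simplified check: the rule from the text plus commit status."""
    # Inserted by a later or uncommitted transaction: not visible.
    if t.xmin > current_xid or t.xmin in uncommitted:
        return False
    # Never deleted, or the delete is in progress or rolled back: visible.
    if t.xmax == 0 or t.xmax in uncommitted:
        return True
    # Deleted by a committed transaction: visible only to older transactions.
    return current_xid < t.xmax


# Tuple versions stored in page 0 of foo.bar in this recipe.
page: List[TupleVersion] = [TupleVersion(11705, 0, i) for i in range(1, 6)] + [
    TupleVersion(11905, 11911, 6),   # the delete by 11911 was rolled back
    TupleVersion(11907, 11910, 7),   # the delete by 11910 was committed
]


def visible_ids(current_xid: int,
                uncommitted: FrozenSet[int] = frozenset()) -> List[int]:
    """Return the id values a transaction with current_xid would see."""
    return [t.row_id for t in page if is_visible(t, current_xid, uncommitted)]


print(visible_ids(11912, frozenset({11911})))   # [1, 2, 3, 4, 5, 6]
print(visible_ids(11912))                       # [1, 2, 3, 4, 5]
```

The second call applies only the xmin/xmax comparison and wrongly hides record 6, whose delete was rolled back. This is why the commit status of the deleting transaction has to be part of the check.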
ctid
ctid is the field that denotes the physical location of a tuple in a Postgres table. It is unique for each tuple version within a table. It contains the page/block number along with the index of the tuple within that page. For example, the log in Step 16 shows that all the tuples are stored in page 0 and it also shows their locations within the page.
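As a small illustration of how a ctid value can be read, the following sketch splits it into its two parts. The function names are hypothetical, and the byte offset calculation assumes the 8192-byte block size shown by show block_size in this recipe and a table small enough to live in a single file.

```python
import re
from typing import Tuple

BLOCK_SIZE = 8192  # block size reported by "show block_size" in this recipe


def parse_ctid(ctid: str) -> Tuple[int, int]:
    """Split a ctid such as '(0,6)' into (block number, item index)."""
    match = re.fullmatch(r"\((\d+),(\d+)\)", ctid.strip())
    if match is None:
        raise ValueError(f"not a ctid: {ctid!r}")
    return int(match.group(1)), int(match.group(2))


def page_start_offset(block_number: int, block_size: int = BLOCK_SIZE) -> int:
    """Byte offset at which a block starts, counted from block 0."""
    return block_number * block_size


block, item = parse_ctid("(0,6)")
print(block, item, page_start_offset(block))   # 0 6 0
```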
pageinspect
We are going to discuss extensions in PostgreSQL in future chapters. For now, consider them as a piece of external code that can be attached to existing Postgres code to achieve a specific functionality. pageinspect is an extension that is included with the contrib module, which is useful in showing the contents of a page. All the tuples of a table are stored in one or more pages. This extension gives granular visibility to the contents stored inside each page.
To create this extension, we just issue the following command:
postgres=# CREATE EXTENSION pageinspect ;
CREATE EXTENSION
-- Verify
postgres=# \dx
List of installed extensions
Name | Version | Schema | Description
-------------+---------+------------+-------------------------------------------------------
pageinspect | 1.6 | public | inspect the contents of database pages at a low level
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
(2 rows)
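Among others, this extension provides the following functions, which we use in this recipe:
get_raw_page : reads the specified 8 KB page
heap_page_items : shows the header fields of each tuple in a page
heap_page_item_attrs : shows metadata and data of each tuple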
From the previous log, we saw that there are six records after deleting one record from the table. But, has the record really been deleted from the table? Let's look at what is stored inside the page.
As there are very few tuples inside the table, we can see from the following output that there is only 1 page of size 8 KB for this table:
$ psql -d postgres -c "select relname, relpages from pg_class where oid = 'foo.bar'::regclass::oid"
relname | relpages
---------+----------
bar | 1
(1 row)
$ psql -c "show block_size"
block_size
------------
8192
(1 row)
The page sequence starts from 0. So, we shall use pageinspect to see what is inside page 0:
$ psql -d postgres -c "SELECT t_xmin, t_xmax, t_field3 as t_cid, t_ctid FROM
heap_page_items(get_raw_page('foo.bar',0))"
t_xmin | t_xmax | t_cid | t_ctid
--------+--------+-------+--------
11705 | 0 | 0 | (0,1)
11705 | 0 | 0 | (0,2)
11705 | 0 | 0 | (0,3)
11705 | 0 | 0 | (0,4)
11705 | 0 | 0 | (0,5)
11905 | 11911 | 0 | (0,6)
11907 | 11910 | 0 | (0,7)
(7 rows)
In the output of Step 16, we saw no tuple with ctid = (0,7), yet the page still contains one. We deleted the record with id = 7 and committed the delete, so is it surprising that it is not gone from the page? Not really, because we discussed earlier that PostgreSQL keeps old row versions within the table itself. So, a tuple that was deleted earlier is still stored in the table until a cleanup process removes it. The cleanup process (VACUUM) removes it only when there are no transactions dependent on the deleted record. In the following log, we see that the tuple with ctid = (0,7) has its xmax_committed set to t (true). What this means is that a delete was issued by transaction ID 11910 and it got committed:
postgres=# \x
Expanded display is on.
postgres=# SELECT lp,
t_ctid AS ctid,
t_xmin AS xmin,
t_xmax AS xmax,
(t_infomask & 128)::boolean AS xmax_is_lock,
(t_infomask & 1024)::boolean AS xmax_committed,
(t_infomask & 2048)::boolean AS xmax_rolled_back,
(t_infomask & 4096)::boolean AS xmax_multixact,
t_attrs[1] AS p_id,
t_attrs[2] AS p_val
FROM heap_page_item_attrs(
get_raw_page('foo.bar', 0),
'foo.bar'
) WHERE lp = 7;
-[ RECORD 1 ]----+-----------
lp | 7
ctid | (0,7)
xmin | 11907
xmax | 11910
xmax_is_lock | f
xmax_committed | t
xmax_rolled_back | f
xmax_multixact | f
p_id | \x07000000
p_val | \x09617669
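The flag columns and the raw attribute values in this output can be decoded by hand. The sketch below shows one way to do it in Python, under a few assumptions: the bit masks are the ones used in the preceding query, the helper names are made up for this example, the infomask value 1280 is a hypothetical input (the query does not print the raw t_infomask), and the byte layout assumes a little-endian platform where a short string is stored with a one-byte length header.

```python
from typing import Dict

# Bit masks taken from the query above.
XMAX_IS_LOCK = 128
XMAX_COMMITTED = 1024
XMAX_ROLLED_BACK = 2048
XMAX_MULTIXACT = 4096


def decode_infomask(infomask: int) -> Dict[str, bool]:
    """Apply the same bit tests as the SQL query to a t_infomask value."""
    return {
        "xmax_is_lock": bool(infomask & XMAX_IS_LOCK),
        "xmax_committed": bool(infomask & XMAX_COMMITTED),
        "xmax_rolled_back": bool(infomask & XMAX_ROLLED_BACK),
        "xmax_multixact": bool(infomask & XMAX_MULTIXACT),
    }


def hex_to_bytes(value: str) -> bytes:
    """Convert a bytea literal such as '\\x07000000' to raw bytes."""
    return bytes.fromhex(value[2:] if value.startswith("\\x") else value)


def decode_int4(value: str) -> int:
    """Decode a 4-byte integer attribute (assumes little-endian storage)."""
    return int.from_bytes(hex_to_bytes(value), "little", signed=True)


def decode_short_text(value: str) -> str:
    """Decode a short string stored with a one-byte length header."""
    raw = hex_to_bytes(value)
    header = raw[0]
    if header & 0x01 == 0:
        raise ValueError("not a one-byte length header")
    total_len = header >> 1          # the length includes the header byte
    return raw[1:total_len].decode("utf-8")


print(decode_int4(r"\x07000000"))        # 7
print(decode_short_text(r"\x09617669"))  # avi
print(decode_infomask(1280))             # only xmax_committed is True
```

So p_id and p_val in the log are simply the values 7 and avi of the deleted record, still present in the page.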
There's more...
So far, in the previous sections, we have understood how MVCC works in PostgreSQL. The conclusion is that there may be several row versions maintained within each table due to deletions or updates. Over a period of time, there may be many such deleted records still stored in each page. Such records/tuples are called dead tuples, and the tuples that are inserted and remain unmodified are called live tuples. Dead tuples occupy space and may decrease the performance of queries in the database. How should we manage these dead tuples? Should we perform any periodic manual maintenance or is it taken care of automatically? If it's automatic, what does that job? The answer to all of these questions is VACUUM. Let's learn about it in detail now.
When you start Postgres, you should see a list of background processes running, as seen in the following screenshot. Each of these processes (also known as utility processes) has its own responsibility. One of them is the autovacuum launcher process, which is responsible for starting VACUUM and ANALYZE tasks on tables (it does so through autovacuum worker processes):
VACUUM cleans up dead tuples so that the space occupied by them can be reused by future inserts and updates (an update does a deletion and an insertion), whereas ANALYZE collects the statistics of a table so that the execution plan prepared by the planner for a query using this table is optimal. There are certain parameters in PostgreSQL (postgresql.conf) that are used by this process to determine when to run an autovacuum vacuum or an autovacuum analyze on a table. We shall learn about tuning autovacuum and the internals of autovacuum in future chapters.
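The idea that a dead tuple can be removed only when no transaction depends on it can be sketched as follows. This is a toy model and not the actual VACUUM algorithm: TupleVersion, split_kept_and_removable, and the oldest_active_xid parameter are hypothetical names for this example, and the model ignores transaction ID wraparound.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class TupleVersion:
    xmin: int     # transaction ID that inserted this version
    xmax: int     # transaction ID that deleted it, or 0 if none
    row_id: int   # value of the "id" column


def split_kept_and_removable(
    page: List[TupleVersion],
    committed: Set[int],
    oldest_active_xid: int,
) -> Tuple[List[TupleVersion], List[TupleVersion]]:
    """Separate tuples that must be kept from dead tuples that can go."""
    kept: List[TupleVersion] = []
    removable: List[TupleVersion] = []
    for t in page:
        deleted = t.xmax != 0 and t.xmax in committed
        # Removable only if every running transaction started after the delete.
        if deleted and t.xmax < oldest_active_xid:
            removable.append(t)
        else:
            kept.append(t)
    return kept, removable


# Page 0 of foo.bar as seen with pageinspect in this recipe.
page = [TupleVersion(11705, 0, i) for i in range(1, 6)] + [
    TupleVersion(11905, 11911, 6),   # the delete was rolled back: still live
    TupleVersion(11907, 11910, 7),   # the delete was committed: dead tuple
]

kept, removable = split_kept_and_removable(page, {11910}, oldest_active_xid=11912)
print([t.row_id for t in removable])   # [7]

# A transaction that started before the delete is still running.
kept, removable = split_kept_and_removable(page, {11910}, oldest_active_xid=11908)
print([t.row_id for t in removable])   # []
```

In the second call, an older transaction may still need to see the record with id = 7, so the sketch keeps it in the page.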