Quickly estimating the number of rows in a table
We don't always need an accurate count of rows, especially on a large table, where counting them may take a long time to execute. Administrators often need to estimate how big a table is so that they can estimate how long other operations may take.
How to do it…
The Postgres optimizer can provide a quick estimate of the number of rows in a table simply by using its statistics:
EXPLAIN SELECT * FROM mytable;
                           QUERY PLAN
----------------------------------------------------------------
 Seq Scan on mytable  (cost=0.00..2640.00 rows=100000 width=97)
(1 row)
We can directly compute a similar number using roughly the same calculation:
SELECT (CASE WHEN reltuples > 0
             THEN pg_relation_size(oid)*reltuples/(8192*relpages)
             ELSE 0
        END)::bigint AS estimated_row_count
FROM pg_class
WHERE oid = 'mytable'::regclass;
This gives us the following output:
 estimated_row_count
---------------------
               99960
(1 row)
Both queries return a row count very quickly, no matter how large the table we are examining is, because they use statistics that were collected in advance. You may want to create a SQL function for the preceding calculation, so that you won't need to retype the SQL code each time.
The following function estimates the total number of rows using a mathematical procedure called extrapolation. In other words, we take the average number of bytes per row from the last statistics collection and apply it to the current table size:
CREATE OR REPLACE FUNCTION estimated_row_count(text)
RETURNS bigint
LANGUAGE sql
AS $$
  SELECT (CASE WHEN reltuples > 0
               THEN pg_relation_size($1)*reltuples/(8192*relpages)
               ELSE 0
          END)::bigint
  FROM pg_class
  WHERE oid = $1::regclass;
$$;
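Once created, the function can be called like any other; assuming the same mytable example, it should return the same estimate as the inline query above:

SELECT estimated_row_count('mytable');
 estimated_row_count
---------------------
               99960
(1 row)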
How it works…
We saw the pg_relation_size() function earlier, so we know that it brings back an accurate value for the current size of the table.
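As a quick reminder, it can be called directly (again using mytable as the example table):

-- exact current size of the table's main data fork, in bytes
SELECT pg_relation_size('mytable');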
When we vacuum a table in Postgres, we record two pieces of information in the pg_class catalog entry for the table. These two items are the number of data blocks in the table (relpages) and the number of rows in the table (reltuples). Some people think they can use the value of reltuples in pg_class directly as an estimate, but it could be severely out of date. You will also be fooled if you use information in the pg_stat_user_tables view, which is discussed in more detail in Chapter 10, Performance and Concurrency.
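To see how stale those recorded values can get, you can compare them against the live block count; this is a quick check, assuming mytable once more as the example:

SELECT relname,
       relpages,   -- blocks recorded at the last VACUUM/ANALYZE
       reltuples,  -- rows recorded at the last VACUUM/ANALYZE
       pg_relation_size(oid) / 8192 AS current_pages  -- blocks on disk right now
FROM pg_class
WHERE oid = 'mytable'::regclass;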
The Postgres optimizer uses the relpages and reltuples values to calculate the average rows per block, which is also known as the average tuple density.
If we assume that the average tuple density remains constant over time, then we can calculate the number of rows using this formula: Row estimate = number of data blocks * rows per block.
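Spelled out as SQL, the extrapolation looks like this; it is just a sketch that is algebraically equivalent to the CASE expression shown earlier, with the zero-guard omitted for clarity:

SELECT ((pg_relation_size(oid) / 8192)      -- current number of 8 KB data blocks
        * (reltuples / relpages))::bigint   -- rows per block at the last VACUUM/ANALYZE
       AS estimated_row_count
FROM pg_class
WHERE oid = 'mytable'::regclass;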
We include some code to handle cases where the reltuples or relpages fields are zero. The Postgres optimizer actually works a little harder than we do in that case, so our estimate isn't very good.
The WHERE oid = 'mytable'::regclass; syntax introduces the concept of object identifier types. The cast is a shorthand trick that converts the name of an object into the object identifier number for that object. The best way to understand this is to think of that syntax as meaning the same as a function named relname2relid().
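For example, the cast works in both directions; the numeric OID shown here is made up, so substitute one from your own catalog:

-- convert a relation name into its OID
SELECT 'mytable'::regclass::oid;
-- and turn an OID back into a relation name (16384 is a hypothetical OID)
SELECT 16384::regclass;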
There's more…
The good thing about the preceding recipe is that it returns a value in about the same time, no matter how big the table is. The bad thing about it is that pg_relation_size() requests a lock on the table, so if any other user has an AccessExclusiveLock lock on the table, then the table size estimate will wait for the lock to be released before returning a value.
Err... so what is an AccessExclusiveLock lock? While performing a SQL maintenance action, such as changing the data type of a column, PostgreSQL will lock out all other actions on that table, including pg_relation_size, which takes a lock in the AccessShareLock mode. For me, a typical case is when I issue some form of SQL maintenance action, such as ALTER TABLE, and the statement takes much longer than I thought it would. At that point, I think, Oh, was that table bigger than I thought? How long will I be waiting? Yes, it's better to calculate that beforehand, but hindsight doesn't get you out of the hole you are in right now. So, we need a way to calculate the size of a table without needing the lock.
A solution is to look at the operating system files that Postgres uses to store data and figure out how large they are, but that requires a higher level of access than most people usually allow. In any case, looking at the files without a lock could cause problems if the table were dropped or changed while we looked.
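One lock-free sketch, not spelled out in this recipe, is to rely on pg_class alone: reading the catalog does not touch the table itself, so relpages gives an approximate size without waiting for any lock, at the cost of being only as fresh as the last VACUUM or ANALYZE:

-- Lock-free but possibly stale: relpages reflects the last VACUUM/ANALYZE,
-- so this won't block behind an AccessExclusiveLock held by, say, ALTER TABLE
SELECT relpages * 8192 AS approx_size_bytes
FROM pg_class
WHERE oid = 'mytable'::regclass;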