Forcing a query to use an index
Often, we think we know better than the database optimizer. Most of the time, that expectation is wrong, and if you look carefully, you'll see that. So, recheck everything and come back later.
It is a classic error to try to get the database optimizer to use indexes when the database has very little data in it. Put some genuine data in the database first, then worry about it. Better yet, load some data on a test server first, rather than doing this in production.
Sometimes, the optimizer gets it wrong. You feel elated—and possibly angry—that the database optimizer doesn’t see what you see. Please bear in mind that the data distributions within your database change over time, and this causes the optimizer to change its plans over time as well.
If you have found a case where the optimizer is wrong, bear in mind that this can change over time as the data changes. It might have been correct last week and will be correct again next week, or it correctly calculated that a change of plan was required but made that change slightly too early or slightly too late. Again, trying to force the optimizer to do the right thing now might prevent it from doing the right thing later, when the plan needs to change again. So hinting fixes things in the short term, but in the longer term it can cause problems to resurface.
In the long run, it is not recommended to try to force the use of a particular index.
Getting ready
Still here? Oh well.
If you really feel this is necessary, then your starting point is to run an EXPLAIN command for your query, so please read the previous recipe first.
How to do it…
The most common problem is selecting too much data.
A typical point of confusion comes from data that has a few very common values among a larger group. Requesting data for very common values costs more because we need to bring back more rows. As we bring back more rows, the cost of using the index increases. Therefore, it is possible that we won't use the index for very common values, whereas we would use the index for less common values. To use an index effectively, make sure you're using the LIMIT clause to reduce the number of rows that are returned.
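As a sketch of the effect, using a hypothetical orders table in which a status column has a few very common values:
-- Hypothetical table; 'shipped' is very common, 'refunded' is rare
EXPLAIN SELECT * FROM orders WHERE status = 'shipped';   -- likely a sequential scan
EXPLAIN SELECT * FROM orders WHERE status = 'refunded';  -- likely an index scan
-- Capping the number of rows returned can make the index attractive again:
EXPLAIN SELECT * FROM orders WHERE status = 'shipped' LIMIT 50;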
Since different index values might return more or less data, it is common for execution times to vary depending on the exact input parameters. This could cause a problem if we are using prepared statements: the first five executions of a prepared statement are made using "custom plans" that vary according to the exact input parameters. From the sixth execution onward, the optimizer decides whether to use a "generic plan", doing so if it thinks the cost will be lower on average. Custom plans are more accurate, but the planning overhead makes them less efficient than generic plans. This heuristic can go wrong at times, and you might need to override it by setting plan_cache_mode to force_generic_plan or force_custom_plan.
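As a minimal sketch, assuming the same hypothetical orders table:
PREPARE count_by_status(text) AS
  SELECT count(*) FROM orders WHERE status = $1;
EXECUTE count_by_status('refunded');  -- executions 1-5 use custom plans
-- From the sixth execution onward, a generic plan may be chosen;
-- to override the heuristic for the current session:
SET plan_cache_mode = force_custom_plan;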
Another technique for making indexes more usable is partial indexes. Instead of indexing all of the values in a column, you might choose to index only a set of rows that are frequently accessed, for example, by excluding NULL values or other unwanted data. By making the index smaller, it will be cheaper to access and will fit within the cache better, preventing pointless work by targeting the index at only the important data. Data statistics are kept for such indexes, which can also improve the accuracy of query planning. Let's look at an example:
CREATE INDEX ON customer(id)
WHERE blocked = false AND subscription_status = 'paid';
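A query can use this partial index only if its WHERE clause implies the index predicate; for example, with the same hypothetical columns:
SELECT id FROM customer
WHERE blocked = false
  AND subscription_status = 'paid'
  AND id BETWEEN 10000 AND 20000;  -- predicate matches, so the small index applies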
Another common problem is that the optimizer may make errors in its estimation of the number of rows returned, causing the plan to be incorrect. Some optimizer estimation errors can be corrected using CREATE STATISTICS. If the optimizer is making errors, it can be because the WHERE clause contains multiple columns. For example, queries that mention related columns, such as state and phone_area_code, or city and zip_code, will have poor estimates because those pairs of columns have data values that are correlated.
You can define additional statistics that will be collected when you next analyze the table:
CREATE STATISTICS cust_stat1 ON state, area_code FROM cust;
The execution time of ANALYZE will increase to collect the additional stats information, plus there is a small increase in query planning time, so use this sparingly, when you can confirm it will make a difference. If there is no benefit, use DROP STATISTICS to remove them again. By default, multiple types of statistics will be collected; you can fine-tune this by specifying just a few types of statistics if you know what you are doing.
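As a sketch, assuming hypothetical city and zip_code columns on the same cust table, you could collect only functional-dependency statistics, then drop them if they bring no benefit:
CREATE STATISTICS cust_stat2 (dependencies) ON city, zip_code FROM cust;
ANALYZE cust;                -- collects the new statistics
DROP STATISTICS cust_stat2;  -- remove them again if there is no benefit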
Unfortunately, the CREATE STATISTICS command doesn't automatically generate names, so include the table name in the name of each statistics object you create, since that name is unique within the database and cannot be repeated on different tables. In future releases, we may also add cross-table statistics.
Additionally, you cannot collect statistics on individual fields within JSON documents at the moment, nor collect dependency information between them; this command only applies to whole column values at this time.
Another nudge toward using indexes is to set random_page_cost to a lower value, maybe even equal to seq_page_cost. This makes PostgreSQL prefer index scans on more occasions, but it still does not produce entirely unreasonable plans, at least for cases where data is mostly cached in shared buffers or system disk caches, or where the underlying disks are solid-state drives (SSDs).
The default values for these parameters are provided here:
random_page_cost = 4;
seq_page_cost = 1;
Try setting this:
SET random_page_cost = 2;
See if it helps; if not, you can try setting it to 1.
Changing random_page_cost allows you to react to whether data is on disk or in memory. Letting the optimizer know that more of an index is in the cache will help it to understand that using the index is actually cheaper.
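One low-risk way to experiment is inside a transaction with SET LOCAL, so the setting reverts automatically; a sketch with a hypothetical query:
BEGIN;
SET LOCAL random_page_cost = 1.1;  -- applies to this transaction only
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM customer WHERE id = 42;
ROLLBACK;                          -- the setting is discarded here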
Index scan performance for larger scans can also be improved by allowing multiple asynchronous I/O operations, which you do by increasing effective_io_concurrency. Both random_page_cost and effective_io_concurrency can be set for specific tablespaces or for individual queries.
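For example, a hypothetical tablespace on SSDs could be tuned like this (per-query settings can use SET LOCAL, as sketched earlier):
ALTER TABLESPACE fast_ssd SET
  (random_page_cost = 1.1, effective_io_concurrency = 200);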
There’s more…
PostgreSQL does not directly support hints, but they are available via an extension.
If you absolutely, positively have to use the index, then you'll want to know about an extension called pg_hint_plan. It is available for PostgreSQL 9.1 and later versions. For more information and to download it, go to http://pghintplan.sourceforge.jp/. Hints can be added to your application SQL using a special comment added to the start of a query, like this:
/*+ IndexScan(tablename indexname) */ SELECT …
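Note that the extension must be loaded before its hints are recognized; a minimal per-session sketch, assuming the extension is installed on the server and using hypothetical table and index names:
LOAD 'pg_hint_plan';
/*+ IndexScan(orders orders_status_idx) */
SELECT * FROM orders WHERE status = 'refunded';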
It works, but, as I said previously, try to avoid fixing things now and causing yourself pain later when the data distribution changes.
EnterpriseDB's (EDB) Postgres Advanced Server (EPAS) also supports hints in an Oracle-style syntax, allowing you to select a specific index, like this:
SELECT /*+ INDEX(tablename indexname) */ … rest of query …
EPAS has many compatibility features such as this for migrating application logic from Oracle. See https://www.enterprisedb.com/docs/epas/latest/epas_compat_ora_dev_guide/05_optimizer_hints/ for more information on this.