3 ways to use Indexes in Teradata to improve database performance

In this tutorial, we will create solutions to design indexes to help us improve query performance of Teradata database management system.

[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati titled Teradata Cookbook. This book will teach you to tackle problems related to efficient querying, stored procedure searching, and navigation techniques in a Teradata database.[/box]

Creating a partitioned primary index to improve performance

A PPI (partitioned primary index) is a type of index that enables users to set up databases that provide performance benefits from a data locality, while retaining the benefits of scalability inherent in the hash architecture of the Teradata database. This is achieved by hashing rows to different virtual AMPs, as is done with a normal PI, but also by creating local partitions within each virtual AMP.

We will see how PPIs will improve the performance of a query.

Getting ready

You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL. This will be a non-partitioned table, as follows:

/*NON PPI TABLE DDL*/

CREATE volatile TABLE EMP_SAL_NONPPI

(

id INT,

Sal INT,

dob DATE,

o_total INT

) primary index( id)  

on commit preserve rows;

INSERT into EMP_SAL_NONPPI VALUES (1001,2500,'2017-09-01',890);

INSERT into EMP_SAL_NONPPI VALUES (1002,5500,'2017-09-10',890);

INSERT into EMP_SAL_NONPPI VALUES (1003,500,'2017-09-02',890);

INSERT into EMP_SAL_NONPPI VALUES (1004,54500,'2017-09-05',890);

INSERT into EMP_SAL_NONPPI VALUES (1005,900,'2017-09-23',890);

INSERT into EMP_SAL_NONPPI VALUES (1006,8900,'2017-08-03',890);

INSERT into EMP_SAL_NONPPI VALUES (1007,8200,'2017-08-21',890);

INSERT into EMP_SAL_NONPPI VALUES (1008,6200,'2017-08-06',890);

INSERT into EMP_SAL_NONPPI VALUES (1009,2300,'2017-08-12',890);

INSERT into EMP_SAL_NONPPI VALUES (1010,9200,'2017-08-15',890);

Let's check the explain plan of the following query; we are selecting data based on the DOB column using the following code:

/*Select on NONPPI table*/

SELECT * from EMP_SAL_NONPPI

where dob <= 2017-08-01

Following is the snippet from SQLA showing explain plan of the query:

3-ways-to-use-indexes-in-teradata-to-improve-database-performance-img-0

As seen in the following explain plan, an all-rows scan can be costly in terms of CPU and I/O if the table has millions of rows:

Explain SELECT * from EMP_SAL_NONPPI

where dob <= 2017-08-01;

/*EXPLAIN PLAN of SELECT*/

1) First, we do an all-AMPs RETRIEVE step from DBC.EMP_SAL_NONPPI by way of an all-rows scan with a condition of ("DBC.EMP_SAL_NONPPI.dob <= DATE '1900-12-31'") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 4 rows (148 bytes). The estimated time for this step is 0.04 seconds.

2) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.

Let's see how we can enable partition retrieval in the same query.

How to do it...

Connect to the Teradata database using SQLA or Studio.

Create the following table with the data. We will define a PPI on the column

DOB:

/*Partition table*/

CREATE volatile TABLE EMP_SAL_PPI

(

id INT,

Sal int,

dob date,

o_total int

) primary index( id)

PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01'

AND DATE '2017-12-01' EACH INTERVAL '1' DAY)

on commit preserve rows;

INSERT into EMP_SAL_PPI VALUES (1001,2500,'2017-09-01',890);

INSERT into EMP_SAL_PPI VALUES (1002,5500,'2017-09-10',890);

INSERT into EMP_SAL_PPI VALUES (1003,500,'2017-09-02',890);

INSERT into EMP_SAL_PPI VALUES (1004,54500,'2017-09-05',890);

INSERT into EMP_SAL_PPI VALUES (1005,900,'2017-09-23',890);

INSERT into EMP_SAL_PPI VALUES (1006,8900,'2017-08-03',890);

INSERT into EMP_SAL_PPI VALUES (1007,8200,'2017-08-21',890);

INSERT into EMP_SAL_PPI VALUES (1008,6200,'2017-08-06',890);

INSERT into EMP_SAL_PPI VALUES (1009,2300,'2017-08-12',890);

INSERT into EMP_SAL_PPI VALUES (1010,9200,'2017-08-15',890);

Let's execute the same query on a new partition table:

/*SELECT on PPI table*/

sel * from EMP_SAL_PPI

where dob <= 2017-08-01

Following snippet from SQLA shows query and explain plan of the query:

3-ways-to-use-indexes-in-teradata-to-improve-database-performance-img-1

The data is being accessed using only a single partition, as shown in the following block:

/*EXPLAIN PLAN*/

1) First, we do an all-AMPs RETRIEVE step from a single partition

of

SYSDBA.EMP_SAL_PPI with a condition of

("SYSDBA.EMP_SAL_PPI.dob =

DATE '2017-08-01'") with a residual condition of (

"SYSDBA.EMP_SAL_PPI.dob = DATE '2017-08-01'") into Spool 1

(group_amps), which is built locally on the AMPs. The size of

Spool 1 is estimated with no confidence to be 1 row (37

bytes).

The estimated time for this step is 0.04 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.

How it works...

A partitioned PI helps in improving the performance of a query by avoiding a full table scan elimination. A PPI works the same as a primary index for data distribution, but creates partitions according to ranges or cases, as specified in the table. There are four types of PPI that can be created in a table:

Case partitioning:

/*CASE partition*/

CREATE TABLE SALES_CASEPPI

(

ORDER_ID INTEGER,

CUST_ID INTERGER,

ORDER_DT DATE,

)

PRIMARY INDEX(ORDER_ID)

PARTITION BY CASE_N(ORDER_ID < 101,

ORDER_ ID < 201,

ORDER_ID < 501,

NO CASE,UNKNOWN);

Range-based partitioning:

/*Range Partition table*/

CREATE volatile TABLE EMP_SAL_PPI

(

id INT,

Sal int,

dob date,

o_total int

) primary index( id)

PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01'

AND DATE '2017-12-01' EACH INTERVAL '1' DAY)

on commit preserve rows

Multi-level partitioning:

CREATE TABLE SALES_MLPPI_TABLE

(

ORDER_ID INTEGER NOT NULL,

CUST_ID INTERGER,

ORDER_DT DATE,

)

PRIMARY INDEX(ORDER_ID)

PARTITION BY (RANGE_N(ORDER_DT BETWEEN DATE '2017-08-01' AND DATE

'2017-12-31'

EACH INTERVAL '1' DAY)

CASE_N (ORDER_ID < 1001,

ORDER_ID < 2001,

ORDER_ID < 3001,

NO CASE, UNKNOWN));

Character-based partitioning:

/*CHAR Partition*/

CREATE TABLE SALES_CHAR_PPI (

ORDR_ID INTEGER,

EMP_NAME VARCHAR (30) CHARACTER,

PRIMARY INDEX (ORDR_ID)

PARTITION BY CASE_N (

EMP_NAME LIKE 'A%', EMP_NAME LIKE 'B%',

EMP_NAME LIKE 'C%', EMP_NAME LIKE 'D%',

EMP_NAME LIKE 'E%', EMP_NAME LIKE 'F%',

NO CASE, UNKNOWN);

PPI not only helps in improving the performance of queries, but also helps in table maintenance. But there are certain performance considerations that you might need to keep in mind when creating a PPI on a table, and they are:

If partition column criteria is not present in the WHERE clause while selecting primary indexes, it can slow the query

The partitioning of the column must be carefully chosen in order to gain maximum benefits

Drop unneeded secondary indexes or value-ordered join indexes

Creating a join index to improve performance

A join index is a data structure that contains data from one or more tables, with or without aggregation:

In this, we will see how join indexes help in improving the performance of queries.

Getting ready

You need to connect to the Teradata database using SQLA or Studio. Let's create a table and insert the following code into it:

CREATE TABLE td_cookbook.EMP_SAL

(

id INT,

DEPT varchar(25),

emp_Fname varchar(25),

emp_Lname varchar(25),

emp_Mname varchar(25),

status INT

)primary index(id);

INSERT into td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1);

INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2);

INSERT into td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1);

INSERT into td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2);

INSERT into td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1);

INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1);

Further, we will create a single table join index with a different primary index of the table.

How to do it...

The following are the steps to create a join index to improve performance:

Connect to the Teradata database using SQLA or Studio.

Check the explain plan for the following query:

/*SELECT on base table*/

EXPLAIN SELECT id,dept,emp_Fname,emp_Lname,status from

td_cookbook.EMP_SAL

where id=4;

1) First, we do a single-AMP RETRIEVE step from td_cookbook.EMP_SAL by way of the primary index "td_cookbook.EMP_SAL.id = 4" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds.

-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds.

Query with a WHERE clause on id; then the system will query the EMP table using the primary index of the base table, which is id.

Now, if a user wants to query a table on column emp_Fname, an all row scan will occur, which will degrade the performance of the query, as shown in the following screenshot:

3-ways-to-use-indexes-in-teradata-to-improve-database-performance-img-2

Now, we will create a JOIN INDEX using emp_Fname as the primary index:

/*Join Index*/

CREATE JOIN INDEX td_cookbook.EMP_JI

AS

SELECT id,emp_Fname,emp_Lname,status,emp_Mname,dept

FROM td_cookbook.EMP_SAL

PRIMARY INDEX(emp_Fname);

Let's collect statistics on the join index:

/*Collect stats on JI*/

collect stats td_cookbook.EMP_JI column emp_Fname

Now, we will check the explain plan query on the WHERE clause using the column emp_Fname:

Explain sel id,dept,emp_Fname,emp_Lname,status from

td_cookbook.EMP_SAL

where emp_Fname='ankita';

1) First, we do a single-AMP RETRIEVE step from td_cookbooK.EMP_JI by way of the primary index "td_cookbooK.EMP_JI.emp_Fname = 'ankita'" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds.

In EXPLAIN, you can see that the optimizer is using the join index instead of the base table when the table queries are using the Emp_Fname column.

How it works...

Query performance improves any time a join index can be used instead of the base tables. A join index is most useful when its columns can satisfy, or cover, most or all of the requirements in a query.

For example, the optimizer may consider using a covering index instead of performing a merge join.

When we are able to cover all the queried columns that can be satisfied by a join index, then it is called a cover query.

Covering indexes improve the speed of join queries. The extent of improvement can be dramatic, especially for queries involving complex, large-table, and multiple-table joins. The extent of such improvement depends on how often an index is appropriate to a query.

There are a few more join indexes that can be used in Teradata:

Aggregate-table join index: A type of join index which pre-joins and summarizes aggregated tables without requiring any physical summary tables. It refreshes automatically whenever the base table changes. Only COUNT and SUM are permitted, and DISTINCT is not permitted:

/*AG JOIN INDEX*/

CREATE JOIN INDEX Agg_Join_Index AS

SELECT Cust_ID,

Order_ID,

SUM(Sales_north) -- Aggregate column

FROM

sales_table

GROUP BY 1,2

Primary Index(Cust_ID)

Use FLOAT as a data type for COUNT and SUM to avoid overflow.

Sparse join index: When a WHERE clause is applied in a JOIN INDEX, it is know as a sparse join index. By limiting the number of rows retrieved in a join, it reduces the size of the join index. It is also useful for UPDATE statements where the index is highly selective:

/*SP JOIN INDEX*/

CREATE JOIN INDEX Sparse_Join_Index AS

SELECT Cust_ID,

Order_ID,

SUM(Sales_north) -- Aggregate column

FROM

sales_table

where Order_id = 1 -- WHERE CLAUSE

GROUP BY 1,2

Primary Index(Cust_ID)

Creating a hash index to improve performance

Hash indexes are designed to improve query performance like join indexes, especially single table join indexes, and in addition, they enable you to avoid accessing the base table. The syntax for the hash index is as follows:

/*Hash index syntax*/

CREATE HASH INDEX <hash-index-name>

[, <fallback-option>]

(<column-name-list1>) ON <base-table>

[BY (<partition-column-name-list2>)]

[ORDER BY <index-sort-spec>] ;

Getting ready

You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL:

/*Create table with data*/

CREATE TABLE td_cookbook.EMP_SAL

(

id INT,

DEPT varchar(25),

emp_Fname varchar(25),

emp_Lname varchar(25),

emp_Mname varchar(25),

status INT

)primary index(id);

INSERT into td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1);

INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2);

INSERT into td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1);

INSERT into td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2);

INSERT into td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1);

INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1);

How to do it...

You need to connect to the Teradata database using SQLA or Studio.

Let's check the explain plan of the following query shown in the figure:

3-ways-to-use-indexes-in-teradata-to-improve-database-performance-img-3

/*EXPLAIN of SELECT*/

Explain sel id,emp_Fname from td_cookbook.EMP_SAL;

1) First, we lock td_cookbook.EMP_SAL for read on a reserved RowHash to prevent global deadlock.

2) Next, we lock td_cookbook.EMP_SAL for read.

3) We do an all-AMPs RETRIEVE step from td_cookbook.EMP_SAL by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds.

4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.

Now let's create a hash join index on the EMP_SAL table:

/*Hash Indx*/

CREATE HASH INDEX td_cookbook.EMP_HASH_inx

(id, DEPT) ON td_cookbook.EMP_SAL

BY (id)

ORDER BY HASH (id);

Let's now check the explain plan on the select query after the hash index creation:

/*Select after hash idx*/

EXPLAIN SELCT id,dept from td_cookbook.EMP_SAL

1) First, we lock td_cookbooK.EMP_HASH_INX for read on a reserved RowHash to prevent global deadlock.

2) Next, we lock td_cookbooK.EMP_HASH_INX for read.

3) We do an all-AMPs RETRIEVE step from td_cookbooK.EMP_HASH_INX by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds.

4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.

-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.

Explain plan can be see in the snippet from SQLA:

3-ways-to-use-indexes-in-teradata-to-improve-database-performance-img-4

How it works...

Points to consider about the hash index definition are:

Each hash index row contains the department id and the department name.

Specifying the department id is unnecessary, since it is the primary index of the base table and will therefore be automatically included.

The BY clause indicates that the rows of this index will be distributed by the department id hash value.

The ORDER BY clause indicates that the index rows will be ordered on each AMP in sequence by the department id hash value. The column specified in the BY clause should be part of the columns which make up the hash index. The BY clause comes with the ORDER BY clause.

Unlike join indexes, hash indexes can only be on a single table.

We explored how to create different types of index to bring up maximum performance in your database queries. If this article made your way, do check out the book Teradata Cookbook and gain confidence in running a wide variety of Data analytics to develop applications for the Teradata environment.