In this tutorial, we will create solutions to design indexes to help us improve query performance of Teradata database management system.
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by Abhinav Khandelwal and Rajsekhar Bhamidipati titled Teradata Cookbook. This book will teach you to tackle problems related to efficient querying, stored procedure searching, and navigation techniques in a Teradata database.[/box]
A PPI (partitioned primary index) is a type of index that enables users to set up databases that provide performance benefits from a data locality, while retaining the benefits of scalability inherent in the hash architecture of the Teradata database. This is achieved by hashing rows to different virtual AMPs, as is done with a normal PI, but also by creating local partitions within each virtual AMP.
We will see how PPIs will improve the performance of a query.
You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL. This will be a non-partitioned table, as follows:
/*NON PPI TABLE DDL*/
CREATE volatile TABLE EMP_SAL_NONPPI
(
id INT,
Sal INT,
dob DATE,
o_total INT
) primary index( id)
on commit preserve rows;
INSERT into EMP_SAL_NONPPI VALUES (1001,2500,'2017-09-01',890);
INSERT into EMP_SAL_NONPPI VALUES (1002,5500,'2017-09-10',890);
INSERT into EMP_SAL_NONPPI VALUES (1003,500,'2017-09-02',890);
INSERT into EMP_SAL_NONPPI VALUES (1004,54500,'2017-09-05',890);
INSERT into EMP_SAL_NONPPI VALUES (1005,900,'2017-09-23',890);
INSERT into EMP_SAL_NONPPI VALUES (1006,8900,'2017-08-03',890);
INSERT into EMP_SAL_NONPPI VALUES (1007,8200,'2017-08-21',890);
INSERT into EMP_SAL_NONPPI VALUES (1008,6200,'2017-08-06',890);
INSERT into EMP_SAL_NONPPI VALUES (1009,2300,'2017-08-12',890);
INSERT into EMP_SAL_NONPPI VALUES (1010,9200,'2017-08-15',890);
Let's check the explain plan of the following query; we are selecting data based on the DOB column using the following code:
/*Select on NONPPI table*/
SELECT * from EMP_SAL_NONPPI
where dob <= 2017-08-01
Following is the snippet from SQLA showing explain plan of the query:
As seen in the following explain plan, an all-rows scan can be costly in terms of CPU and I/O if the table has millions of rows:
Explain SELECT * from EMP_SAL_NONPPI
where dob <= 2017-08-01;
/*EXPLAIN PLAN of SELECT*/
1) First, we do an all-AMPs RETRIEVE step from DBC.EMP_SAL_NONPPI by way of an all-rows scan with a condition of ("DBC.EMP_SAL_NONPPI.dob <= DATE '1900-12-31'") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 4 rows (148 bytes). The estimated time for this step is 0.04 seconds.
2) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.
Let's see how we can enable partition retrieval in the same query.
DOB:
/*Partition table*/
CREATE volatile TABLE EMP_SAL_PPI
(
id INT,
Sal int,
dob date,
o_total int
) primary index( id)
PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01'
AND DATE '2017-12-01' EACH INTERVAL '1' DAY)
on commit preserve rows;
INSERT into EMP_SAL_PPI VALUES (1001,2500,'2017-09-01',890);
INSERT into EMP_SAL_PPI VALUES (1002,5500,'2017-09-10',890);
INSERT into EMP_SAL_PPI VALUES (1003,500,'2017-09-02',890);
INSERT into EMP_SAL_PPI VALUES (1004,54500,'2017-09-05',890);
INSERT into EMP_SAL_PPI VALUES (1005,900,'2017-09-23',890);
INSERT into EMP_SAL_PPI VALUES (1006,8900,'2017-08-03',890);
INSERT into EMP_SAL_PPI VALUES (1007,8200,'2017-08-21',890);
INSERT into EMP_SAL_PPI VALUES (1008,6200,'2017-08-06',890);
INSERT into EMP_SAL_PPI VALUES (1009,2300,'2017-08-12',890);
INSERT into EMP_SAL_PPI VALUES (1010,9200,'2017-08-15',890);
/*SELECT on PPI table*/
sel * from EMP_SAL_PPI
where dob <= 2017-08-01
Following snippet from SQLA shows query and explain plan of the query:
/*EXPLAIN PLAN*/
1) First, we do an all-AMPs RETRIEVE step from a single partition
of
SYSDBA.EMP_SAL_PPI with a condition of
("SYSDBA.EMP_SAL_PPI.dob =
DATE '2017-08-01'") with a residual condition of (
"SYSDBA.EMP_SAL_PPI.dob = DATE '2017-08-01'") into Spool 1
(group_amps), which is built locally on the AMPs. The size of
Spool 1 is estimated with no confidence to be 1 row (37
bytes).
The estimated time for this step is 0.04 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.
A partitioned PI helps in improving the performance of a query by avoiding a full table scan elimination. A PPI works the same as a primary index for data distribution, but creates partitions according to ranges or cases, as specified in the table. There are four types of PPI that can be created in a table:
/*CASE partition*/
CREATE TABLE SALES_CASEPPI
(
ORDER_ID INTEGER,
CUST_ID INTERGER,
ORDER_DT DATE,
)
PRIMARY INDEX(ORDER_ID)
PARTITION BY CASE_N(ORDER_ID < 101,
ORDER_ ID < 201,
ORDER_ID < 501,
NO CASE,UNKNOWN);
/*Range Partition table*/
CREATE volatile TABLE EMP_SAL_PPI
(
id INT,
Sal int,
dob date,
o_total int
) primary index( id)
PARTITION BY RANGE_N (dob BETWEEN DATE '2017-01-01'
AND DATE '2017-12-01' EACH INTERVAL '1' DAY)
on commit preserve rows
CREATE TABLE SALES_MLPPI_TABLE
(
ORDER_ID INTEGER NOT NULL,
CUST_ID INTERGER,
ORDER_DT DATE,
)
PRIMARY INDEX(ORDER_ID)
PARTITION BY (RANGE_N(ORDER_DT BETWEEN DATE '2017-08-01' AND DATE
'2017-12-31'
EACH INTERVAL '1' DAY)
CASE_N (ORDER_ID < 1001,
ORDER_ID < 2001,
ORDER_ID < 3001,
NO CASE, UNKNOWN));
/*CHAR Partition*/
CREATE TABLE SALES_CHAR_PPI (
ORDR_ID INTEGER,
EMP_NAME VARCHAR (30) CHARACTER,
PRIMARY INDEX (ORDR_ID)
PARTITION BY CASE_N (
EMP_NAME LIKE 'A%', EMP_NAME LIKE 'B%',
EMP_NAME LIKE 'C%', EMP_NAME LIKE 'D%',
EMP_NAME LIKE 'E%', EMP_NAME LIKE 'F%',
NO CASE, UNKNOWN);
PPI not only helps in improving the performance of queries, but also helps in table maintenance. But there are certain performance considerations that you might need to keep in mind when creating a PPI on a table, and they are:
A join index is a data structure that contains data from one or more tables, with or without aggregation:
In this, we will see how join indexes help in improving the performance of queries.
You need to connect to the Teradata database using SQLA or Studio. Let's create a table and insert the following code into it:
CREATE TABLE td_cookbook.EMP_SAL
(
id INT,
DEPT varchar(25),
emp_Fname varchar(25),
emp_Lname varchar(25),
emp_Mname varchar(25),
status INT
)primary index(id);
INSERT into td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2);
INSERT into td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1);
INSERT into td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2);
INSERT into td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1);
Further, we will create a single table join index with a different primary index of the table.
The following are the steps to create a join index to improve performance:
/*SELECT on base table*/
EXPLAIN SELECT id,dept,emp_Fname,emp_Lname,status from
td_cookbook.EMP_SAL
where id=4;
1) First, we do a single-AMP RETRIEVE step from td_cookbook.EMP_SAL by way of the primary index "td_cookbook.EMP_SAL.id = 4" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds.
/*Join Index*/
CREATE JOIN INDEX td_cookbook.EMP_JI
AS
SELECT id,emp_Fname,emp_Lname,status,emp_Mname,dept
FROM td_cookbook.EMP_SAL
PRIMARY INDEX(emp_Fname);
/*Collect stats on JI*/
collect stats td_cookbook.EMP_JI column emp_Fname
Explain sel id,dept,emp_Fname,emp_Lname,status from
td_cookbook.EMP_SAL
where emp_Fname='ankita';
1) First, we do a single-AMP RETRIEVE step from td_cookbooK.EMP_JI by way of the primary index "td_cookbooK.EMP_JI.emp_Fname = 'ankita'" with no residual conditions into Spool 1 (one-amp), which is built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (118 bytes). The estimated time for this step is 0.02 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.02 seconds.
Query performance improves any time a join index can be used instead of the base tables. A join index is most useful when its columns can satisfy, or cover, most or all of the requirements in a query.
For example, the optimizer may consider using a covering index instead of performing a merge join.
When we are able to cover all the queried columns that can be satisfied by a join index, then it is called a cover query.
Covering indexes improve the speed of join queries. The extent of improvement can be dramatic, especially for queries involving complex, large-table, and multiple-table joins. The extent of such improvement depends on how often an index is appropriate to a query.
There are a few more join indexes that can be used in Teradata:
/*AG JOIN INDEX*/
CREATE JOIN INDEX Agg_Join_Index AS
SELECT Cust_ID,
Order_ID,
SUM(Sales_north) -- Aggregate column
FROM
sales_table
GROUP BY 1,2
Primary Index(Cust_ID)
Use FLOAT as a data type for COUNT and SUM to avoid overflow.
/*SP JOIN INDEX*/
CREATE JOIN INDEX Sparse_Join_Index AS
SELECT Cust_ID,
Order_ID,
SUM(Sales_north) -- Aggregate column
FROM
sales_table
where Order_id = 1 -- WHERE CLAUSE
GROUP BY 1,2
Primary Index(Cust_ID)
Hash indexes are designed to improve query performance like join indexes, especially single table join indexes, and in addition, they enable you to avoid accessing the base table. The syntax for the hash index is as follows:
/*Hash index syntax*/
CREATE HASH INDEX <hash-index-name>
[, <fallback-option>]
(<column-name-list1>) ON <base-table>
[BY (<partition-column-name-list2>)]
[ORDER BY <index-sort-spec>] ;
You need to connect to the Teradata database. Let's create a table and insert data into it using the following DDL:
/*Create table with data*/
CREATE TABLE td_cookbook.EMP_SAL
(
id INT,
DEPT varchar(25),
emp_Fname varchar(25),
emp_Lname varchar(25),
emp_Mname varchar(25),
status INT
)primary index(id);
INSERT into td_cookbook.EMP_SAL VALUES (1,'HR','Anikta','lal','kumar',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Anik','kumar','kumar',2);
INSERT into td_cookbook.EMP_SAL VALUES (3,'IT','Arjun','sharma','lal',1);
INSERT into td_cookbook.EMP_SAL VALUES (4,'SALES','Billa','Suti','raj',2);
INSERT into td_cookbook.EMP_SAL VALUES (4,'IT','Koyd','Loud','harlod',1);
INSERT into td_cookbook.EMP_SAL VALUES (2,'HR','Harlod','lal','kumar',1);
/*EXPLAIN of SELECT*/
Explain sel id,emp_Fname from td_cookbook.EMP_SAL;
1) First, we lock td_cookbook.EMP_SAL for read on a reserved RowHash to prevent global deadlock.
2) Next, we lock td_cookbook.EMP_SAL for read.
3) We do an all-AMPs RETRIEVE step from td_cookbook.EMP_SAL by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.
/*Hash Indx*/
CREATE HASH INDEX td_cookbook.EMP_HASH_inx
(id, DEPT) ON td_cookbook.EMP_SAL
BY (id)
ORDER BY HASH (id);
/*Select after hash idx*/
EXPLAIN SELCT id,dept from td_cookbook.EMP_SAL
1) First, we lock td_cookbooK.EMP_HASH_INX for read on a reserved RowHash to prevent global deadlock.
2) Next, we lock td_cookbooK.EMP_HASH_INX for read.
3) We do an all-AMPs RETRIEVE step from td_cookbooK.EMP_HASH_INX by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 6 rows (210 bytes). The estimated time for this step is 0.04 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.04 seconds.
Explain plan can be see in the snippet from SQLA:
Points to consider about the hash index definition are:
Unlike join indexes, hash indexes can only be on a single table.
We explored how to create different types of index to bring up maximum performance in your database queries. If this article made your way, do check out the book Teradata Cookbook and gain confidence in running a wide variety of Data analytics to develop applications for the Teradata environment.
Why MongoDB is the most popular NoSQL database today
Why Oracle is losing the Database Race
Using the Firebase Real-Time Database