Understanding the secondary index and projections should go hand in hand because of the fact that a secondary index cannot be used efficiently without specifying projection. In this article by Uchit Vyas and Prabhakaran Kuppusamy, authors of DynamoDB Applied Design Patterns, we will take a look at local and global secondary indexes, and projection and its usage with indexes.
(For more resources related to this topic, see here.)
The use of projection in DynamoDB is pretty much similar to that of traditional databases. However, here are a few things to watch out for:
A quick question: while writing a query in any database, keeping the primary key field as part of the query (especially in the where condition) will return results much faster compared to the other way. Why? This is because of the fact that an index will be created automatically in most of the databases for the primary key field. This the case with DynamoDB also. This index is called the primary index of the table. There is no customization possible using the primary index, so the primary index is seldom discussed.
In order to make retrieval faster, the frequently-retrieved attributes need to be made as part of the index. However, a DynamoDB table can have only one primary index and the index can have a maximum of two attributes (hash and range key). So for faster retrieval, the user should be given privileges to create user-defined indexes. This index, which is created by the user, is called the secondary index. Similar to the table key schema, the secondary index also has a key schema. Based on the key schema attributes, the secondary index can be either a local or global secondary index. Whenever a secondary index is created, during every item insertion, the items in the index will be rearranged. This rearrangement will happen for each item insertion into the table, provided the item contains both the index's hash and range key attribute.
Once we have an understanding of the secondary index, we are all set to learn about projection. While creating the secondary index, it is mandatory to specify the hash and range attributes based on which the index is created. Apart from these two attributes, if the query wants one or more attribute (assuming that none of these attributes are projected into the index), then DynamoDB will scan the entire table. This will consume a lot of throughput capacity and will have comparatively higher latency.
The following is the table (with some data) that is used to store book information:
Here are few more details about the table:
While creating the secondary index, the hash and range key of the table and index will be inserted into the index; optionally, the user can specify what other attributes need to be added. There are three kinds of projection possible in DynamoDB:
The following code shows the creation of a local secondary index named Idx_PubDate with BookTitle as the hash key (which is a must in the case of a local secondary index), PubDate as the range key, and using the KEYS_ONLY projection:
private static LocalSecondaryIndex getLocalSecondaryIndex() {
ArrayList<KeySchemaElement> indexKeySchema =
newArrayList<KeySchemaElement>();
indexKeySchema.add(new KeySchemaElement()
.withAttributeName("BookTitle")
.withKeyType(KeyType.HASH));
indexKeySchema.add(new KeySchemaElement()
.withAttributeName("PubDate")
.withKeyType(KeyType.RANGE));
LocalSecondaryIndex lsi = new LocalSecondaryIndex()
.withIndexName("Idx_PubDate")
.withKeySchema(indexKeySchema)
.withProjection(new Projection()
.withProjectionType("KEYS_ONLY"));
return lsi;
}
The usage of the KEYS_ONLY index type will create the smallest possible index and the usage of ALL will create the biggest possible index. We will discuss the trade-offs between these index types a little later.
Going back to our example, let us assume that we are using the KEYS_ONLY index type, so none of the attributes (other than the previous three key attributes) are projected into the index. So the index will look as follows:
You may notice that the row order of the index is almost the same as that of the table order (except the second and third rows). Here, you can observe one point: the table records will be grouped primarily based on the hash key, and then the records that have the same hash key will be ordered based on the range key of the index. In the case of the index, even though the table's range key is part of the index attribute, it will not play any role in the ordering (only the index's hash and range keys will take part in the ordering).
There is a negative in this approach. If the user is writing a query using this index to fetch BookTitle and Publisher with PubDate as 28-Dec-2008, then what happens? Will DynamoDB complain that the Publisher attribute is not projected into the index? The answer is no. The reason is that even though Publisher is not projected into the index, we can still retrieve it using the secondary index. However, retrieving a nonprojected attribute will scan the entire table. So if we are sure that certain attributes need to be fetched frequently, then we must project it into the index; otherwise, it will consume a large number of capacity units and retrieval will be much slower as well.
One more question: if the user is writing a query using the local secondary index to fetch BookTitle and Publisher with PubDate as 28-Dec-2008, then what happens? Will DynamoDB complain that the PubDate attribute is not part of the primary key and hence queries are not allowed on nonprimary key attributes? The answer is no. It is a rule of thumb that we can write queries on the secondary index attributes. It is possible to include nonprimary key attributes as part of the query, but these attributes must at least be key attributes of the index.
The following code shows how to add non-key attributes to the secondary index's projection:
private static Projection getProjectionWithNonKeyAttr() {
Projection projection = new Projection()
.withProjectionType(ProjectionType.INCLUDE);
ArrayList<String> nonKeyAttributes = new ArrayList<String>();
nonKeyAttributes.add("Language");
nonKeyAttributes.add("Author2");
projection.setNonKeyAttributes(nonKeyAttributes);
return projection;
}
There is a slight limitation with the local secondary index. If we write a query on a non-key (both table and index) attribute, then internally DynamoDB might need to scan the entire table; this is inefficient. For example, consider a situation in which we need to retrieve the number of editions of the books in each and every language. Since both of the attributes are non-key, even if we create a local secondary index with either of the attributes (Edition and Language), the query will still result in a scan operation on the entire table.
A problem arises here: is there any way in which we can create a secondary index using both the index keys that are different from the table's primary keys? The answer is the global secondary index. The following code shows how to create the global secondary index for this scenario:
private static GlobalSecondaryIndex getGlobalSecondaryIndex() {
GlobalSecondaryIndex gsi = new GlobalSecondaryIndex()
.withIndexName("Idx_Pub_Edtn")
.withProvisionedThroughput(new ProvisionedThroughput()
.withReadCapacityUnits((long) 1)
.withWriteCapacityUnits((long) 1))
.withProjection(newProjection().withProjectionType
("KEYS_ONLY"));
ArrayList<KeySchemaElement> indexKeySchema1 =
newArrayList<KeySchemaElement>();
indexKeySchema1.add(new KeySchemaElement()
.withAttributeName("Language")
.withKeyType(KeyType.HASH));
indexKeySchema1.add(new KeySchemaElement()
.withAttributeName("Edition")
.withKeyType(KeyType.RANGE));
gsi.setKeySchema(indexKeySchema1);
return gsi;
}
While deciding the attributes to be projected into a global secondary index, there are trade-offs we must consider between provisioned throughput and storage costs. A few of these are listed as follows:
Sharding, also called horizontal partitioning, is a technique in which rows are distributed among the database servers to perform queries faster. In the case of sharding, a hash operation will be performed on the table rows (mostly on one of the columns) and, based on the hash operation output, the rows will be grouped and sent to the proper database server. Take a look at the following diagram:
As shown in the previous diagram, if all the table data (only four rows and one column are shown for illustration purpose) is stored in a single database server, the read and write operations will become slower and the server that has the frequently accessed table data will work more compared to the server storing the table data that is not accessed frequently.
The following diagram shows the advantage of sharding over a multitable, multiserver database environment:
In the previous diagram, two tables (Tbl_Places and Tbl_Sports) are shown on the left-hand side with four sample rows of data (Austria.. means only the first column of the first item is illustrated and all other fields are represented by ..).We are going to perform a hash operation on the first column only. In DynamoDB, this hashing will be performed automatically. Once the hashing is done, similar hash rows will be saved automatically in different servers (if necessary) to satisfy the specified provisioned throughput capacity.
Have you ever wondered about the importance of the hash type key while creating a table (which is mandatory)? Of course we all know the importance of the range key and what it does. It simply sorts items based on the range key value. So far, we might have been thinking that the range key is more important than the hash key. If you think that way, then you may be correct, provided we neither need our table to be provisioned faster nor do we need to create any partitions for our table. As long as the table data is smaller, the importance of the hash key will be realized only while writing a query operation. However, once the table grows, in order to satisfy the same provision throughput, DynamoDB needs to partition our table data based on this hash key (as shown in the previous diagram).
This partitioning of table items based on the hash key attribute is called sharding. It means the partitions are created by splitting items and not attributes. This is the reason why a query that has the hash key (of table and index) retrieves items much faster.
Since the number of partitions is managed automatically by DynamoDB, we cannot just hope for things to work fine. We also need to keep certain things in mind, for example, the hash key attribute should have more distinct values. To simplify, it is not advisable to put binary values (such as Yes or No, Present or Past or Future, and so on) into the hash key attributes, thereby restricting the number of partitions. If our hash key attribute has either Yes or No values in all the items, then DynamoDB can create only a maximum of two partitions; therefore, the specified provisioned throughput cannot be achieved.
Just consider that we have created a table called Tbl_Sports with a provisioned throughput capacity of 10, and then we put 10 items into the table. Assuming that only a single partition is created, we are able to retrieve 10 items per second. After a point of time, we put 10 more items into the table. DynamoDB will create another partition (by hashing over the hash key), thereby satisfying the provisioned throughput capacity. There is a formula taken from the AWS site:
Total provisioned throughput/partitions = throughput per partition
OR
No. of partitions = Total provisioned throughput/throughput per partition
In order to satisfy throughput capacity, the other parameters will be automatically managed by DynamoDB.
In this article, we saw what the local and global secondary indexes are. We walked through projection and its usage with indexes.
Further resources on this subject: