You're reading from Learning Apache Cassandra Managing fault-tolerant, scalable data with high performance

Product type Paperback

Published in Apr 2017

Publisher

ISBN-13 9781787127296

Length 360 pages

Edition 2nd Edition

Languages

Java

Tools

Cassandra

Concepts

Databases

Author (1):

Sandeep Yarabarla

View More author details

What is Cassandra and why Cassandra?

Cassandra is a fully distributed, masterless database, offering superior scalability, and fault tolerance to traditional single-master databases. Compared with other popular distributed databases such as Riak, HBase, and Voldemort, Cassandra offers a uniquely robust and expressive interface for modeling and querying data. What follows is an overview of several desirable database capabilities, with accompanying discussions of what Cassandra has to offer in each category. Apart from the following features, Apache Cassandra has a large user base including some top technology firms such as Apple, Netflix, and Instagram, who also open-source new features. It is being actively developed and has one of largest open source communities with close to 200 contributors and over 20,000 commits.

Horizontal scalability

Horizontal scalability refers to the ability to expand the storage and processing capacity of a database by adding more servers to a database cluster. A traditional single-master database's storage capacity is limited by the capacity of the server that hosts the master instance. If the data set outgrows this capacity and a more powerful server isn't available, the data set must be shared among multiple independent database instances that know nothing of each other. Your application bears responsibility for knowing to which instance a given piece of data belongs.

Cassandra, on the other hand, is deployed as a cluster of instances that are all aware of each other. From the client application's standpoint, the cluster is a single entity; the application need not know, nor care, which machine a piece of data belongs to. Instead, data can be read or written to any instance in the cluster, referred to as a node; this node will forward the request to the instance where the data actually belongs. The result is that Cassandra deployments have an almost limitless capacity to store and process data. When additional capacity is required, more machines can simply be added to the cluster. When new machines join the cluster, Cassandra takes care  of rebalancing the existing data so that each node in the expanded cluster has a roughly equal share. Also, the performance of a Cassandra cluster is directly proportional to the number of nodes within the cluster. As you keep on adding instances, the read and write throughput will keep increasing linearly.

Cassandra is one of the several popular distributed databases inspired by the Dynamo architecture, originally published in a paper by Amazon. Other widely used implementations of Dynamo include Riak and Voldemort. You can read the original paper at http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf.

High availability

The simplest database deployments are run as a single instance on a single server. This sort of configuration is highly vulnerable to interruption: if the server is affected by a hardware failure or network connection outage, the application's ability to read and write data is completely lost until the server is restored. If the failure is catastrophic, the data on that server might be lost completely.

Master-slave architectures improve this picture a bit. The master instance receives all write operations, and then these operations are replicated to follower instances. The application can read data from the master or any of the follower instances, so a single host becoming unavailable will not prevent the application from continuing to read the data. A failure of the master, however, will still prevent the application from performing any write operations, so while this configuration provides high read availability, it doesn't completely provide high availability.

Cassandra, on the other hand, has no single point of failure for reading or writing data. Each piece of data is replicated to multiple nodes, but none of these nodes holds the authoritative master copy. All the nodes in a Cassandra cluster are peers without a master node. If a machine becomes unavailable, Cassandra will continue writing data to the other nodes that share data with that machine and will queue the operations and update the failed node when it rejoins the cluster. This means that in a typical configuration, multiple nodes must fail simultaneously for there to be any application - visible interruption in Cassandra's availability.

How many copies?
When you create a keyspace, Cassandra's version of a database, you specify how many copies of each piece of data should be stored; this is called the replication factor. A replication factor of 3 is a common choice for most use cases.

The image below perfectly illustrates high availability in Cassandra. Clients reconnect to different node in case a node is down:

Write optimization

Traditional relational and document databases are optimized for read performance. Writing data to a relational database will typically involve making in - place updates to complicated data structures on disk, in order to maintain a data structure that can be read efficiently and flexibly. Updating these data structures is a very expensive operation from a standpoint of disk I/O, which is often the limiting factor for database performance. Since writes are more expensive than reads, you'll typically avoid any unnecessary updates to a relational database, even at the expense of extra read operations.

Cassandra, on the other hand, is highly optimized for write throughput and, in fact, never modifies data on disk; it only appends to existing files or creates new ones. This is much easier on disk I/O and means that Cassandra can provide astonishingly high write throughput. Since both writing data to Cassandra and storing data in Cassandra are inexpensive, denormalization carries little cost and is a good way to ensure that data can be efficiently read in various access scenarios.

Because Cassandra is optimized for write volume, you shouldn't shy away from writing data to the database. In fact, it's most efficient to write without reading whenever possible, even if doing so might result in redundant updates.

Just because Cassandra is optimized for writes doesn't make it bad at reads; in fact, a well-designed Cassandra database can handle very heavy read loads with no problem. We'll cover the topic of efficient data modeling in great depth in the next few chapters.

Structured records

The first three database features we looked at are commonly found in distributed data stores. However, databases such as Riak and Voldemort are purely key value stores; these databases have no knowledge of the internal structure of a record that's stored in a particular key. This means useful functions such as updating only part of a record, reading only certain fields from a record, or retrieving records that contain a particular value in a given field are not possible.

Relational databases such as PostgreSQL, document stores such as MongoDB, and, to a limited extent, newer key-value stores such as Redis, do have a concept of the internal structure of their records, and most application developers are accustomed to taking advantage of the possibilities this allows. None of these databases, however, offer the advantages of a masterless distributed architecture.

In Cassandra, records are structured much in the same way as they are in a relational database—using tables, rows, and columns. Thus, applications using Cassandra can enjoy all the benefits of masterless distributed storage while also getting all the advanced data modeling and access features associated with structured records.

Secondary indexes

A secondary index, commonly referred to as an index in the context of a relational database, is a structure allowing efficient lookup of records by some attribute other than their primary key. This is a widely useful capability; for instance, when developing a blog application, you would want to be able to easily retrieve all of the posts written by a particular author. Cassandra supports secondary indexes; while Cassandra's version is not as versatile as indexes in a typical relational database, it's  a powerful feature in the right circumstances. A query on a secondary index can perform poorly in certain cases; hence, it should be used sparingly and only in certain cases, which we will touch upon later in Chapter 5, Establishing Relationships.

Materialized views

Data modeling principles in Cassandra compel us to denormalize data as much as possible. Prior to Cassandra 3.0, the only way to query on a non-primary key column was to create a secondary index and query on it. However, secondary indexes have a performance trade-off if they contain high cardinality data. Often, high cardinality secondary indexes have to scan data on all the nodes and aggregate them to return the query results. This defeats the purpose of having a distributed system.

To avoid secondary indexes and client-side denormalization, Cassandra introduced the feature of materialized views, which does server side denormalization. You can create views for a base table and Cassandra ensures eventual consistency between the base and view. This lets us do very fast lookups on each view following the normal Cassandra read path. Materialized views maintain a correspondence of one CQL row each in the base and the view, so we need to ensure that each CQL row that is required for the views will be reflected in the base table's primary keys. Although a materialized view allows for fast lookups on non-primary key indexes, this comes at a performance hit to writes. Also, using secondary indexes and materialized views increases the disk usage by a considerable margin. Thus, it is important to take this into consideration when sizing your cluster.

Efficient result ordering

It's quite common to want to retrieve a record set ordered by a particular field; for instance, a photo-sharing service will want to retrieve the most recent photographs in descending order of creation. Since sorting data on the fly is a fundamentally expensive operation, databases must keep information about record ordering persisted on disk in order to efficiently return results in order. In a relational database, this is one of the jobs of a secondary index.

In Cassandra, secondary indexes can't be used for result ordering, but tables can be structured such that rows are always kept sorted by a given column or columns, called clustering columns. Sorting by arbitrary columns at read time is not possible, but the capacity to efficiently order records in any way and to retrieve ranges of records based on this ordering is an unusually powerful capability for a distributed database.

Immediate consistency

When we write a piece of data to a database, it is our hope that that data is immediately available to any other process that may wish to read it. From another point of view, when we read some data from a database, we would like to be guaranteed that the data we retrieve is the most recently updated version. This guarantee is called immediate consistency, and it's a property of most common single-master databases such as MySQL and PostgreSQL.

Distributed systems such as Cassandra typically do not provide an immediate consistency guarantee. Instead, developers must be willing to accept eventual consistency, which means when data is updated, the system will reflect that update at some point in the future. Developers are willing to give up immediate consistency precisely because it is a direct trade-off with high availability.

In the case of Cassandra, that trade-off is made explicit through tunable consistency. Each time you design a write or read path for data, you have the option of immediate consistency with less resilient availability, or eventual consistency with extremely resilient availability. We'll cover consistency tuning in great detail in Chapter 10, How Cassandra Distributes Data.

Discretely writable collections

While it's useful for records to be internally structured into discrete fields, a given property of a record isn't always a single value such as a string or an integer. One simple way to handle fields that contain collections of values is to serialize them using a format such as JSON and then save the serialized collection into a text field. However, in order to update collections stored in this way, the serialized data must be read from the database, decoded, modified, and then written back to the database in its entirety. If two clients try to perform this kind of modification to the same record concurrently, one of the updates will be overwritten by the other. For this reason, many databases offer built-in collection structures that can be discretely updated: values can be added to and removed from collections without reading and rewriting the entire collection. Cassandra is no exception, offering list, set, and map collections, and supporting operations such as append the number 3 to the end of this list. Neither the client nor Cassandra itself needs to read the current state of the collection in order to update it, meaning collection updates are also blazingly efficient.

Relational joins

In real-world applications, different pieces of data relate to each other in a variety of ways. Relational databases allow us to perform queries that make these relationships explicit; for instance, to retrieve a set of events whose location is in the state of New York (this is assuming events and locations are different record types). Cassandra, however, is not a relational database and does not support anything such as joins. Instead, applications using Cassandra typically denormalize data and make clever use of clustering in order to perform the sorts of data access that would use a join in a relational database.

For data sets that aren't already denormalized, applications can also perform client-side joins, which mimic the behavior of a relational database by performing multiple queries and joining the results at the application level. Client-side joins are less efficient than reading data that has been denormalized in advance, but they offer more flexibility. We'll cover both of these approaches in Chapter 6, Denormalizing Data for Maximum Performance.