Learning Apache Cassandra

Chapter 1. Getting Up and Running with Cassandra

As an application developer, you have almost certainly worked with databases extensively. You must have built products using relational databases like MySQL and PostgreSQL, and perhaps experimented with a document store like MongoDB or a key-value database like Redis. While each of these tools has its strengths, you will now consider whether a distributed database like Cassandra might be the best choice for the task at hand.

In this chapter, we'll talk about the major reasons to choose Cassandra from among the many database options available to you. Having established that Cassandra is a great choice, we'll go through the nuts and bolts of getting a local Cassandra installation up and running. By the end of this chapter, you'll know:

When and why Cassandra is a good choice for your application
How to install Cassandra on your development machine
How to interact with Cassandra using cqlsh
How to create a keyspace

What Cassandra offers, and what it doesn't

Cassandra is a fully distributed, masterless database, offering superior scalability and fault tolerance to traditional single master databases. Compared with other popular distributed databases like Riak, HBase, and Voldemort, Cassandra offers a uniquely robust and expressive interface for modeling and querying data. What follows is an overview of several desirable database capabilities, with accompanying discussions of what Cassandra has to offer in each category.

Horizontal scalability

Horizontal scalability refers to the ability to expand the storage and processing capacity of a database by adding more servers to a database cluster. A traditional single-master database's storage capacity is limited by the capacity of the server that hosts the master instance. If the data set outgrows this capacity, and a more powerful server isn't available, the data set must be sharded among multiple independent database instances that know nothing of each other. Your application bears responsibility for knowing to which instance a given piece of data belongs.

Cassandra, on the other hand, is deployed as a cluster of instances that are all aware of each other. From the client application's standpoint, the cluster is a single entity; the application need not know, nor care, which machine a piece of data belongs to. Instead, data can be read or written to any instance in the cluster, referred to as a node; this node will forward the request to the instance where the data actually belongs.

The result is that Cassandra deployments have an almost limitless capacity to store and process data; when additional capacity is required, more machines can simply be added to the cluster. When new machines join the cluster, Cassandra takes care of rebalancing the existing data so that each node in the expanded cluster has a roughly equal share.

Note

Cassandra is one of the several popular distributed databases inspired by the Dynamo architecture, originally published in a paper by Amazon. Other widely used implementations of Dynamo include Riak and Voldemort. You can read the original paper at http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf.

High availability

The simplest database deployments are run as a single instance on a single server. This sort of configuration is highly vulnerable to interruption: if the server is affected by a hardware failure or network connection outage, the application's ability to read and write data is completely lost until the server is restored. If the failure is catastrophic, the data on that server might be lost completely.

A master-follower architecture improves this picture a bit. The master instance receives all write operations, and then these operations are replicated to follower instances. The application can read data from the master or any of the follower instances, so a single host becoming unavailable will not prevent the application from continuing to read data. A failure of the master, however, will still prevent the application from performing any write operations, so while this configuration provides high read availability, it doesn't completely provide high availability.

Cassandra, on the other hand, has no single point of failure for reading or writing data. Each piece of data is replicated to multiple nodes, but none of these nodes holds the authoritative master copy. If a machine becomes unavailable, Cassandra will continue writing data to the other nodes that share data with that machine, and will queue the operations and update the failed node when it rejoins the cluster. This means in a typical configuration, two nodes must fail simultaneously for there to be any application-visible interruption in Cassandra's availability.

Tip

How many copies?

When you create a keyspace—Cassandra's version of a database—you specify how many copies of each piece of data should be stored; this is called the replication factor. A replication factor of 3 is a common and good choice for many use cases.

Write optimization

Traditional relational and document databases are optimized for read performance. Writing data to a relational database will typically involve making in-place updates to complicated data structures on disk, in order to maintain a data structure that can be read efficiently and flexibly. Updating these data structures is a very expensive operation from a standpoint of disk I/O, which is often the limiting factor for database performance. Since writes are more expensive than reads, you'll typically avoid any unnecessary updates to a relational database, even at the expense of extra read operations.

Cassandra, on the other hand, is highly optimized for write throughput, and in fact never modifies data on disk; it only appends to existing files or creates new ones. This is much easier on disk I/O and means that Cassandra can provide astonishingly high write throughput. Since both writing data to Cassandra, and storing data in Cassandra, are inexpensive, denormalization carries little cost and is a good way to ensure that data can be efficiently read in various access scenarios.

Note

Because Cassandra is optimized for write volume, you shouldn't shy away from writing data to the database. In fact, it's most efficient to write without reading whenever possible, even if doing so might result in redundant updates.

Just because Cassandra is optimized for writes doesn't make it bad at reads; in fact, a well-designed Cassandra database can handle very heavy read loads with no problem. We'll cover the topic of efficient data modeling in great depth in the next few chapters.

Structured records

The first three database features we looked at are commonly found in distributed data stores. However, databases like Riak and Voldemort are purely key-value stores; these databases have no knowledge of the internal structure of a record that's stored at a particular key. This means useful functions like updating only part of a record, reading only certain fields from a record, or retrieving records that contain a particular value in a given field are not possible.

Relational databases like PostgreSQL, document stores like MongoDB, and, to a limited extent, newer key-value stores like Redis do have a concept of the internal structure of their records, and most application developers are accustomed to taking advantage of the possibilities this allows. None of these databases, however, offer the advantages of a masterless distributed architecture.

In Cassandra, records are structured much in the same way as they are in a relational database—using tables, rows, and columns. Thus, applications using Cassandra can enjoy all the benefits of masterless distributed storage while also getting all the advanced data modeling and access features associated with structured records.

Secondary indexes

A secondary index, commonly referred to as an index in the context of a relational database, is a structure allowing efficient lookup of records by some attribute other than their primary key. This is a widely useful capability: for instance, when developing a blog application, you would want to be able to easily retrieve all of the posts written by a particular author. Cassandra supports secondary indexes; while Cassandra's version is not as versatile as indexes in a typical relational database, it's a powerful feature in the right circumstances.

Efficient result ordering

It's quite common to want to retrieve a record set ordered by a particular field; for instance, a photo sharing service will want to retrieve the most recent photographs in descending order of creation. Since sorting data on the fly is a fundamentally expensive operation, databases must keep information about record ordering persisted on disk in order to efficiently return results in order. In a relational database, this is one of the jobs of a secondary index.

In Cassandra, secondary indexes can't be used for result ordering, but tables can be structured such that rows are always kept sorted by a given column or columns, called clustering columns. Sorting by arbitrary columns at read time is not possible, but the capacity to efficiently order records in any way, and to retrieve ranges of records based on this ordering, is an unusually powerful capability for a distributed database.

Immediate consistency

When we write a piece of data to a database, it is our hope that that data is immediately available to any other process that may wish to read it. From another point of view, when we read some data from a database, we would like to be guaranteed that the data we retrieve is the most recently updated version. This guarantee is called immediate consistency, and it's a property of most common single-master databases like MySQL and PostgreSQL.

Distributed systems like Cassandra typically do not provide an immediate consistency guarantee. Instead, developers must be willing to accept eventual consistency, which means when data is updated, the system will reflect that update at some point in the future. Developers are willing to give up immediate consistency precisely because it is a direct tradeoff with high availability.

In the case of Cassandra, that tradeoff is made explicit through tunable consistency. Each time you design a write or read path for data, you have the option of immediate consistency with less resilient availability, or eventual consistency with extremely resilient availability. We'll cover consistency tuning in great detail in Chapter 10, How Cassandra Distributes Data.

Discretely writable collections

While it's useful for records to be internally structured into discrete fields, a given property of a record isn't always a single value like a string or an integer. One simple way to handle fields that contain collections of values is to serialize them using a format like JSON, and then save the serialized collection into a text field. However, in order to update collections stored in this way, the serialized data must be read from the database, decoded, modified, and then written back to the database in its entirety. If two clients try to perform this kind of modification to the same record concurrently, one of the updates will be overwritten by the other.

For this reason, many databases offer built-in collection structures that can be discretely updated: values can be added to, and removed from collections, without reading and rewriting the entire collection. Cassandra is no exception, offering list, set, and map collections, and supporting operations like "append the number 3 to the end of this list". Neither the client nor Cassandra itself needs to read the current state of the collection in order to update it, meaning collection updates are also blazingly efficient.

Relational joins

In real-world applications, different pieces of data relate to each other in a variety of ways. Relational databases allow us to perform queries that make these relationships explicit, for instance, to retrieve a set of events whose location is in the state of New York (this is assuming events and locations are different record types). Cassandra, however, is not a relational database, and does not support anything like joins. Instead, applications using Cassandra typically denormalize data and make clever use of clustering in order to perform the sorts of data access that would use a join in a relational database.

For data sets that aren't already denormalized, applications can also perform client-side joins, which mimic the behavior of a relational database by performing multiple queries and joining the results at the application level. Client-side joins are less efficient than reading data that has been denormalized in advance, but offer more flexibility. We'll cover both of these approaches in Chapter 6, Denormalizing Data for Maximum Performance.

MapReduce

MapReduce is a technique for performing aggregate processing on large amounts of data in parallel; it's a particularly common technique in data analytics applications. Cassandra does not offer built-in MapReduce capabilities, but it can be integrated with Hadoop in order to perform MapReduce operations across Cassandra data sets, or Spark for real-time data analysis. The DataStax Enterprise product provides integration with both of these tools out-of-the-box.

Comparing Cassandra to the alternatives

Now that you've got an in-depth understanding of the feature set that Cassandra offers, it's time to figure out which features are most important to you, and which database is the best fit. The following table lists a handful of commonly used databases, and key features that they do or don't have:

Feature	Cassandra	PostgreSQL	MongoDB	Redis	Riak
Structured records	Yes	Yes	Yes	Limited	No
Secondary indexes	Yes	Yes	Yes	No	Yes
Discretely writable collections	Yes	Yes	Yes	Yes	No
Relational joins	No	Yes	No	No	No
Built-in MapReduce	No	No	Yes	No	Yes
Fast result ordering	Yes	Yes	Yes	Yes	No
Immediate consistency	Configurable at query level	Yes	Yes	Yes	Configurable at cluster level
Transparent sharding	Yes	No	Yes	No	Yes
No single point of failure	Yes	No	No	No	Yes
High throughput writes	Yes	No	No	Yes	Yes

As you can see, Cassandra offers a unique combination of scalability, availability, and a rich set of features for modeling and accessing data.

Installing Cassandra

Now that you're acquainted with Cassandra's substantial powers, you're no doubt chomping at the bit to try it out. Happily, Cassandra is free, open source, and very easy to get running.

Installing on Mac OS X

First, we need to make sure that we have an up-to-date installation of the Java Runtime Environment. Open the Terminal application, and type the following into the command prompt:

$ java -version

You will see an output that looks something like the following:

java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

Note

Pay particular attention to the java version listed: if it's lower than 1.7.0_25, you'll need to install a new version. If you have an older version of Java or if Java isn't installed at all, head to https://www.java.com/en/download/mac_download.jsp and follow the download instructions on the page.

You'll need to set up your environment so that Cassandra knows where to find the latest version of Java. To do this, set up your JAVA_HOME environment variable to the install location, and your PATH to include the executable in your new Java installation as follows:

$ export JAVA_HOME="/Library/Internet Plug- Ins/JavaAppletPlugin.plugin/Contents/Home"
$ export PATH="$JAVA_HOME/bin":$PATH

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You should put these two lines at the bottom of your .bashrc file to ensure that things still work when you open a new terminal.

Note

The installation instructions given earlier assume that you're using the latest version of Mac OS X (at the time of writing this, 10.10 Yosemite). If you're running a different version of OS X, installing Java might require different steps. Check out https://www.java.com/en/download/faq/java_mac.xml for detailed installation information.

Once you've got the right version of Java, you're ready to install Cassandra. It's very easy to install Cassandra using Homebrew; simply type the following:

$ brew install cassandra
$ pip install cassandra-driver cql
$ cassandra

Here's what we just did:

Installed Cassandra using the Homebrew package manager
Installed the CQL shell and its dependency, the Python Cassandra driver
Started the Cassandra server

Installing on Ubuntu

First, we need to make sure that we have an up-to-date installation of the Java Runtime Environment. Open the Terminal application, and type the following into the command prompt:

$ java -version

You will see an output that looks similar to the following:

java version "1.7.0_65"

OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3- 0ubuntu0.14.04.1)
OpenJDK 64-bit Server VM (build 24.65-b04, mixed mode)

Note

Pay particular attention to the java version listed: it should start with 1.7. If you have an older version of Java, or if Java isn't installed at all, you can install the correct version using the following command:

$ sudo apt-get install openjdk-7-jre-headless

Once you've got the right version of Java, you're ready to install Cassandra. First, you need to add Apache's Debian repositories to your sources list. Add the following lines to the file /etc/apt/sources.list:

deb http://www.apache.org/dist/cassandra/debian 21x main
deb-src http://www.apache.org/dist/cassandra/debian 21x main

In the Terminal application, type the following into the command prompt:

$ gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
$ gpg --export --armor F758CE318D77295D | sudo apt-key add -
$ gpg --keyserver pgp.mit.edu --recv-keys 2B5C1B00
$ gpg --export --armor 2B5C1B00 | sudo apt-key add -
$ gpg --keyserver pgp.mit.edu --recv-keys 0353B12C
$ gpg --export --armor 0353B12C | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install cassandra
$ cassandra

Here's what we just did:

Added the Apache repositories for Cassandra 2.1 to our sources list
Added the public keys for the Apache repo to our system and updated our repository cache
Installed Cassandra
Started the Cassandra server

Installing on Windows

The easiest way to install Cassandra on Windows is to use the DataStax Community Edition. DataStax is a company that provides enterprise-level support for Cassandra; they also release Cassandra packages at both free and paid tiers. DataStax Community Edition is free, and does not differ from the Apache package in any meaningful way.

DataStax offers a graphical installer for Cassandra on Windows, which is available for download at planetcassandra.org/cassandra. On this page, locate Windows Server 2008/Windows 7 or Later (32-Bit) from the Operating System menu (you might also want to look for 64-bit if you run a 64-bit version of Windows), and choose MSI Installer (2.x) from the version columns.

Download and run the MSI file, and follow the instructions, accepting the defaults:

Once the installer completes this task, you should have an installation of Cassandra running on your machine.

Bootstrapping the project

Throughout the remainder of this book, we will build an application called MyStatus, which allows users to post status updates for their friends to read. In each chapter, we'll add new functionality to the MyStatus application; each new feature will also introduce a new aspect of Cassandra.

CQL – the Cassandra Query Language

Since this is a book about Cassandra and not targeted to users of any particular programming language or application framework, we will focus entirely on the database interactions that MyStatus will require. Code examples will be in Cassandra Query Language (CQL). Specifically, we'll use version 3.1.1 of CQL, which is available in Cassandra 2.0.6 and later versions.

As the name implies, CQL is heavily inspired by SQL; in fact, many CQL statements are equally valid SQL statements. However, CQL and SQL are not interchangeable. CQL lacks a grammar for relational features such as JOIN statements, which are not possible in Cassandra. Conversely, CQL is not a subset of SQL; constructs for retrieving the update time of a given column, or performing an update in a lightweight transaction, which are available in CQL, do not have an SQL equivalent.

Note

Throughout this book, you'll learn the important constructs of CQL. Once you've completed reading this book, I recommend you to turn to the DataStax CQL documentation for additional reference. This documentation is available at http://www.datastax.com/documentation/cql/3.1.

Interacting with Cassandra

Most common programming languages have drivers for interacting with Cassandra. When selecting a driver, you should look for libraries that support the CQL binary protocol, which is the latest and most efficient way to communicate with Cassandra.

Tip

The CQL binary protocol is a relatively new introduction; older versions of Cassandra used the Thrift protocol as a transport layer. Although Cassandra continues to support Thrift, avoid Thrift-based drivers, as they are less performant than the binary protocol.

Here are CQL binary drivers available for some popular programming languages:

Language	Driver	Available at
Java	DataStax Java Driver	github.com/datastax/java-driver
Python	DataStax Python Driver	github.com/datastax/python-driver
Ruby	DataStax Ruby Driver	github.com/datastax/ruby-driver
C++	DataStax C++ Driver	github.com/datastax/cpp-driver
C#	DataStax C# Driver	github.com/datastax/csharp-driver
JavaScript (Node.js)	node-cassandra-cql	github.com/jorgebay/node-cassandra-cql
PHP	phpbinarycql	github.com/rmcfrazier/phpbinarycql

While you will likely use one of these drivers in your applications, to try out the code examples in this book, you can simply use the cqlsh tool, which is a command-line interface for executing CQL queries and viewing the results. To start cqlsh on OS X or Linux, simply type cqlsh into your command line; you should see something like this:

$ cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.7 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh>

On Windows, you can start cqlsh by finding the Cassandra CQL Shell application in the DataStax Community Edition group in your applications. Once you open it, you should see the same output we just saw.

Creating a keyspace

A keyspace is a collection of related tables, equivalent to a database in a relational system. To create the keyspace for our MyStatus application, issue the following statement in the CQL shell:

CREATE KEYSPACE "my_status"
WITH REPLICATION = {
  'class': 'SimpleStrategy', 'replication_factor': 1
};

Here we created a keyspace called my_status, which we will use for the remainder of this book. When we create a keyspace, we have to specify replication options. Cassandra provides several strategies for managing replication of data; SimpleStrategy is the best strategy as long as your Cassandra deployment does not span multiple data centers. The replication_factor value tells Cassandra how many copies of each piece of data are to be kept in the cluster; since we are only running a single instance of Cassandra, there is no point in keeping more than one copy of the data. In a production deployment, you would certainly want a higher replication factor; 3 is a good place to start.

Note

A few things at this point are worth noting about CQL's syntax:

It's syntactically very similar to SQL; as we further explore CQL, the impression of similarity will not diminish.
Double quotes are used for identifiers such as keyspace, table, and column names. As in SQL, quoting identifier names is usually optional, unless the identifier is a keyword or contains a space or another character that will trip up the parser.
Single quotes are used for string literals; the key-value structure we use for replication is a map literal, which is syntactically similar to an object literal in JSON.
As in SQL, CQL statements in the CQL shell must terminate with a semicolon.

Selecting a keyspace

Once you've created a keyspace, you would want to use it. In order to do this, employ the USE command:

USE "my_status";

This tells Cassandra that all future commands will implicitly refer to tables inside the my_status keyspace. If you close the CQL shell and reopen it, you'll need to reissue this command.

Filter reviews by

All

Feefo verified reviews

Amazon verified reviews

Amazon Customer Oct 07, 2017

The book covers Cassandra features and CQL very well. Book also touches on internal workings of Cassandra.Book doesn't touch any programming language API available for for communicating with Cassandra. It would have been good if one additional chapter has been dedicated to programming language API like Java.

Amazon Verified review

Jascha Casadio Dec 20, 2015

Sooner or later, every professional working with noSQL databases will meet a beauty called Cassandra, a distributed database management system initially developed by Facebook and now a top-level project at the Apache Foundation. Widely acclaimed for its features, Cassandra is a beast difficult to tame. Next to tech giants such as Facebook, Twitter and Ubisoft are a myriad of start-ups selling data driven products relying on Cassandra. The high demand of professionals able to properly set it up and tune it results in a lot of documentation available, plus a very active community ready to improve it and support new comers. Among the many titles available to us at the bookstore is Learning Apache Cassandra by Mat Brown, released early this 2015, a title that covers the latest features and targets beginners, be them professionals or simply curious people interested in a quick taste of what Cassandra is and what is it capable of.Spanning through less than 300 pages, Learning Cassandra is short but very intense book. By the time the reader gets to the back cover, he will feel like he has been exposed to a lot of information and, mainly, that he was gradually been taken, step by step, through the very basics of this technology, without any abrut change of topic, without feeling lost. The book written by Mat is easy to follow and comprehensive. Through a learn by doing approach, he does introduce each and every concepts, starting with an overview of what Cassandra is and how it differs from other solutions, up to topics such as lightweight transactions and collections, passing through the different data types. The different topics are presented to the reader through a unique example, an application that closely resembles some basic feature of Facebook. The examples are explained step by step. Nothing is taken for granted.A couple of things deserve a praise: the first, is the choice of the author to present Cassandra through CQL, rather than through the bindings of a specific programming language, which would make it less useful and interesting to most of us; the second is the many concepts presented first in SQL, then in Cassandra. Mat here beautifully shows the differences between a noSQL database and an old school relational database. The author often highlights the importance of denormalizing the data and to state the questions we want to answer first and then build our queries around them.Being a book for beginners, we must say that this title does not present any real world example. Similarly, Mat does use a test cluster made of a single machine, which is far from what we find in production. This rules out all the worries about replication, consensus and so on.Overall, this is by far the most friendly introduction to Cassandra that I have read so far. While it does not provide anything related to administration and configuration, as well as nothing about real world scenarios, it is a perfect companion to anyone interested in learning what Cassandra is and how to get it to work.As usual, you can find more reviews on my personal blog: http://books.lostinmalloc.com. Feel free to pass by and share your thoughts!

AJ Jun 22, 2015

Thanks to Mat, this book is awesome. I read almost all the books on Cassandra available, and somehow I missed this one. This happens to be my latest reading.One question I have for Mat on Pp 170 of the book. While the diagram seems to be correct, the explanation between the two diagrams on this page, for me, looks incorrect. I could be mistaken.In my opinion the explanation should say Alice is on Node3, Bob is on Node1 and Ivan is on Node2Am I missing something. If any one could clarify I would appreciate it.Five stars for this book and the author.

Amazon Customer Jan 21, 2016

I am completely new to Apache Cassandra. I just decided to learn it out of my own interest. I purchased the Kindle edition of this book 3 days back. Till now I have completed 5th chapter and I must say that after a long time I am reading such a good book.Contents are perfectly organized.Lots of real life examples and solutions.No unnecessary description or irrelevant data.Language and presentation style is so simple that beginners will love to read it.I am completely satisfied with this book.

Nagender Dec 25, 2016

Good book.

Learning Apache Cassandra: Build an efficient, scalable, fault-tolerant, and highly-available data layer into your application using Cassandra

What do you get with a Packt Subscription?

Horizontal scalability

High availability

Write optimization

Structured records

Secondary indexes

Efficient result ordering

Immediate consistency

Discretely writable collections

Relational joins

MapReduce

Comparing Cassandra to the alternatives

Description

What you will learn

Product Details

What do you get with a Packt Subscription?

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

About the author

FAQs