Preface
Apache Cassandra is one of the most popular NoSQL data stores. It is based on the research papers Dynamo: Amazon's Highly Available Key-value Store and Bigtable: A Distributed Storage System for Structured Data, and it combines the best features described in both. In general, NoSQL data stores can be classified into the following groups:
- Key-value data store
- Column-family data store
- Document data store
- Graph data store
Cassandra belongs to the column-family data store group. Cassandra's peer-to-peer architecture avoids a single point of failure in the cluster and makes it possible to distribute the nodes across racks or data centers. This makes Cassandra a linearly scalable data store. In other words, the greater your processing need, the more Cassandra nodes you can add to your cluster. Cassandra's multi-data center support makes it well suited to replicating data stores across data centers for disaster recovery, for high availability, for separating transaction processing from analytical environments, and for building resiliency into the data store infrastructure.
The basic data abstraction in Cassandra starts with a column, which consists of a name, a value, a timestamp, and an optional time-to-live (TTL) attribute. A row consists of a row key and a collection of sorted columns. A column family, or table, is a collection of rows. A keyspace is a collection of column families.
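The nesting described above can be pictured with a small in-memory model. This is only an illustrative sketch in Python, not Cassandra's actual storage engine or any driver API; the class names, the `put`, `row`, and `column_family` helpers, and the sample keys are all hypothetical and chosen purely for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Column:
    """A single column: name, value, timestamp, and an optional TTL."""
    name: str
    value: str
    timestamp: int              # write time (illustrative integer)
    ttl: Optional[int] = None   # optional time-to-live, in seconds


@dataclass
class Row:
    """A row: a row key plus a collection of columns."""
    key: str
    columns: Dict[str, Column] = field(default_factory=dict)

    def put(self, column: Column) -> None:
        # Hypothetical helper: store the column under its name.
        self.columns[column.name] = column

    def sorted_columns(self) -> List[Column]:
        # Return the columns sorted by column name.
        return [self.columns[name] for name in sorted(self.columns)]


@dataclass
class ColumnFamily:
    """A column family (table): a collection of rows."""
    name: str
    rows: Dict[str, Row] = field(default_factory=dict)

    def row(self, key: str) -> Row:
        # Fetch the row for this key, creating it on first use.
        return self.rows.setdefault(key, Row(key))


@dataclass
class Keyspace:
    """A keyspace: a collection of column families."""
    name: str
    column_families: Dict[str, ColumnFamily] = field(default_factory=dict)

    def column_family(self, name: str) -> ColumnFamily:
        # Fetch the column family by name, creating it on first use.
        return self.column_families.setdefault(name, ColumnFamily(name))


# Build a tiny keyspace -> column family -> row -> columns hierarchy.
keyspace = Keyspace("demo")
users = keyspace.column_family("users")
row = users.row("user-1")
row.put(Column("last_name", "Doe", timestamp=2))
row.put(Column("first_name", "Jane", timestamp=1))
row.put(Column("session", "abc", timestamp=3, ttl=3600))

print([c.name for c in row.sorted_columns()])
```

The model deliberately ignores everything a real node does (partitioning, replication, persistence, TTL expiry); it only shows how a keyspace contains column families, a column family contains rows, and a row holds its columns in sorted order.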
Cassandra 2.1 comes with many new features that make it a more powerful data store than ever before. The CQL keyword IF NOT EXISTS lets you check for the existence of an object before Cassandra creates a new one. Lightweight transactions and the batching of CQL commands give the user the ability to perform multistep atomic operations. Marking some columns in a column family as STATIC gives the user the ability to share data across all the rows of a given partition. User-defined data types let you model your data store very closely on real-world objects and on the objects used in applications written in object-oriented programming languages. Collection indexes may be used to index and query collection data types in Cassandra. Row cache improvements, changes to reads and writes, off-heap memtables, incremental node repair, and the new counter implementation all make Cassandra perform much better than its previous releases.
All the code samples used in this book are written for Cassandra 2.1.5, and all the examples follow the CQL 3.x specification. The pre-CQL, Thrift API-based Cassandra CLI is used to list the physical layout of the column families. An insight into the physical layout is very important because a wrong choice of partition key or primary key will result in insidious performance problems. As a best practice, create the column family, insert a couple of records, and then run the list command in the Cassandra CLI with the column-family name to see the physical layout.
"Design patterns" is a highly misinterpreted term in the software development community. In a very general sense, it refers to a set of solutions for some known problems in a very specific context. In this book, the term describes a pattern of using certain features of Cassandra to solve some real-world problems. Each such design pattern is also given a name so that it can be identified and referred to later. These pattern names may not be related at all to any similar-sounding design pattern names used in other contexts and in other software development paradigms.
Users love Cassandra because of its SQL-like interface, CQL, and because its features closely resemble those of an RDBMS even though the paradigm is totally new. Application developers love Cassandra because of the plethora of drivers available in the market, which let them write applications in their preferred programming language. Architects love Cassandra because they can store structured, semi-structured, and unstructured data in it. Database administrators love Cassandra because it comes with almost no maintenance overhead. Service managers love Cassandra because of the wonderful monitoring tools available in the market. CIOs love Cassandra because it gives value for their money. And, Cassandra works!