Cassandra uses a peer-to-peer architecture rather than a master-slave architecture, which is prone to single point of failure (SPOF) problems. Cassandra is deployed on multiple machines, with each machine acting as a node in a cluster. Data is autosharded, that is, automatically distributed across nodes using key-based sharding: the keys determine which node stores each piece of data. Each key-value data element in Cassandra is replicated on other nodes in the cluster (the replication factor is configurable; 3 is a common choice) for high availability and fault tolerance. If a node goes down, the data can be served from another node that holds a copy of the original data.
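The key-based placement and replication described above can be sketched with a small hash ring. This is a minimal illustration, not Cassandra's actual partitioner: the `Ring` class, the `token` helper, the node names, and the rule of "owner plus the next nodes clockwise" are all assumptions made for the example.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is used here only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical ring: every node owns one position derived from its name."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, key):
        """Return the nodes holding `key`: the owner, then the next nodes clockwise."""
        positions = [t for t, _ in self.tokens]
        # First node whose position is >= the key's token, wrapping around the ring.
        start = bisect.bisect_left(positions, token(key)) % len(self.tokens)
        count = min(self.replication_factor, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
owners = ring.replicas("user:42")

# If the first owner goes down, the remaining replicas can still serve the key.
survivors = [n for n in owners if n != owners[0]]
print(owners, survivors)
```

Because placement depends only on the key and the node positions, any node can compute the same replica list without consulting a master.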
Sharding is an old concept used for distributing data across different systems. Sharding can be horizontal or vertical. In horizontal sharding, in the case of an RDBMS, data is distributed on the basis of rows, with some rows residing on one machine and the remaining rows residing on other machines. Vertical sharding is similar to columnar storage, where columns can be stored separately in different locations.
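To make the difference concrete, here is a small sketch that shards the same rows both ways. The sample rows, the function names, and the choice of `id % shard_count` as the row-placement rule are assumptions for illustration only.

```python
rows = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Linus", "city": "Helsinki"},
    {"id": 3, "name": "Grace", "city": "New York"},
    {"id": 4, "name": "Alan", "city": "Manchester"},
]


def horizontal_shards(rows, shard_count):
    """Split by row: each shard holds complete rows for a subset of ids."""
    shards = [[] for _ in range(shard_count)]
    for row in rows:
        shards[row["id"] % shard_count].append(row)
    return shards


def vertical_shards(rows, column_groups, key="id"):
    """Split by column: each shard holds every row, but only some columns plus the key."""
    return [
        [{c: row[c] for c in (key, *group)} for row in rows]
        for group in column_groups
    ]


by_row = horizontal_shards(rows, 2)
by_column = vertical_shards(rows, [["name"], ["city"]])
```

In the horizontal case each machine can answer a query for a whole row on its own; in the vertical case a full row has to be reassembled from several shards using the shared key.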
The Hadoop Distributed File System (HDFS) uses data-volume-based sharding, where a single big file is split and distributed across multiple machines according to the block size. For example, if the block size is 64 MB, a 640 MB file will be split into 10 chunks and placed on multiple machines.
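The block arithmetic in this example can be checked with a few lines of code. The helper name `split_into_blocks` is hypothetical, and the sketch only computes block sizes; it does not model how HDFS chooses machines for each block.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of the given size is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller than the block size
    return sizes


blocks = split_into_blocks(640)
print(len(blocks))
```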
The same autosharding capability is used when new nodes are added to Cassandra: the new node becomes responsible for a specific key range of data. The details of which node holds which key ranges are coordinated and shared across the cluster using the gossip protocol. So, whenever a client wants to access a specific key, any node can locate the key and its associated data quickly, within a few milliseconds. When the client writes data to the cluster, the data is written to the nodes responsible for that key range. However, if a node responsible for that key range is down or unreachable, Cassandra uses a mechanism called Hinted Handoff, which lets another node in the cluster hold the data and write it to the responsible node once that node is back in the cluster.
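The idea behind Hinted Handoff can be sketched as follows. This is a simplified model under stated assumptions: the `Node` class, the `coordinator_write` and `replay_hints` functions, and the in-memory hint list are invented for the example and do not reflect Cassandra's internal implementation.

```python
class Node:
    """Hypothetical node with an in-memory store and a list of pending hints."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target_node, key, value) kept for replicas that were down

    def write(self, key, value):
        self.data[key] = value


def coordinator_write(coordinator, replica, key, value):
    """Write to the replica, or store a hint on the coordinator if the replica is down."""
    if replica.up:
        replica.write(key, value)
    else:
        coordinator.hints.append((replica, key, value))


def replay_hints(coordinator):
    """Deliver stored hints to replicas that are reachable again; keep the rest."""
    remaining = []
    for target, key, value in coordinator.hints:
        if target.up:
            target.write(key, value)
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining


a, b = Node("a"), Node("b")
b.up = False
coordinator_write(a, b, "k1", "v1")   # b is down, so a keeps a hint
pending_while_down = len(a.hints)
b.up = True
replay_hints(a)                        # b is back, so the hint is delivered
```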
Replicating data raises the concern of inconsistency, because replicas might hold different states for the same data. Cassandra uses mechanisms such as anti-entropy and read repair to solve this problem and synchronize data across the replicas. Anti-entropy is used at the time of compaction, a concept borrowed from Google BigTable. Compaction in Cassandra refers to the merging of SSTables; it helps optimize data storage and increases read performance by reducing the number of seeks across SSTables.
Another problem that compaction solves is handling deletion in Cassandra. Unlike in a traditional RDBMS, all deletes in Cassandra are soft deletes, which means that the records still exist in the underlying data store but are marked with a special flag so that they do not appear in query results. The records marked as deleted are called tombstone records. Major compactions handle these soft deletes, or tombstones, by removing them from the SSTables in the underlying file stores.
Cassandra, like Dynamo, uses a Merkle tree data structure to represent the data state at a column family level in a node. This Merkle tree representation is used during major compactions to find the differences in the data states across nodes and reconcile them.
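The merge-and-purge behavior of compaction can be illustrated with a toy model in which each SSTable is a dictionary of `key -> (timestamp, value)`. The `TOMBSTONE` marker and the `compact` function are assumptions for this sketch, and it ignores details such as any grace period before tombstones are dropped.

```python
TOMBSTONE = object()  # hypothetical marker standing in for a soft-deleted value


def compact(sstables):
    """Merge SSTables, keeping the newest version of each key and dropping tombstones."""
    newest = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    # A key whose newest version is a tombstone has been deleted, so it is purged.
    return {k: v for k, v in newest.items() if v[1] is not TOMBSTONE}


older = {"a": (1, "x"), "b": (1, "y"), "c": (1, "w")}
newer = {"a": (2, "z"), "b": (3, TOMBSTONE)}
merged = compact([older, newer])
```

After the merge, a read has one table to consult instead of two, the stale version of `a` is gone, and the deleted key `b` no longer occupies space.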
The Merkle tree, or hash tree, is a tree data structure in which every non-leaf node is labeled with the hash of its child nodes' labels, allowing efficient and secure verification of the contents of a large data structure.
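A compact sketch shows how two replicas could compare such trees and narrow down where they differ. The function names, the use of SHA-256, the rule of duplicating the last hash on odd-sized levels, and the requirement that both replicas have the same number of leaves are all assumptions of this example rather than a description of Cassandra's code.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels: levels[0] holds leaf hashes, levels[-1] holds the root."""
    level = [_h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # pad odd levels by repeating the last hash
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(levels_a, levels_b):
    """Walk down from the root, following only subtrees whose hashes differ."""
    if not levels_a[0]:
        return []
    suspects = [0]
    for depth in range(len(levels_a) - 1, 0, -1):
        children = []
        for i in suspects:
            if levels_a[depth][i] != levels_b[depth][i]:
                for child in (2 * i, 2 * i + 1):
                    if child < len(levels_a[depth - 1]):
                        children.append(child)
        suspects = children
    return [i for i in suspects if levels_a[0][i] != levels_b[0][i]]


replica_a = build_levels([b"k1=v1", b"k2=v2", b"k3=v3"])
replica_b = build_levels([b"k1=v1", b"k2=stale", b"k3=v3"])
out_of_sync = differing_leaves(replica_a, replica_b)
```

When the roots match, the replicas agree and nothing more needs to be exchanged; when they differ, only the mismatching branches are explored.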
Cassandra, like Dynamo, falls under the AP part of the CAP theorem and offers a tunable consistency level. Cassandra provides multiple consistency levels, as illustrated in the following table:
Operation | ZERO | ANY | ONE | QUORUM | ALL
--- | --- | --- | --- | --- | ---
Read | Not supported | Not supported | Reads from one node | Read from a majority of nodes with replicas | Read from all the nodes with replicas
Write | Asynchronous write | Writes on one node including hints | Writes on one node with commit log and Memtable | Writes on a majority of nodes with replicas | Writes on all the nodes with replicas
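The "majority" used by QUORUM, and the way read and write levels combine, can be worked out with simple arithmetic. The helper names below are hypothetical; the overlap rule is the usual one for replicated systems: a read is guaranteed to see the latest write when the number of replicas read plus the number of replicas written exceeds the replication factor.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set in at least one replica."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
quorum_both = read_sees_latest_write(q, q, rf)        # QUORUM reads, QUORUM writes
one_both = read_sees_latest_write(1, 1, rf)           # ONE reads, ONE writes
one_read_all_write = read_sees_latest_write(1, rf, rf)  # ONE reads, ALL writes
```

With a replication factor of 3, QUORUM means 2 replicas, so QUORUM reads paired with QUORUM writes overlap, whereas ONE paired with ONE trades that guarantee for lower latency.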
The following table summarizes the key features of Cassandra with respect to its origins in Google BigTable and Amazon Dynamo:
Feature | Cassandra implementation | Google BigTable | Amazon Dynamo
--- | --- | --- | ---
Architecture | Peer-to-peer architecture, ring-based deployment architecture | No | Yes
Data model | Multidimensional map (row, column, timestamp) -> bytes | Yes | No
CAP theorem | AP with tunable consistency | No | Yes
Storage architecture | SSTable, Memtables | Yes | No
Storage layer | Local filesystem storage | No | No
Fast reads and efficient storage | Bloom filters, compactions | Yes | No
Programming language | Java | No | Yes
Client programming language | Multiple languages supported: Java, PHP, Python, REST, C++, .NET, and so on. | Not known | Not known
Scalability model | Horizontal scalability; multi-node deployment rather than single-machine deployment | Yes | Yes
Version conflicts | Timestamp field (not a vector clock as usually assumed) | No | No
Hard deletes/updates | Data is always appended using the timestamp field; deletes/updates are soft appends and are cleaned asynchronously as part of major compactions | Yes | No
Cassandra packs the best features of two technologies proven at scale—Google BigTable and Amazon Dynamo. However, today Cassandra has evolved beyond these origins with new unique and enterprise-ready features such as Cassandra Query Language (CQL), support for collection columns, lightweight transactions, and triggers.