Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Mastering Apache Cassandra - Second Edition

You're reading from   Mastering Apache Cassandra - Second Edition Build, manage, and configure high-performing, reliable NoSQL database for your application with Cassandra

Arrow left icon
Product type Paperback
Published in Mar 2015
Publisher Packt
ISBN-13 9781784392611
Length 350 pages
Edition 1st Edition
Languages
Arrow right icon
Toc

Introduction to Cassandra

Quoting from Wikipedia:

"Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients."

Let's try to understand in detail what it means.

A distributed database

In computing, distributed means splitting data or tasks across multiple machines. In the context of Cassandra, it means that the data is distributed across multiple machines. It means that no single node (a machine in a cluster is usually called a node) holds all the data, but just a chunk of it. It means that you are not limited by the storage and processing capabilities of a single machine. If the data gets larger, add more machines. If you need more parallelism (ability to access data in parallel/concurrently), add more machines. This means that a node going down does not mean that all the data is lost (we will cover this issue soon).

If a distributed mechanism is well designed, it will scale with a number of nodes. Cassandra is one of the best examples of such a system. It scales almost linearly, with regard to performance, when we add new nodes. This means that Cassandra can handle the behemoth of data without wincing.

Note

Check out an excellent paper on the NoSQL database comparison titled, Solving Big Data Challenges for Enterprise Application Performance Management at http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf.

High availability

We will discuss availability in the next chapter. For now, assume availability is the probability that we query and the system just works. A high-availability system is one that is ready to serve any request at any time. High availability is usually achieved by adding redundancies. So, if one part fails, the other part of the system can serve the request. To a client, it seems as if everything works fine.

Cassandra is a robust software. Nodes joining and leaving are automatically taken care of. With proper settings, Cassandra can be made failure-resistant. This means that if some of the servers fail, the data loss will be zero. So, you can just deploy Cassandra over cheap commodity hardware or a cloud environment, where hardware or infrastructure failures may occur.

Replication

Continuing from the previous two points, Cassandra has a pretty powerful replication mechanism (we will see more details in the next chapter). Cassandra treats every node in the same manner. Data need not be written on a specific server (master), and you need not wait until the data is written to all the nodes that replicate this data (slaves). So, there is no master or slave in Cassandra, and replication happens asynchronously. This means that the client can be returned with success as a response as soon as the data is written on at least one server. We will see how we can tweak these settings to ensure the number of servers we want to have data written on before the client returns.

From this, we can derive that when there is no master or slave, we can write to any node for any operation. Since we have the ability to choose how many nodes to read from or write to, we can tweak it to achieve very low latency (read or write from one server).

Multiple data centers

Expanding from a single machine to a single data center cluster or multiple data centers is very simple compared to traditional databases where you need to make a plethora of configuration changes and watch replication. If you are planning to shard, it becomes a developer's nightmare. We will see later in this book that we can use this data center setting to make a real-time replicating system across data centers. We can use each data center to perform different tasks without overloading the other data centers. This is a powerful support when you do not have to worry whether users in Japan with a data center in Tokyo and users in the US with a data center in Virginia, are in sync or not.

These are just broad strokes of Cassandra's capabilities. We will explore more in the upcoming chapters. This chapter is about getting excited learning about Cassandra.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image