You're reading from Cassandra Design Patterns Build real-world, industry-strength data storage solutions with time-tested design methodologies using Cassandra

Product type Paperback

Published in Nov 2015

Publisher

ISBN-13 9781785285707

Length 168 pages

Edition 2nd Edition

Languages

SQL

Tools

Cassandra

Concepts

Design Patterns

Author (1):

Rajanarayanan Thottuvaikkatumana

Table of Contents (9) Chapters

Preface

1. Co-existence Patterns

2. RDBMS Migration Patterns

3. Cache Migration Patterns

4. CAP Patterns

5. Temporal Patterns

6. Analytics Patterns

7. Designing Applications

Index

Chapter 1. Co-existence Patterns

	"It's coexistence or no existence"
	--Bertrand Russell

Relational Database Management Systems (RDBMS) have been pervasive since the 1970s. It is very difficult to find an organization without an RDBMS somewhere in its solution stack. Huge efforts have gone into the standardization of RDBMS, so if you are familiar with one RDBMS, switching over to another is not a big problem: you remain in the same paradigm without any major shifts. Pretty much all RDBMS vendors offer a core set of features with standard interfaces and then add their own value-added features on top. There is a standardized language for interacting with an RDBMS, called Structured Query Language (SQL), and queries written against one RDBMS generally work in another without significant changes. From a skill set perspective, this is a big advantage because you need not learn and relearn new query language dialects as the products evolve. This standardization also makes migration from one RDBMS to another relatively painless. Many application designers build their applications in an RDBMS-agnostic way. In other words, the application works with multiple RDBMS: change a few properties in its configuration file, and it starts working with a different but supported RDBMS. Many software products support multiple RDBMS through configuration settings so that customers can use the RDBMS they prefer.
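The configuration-driven approach described above can be sketched in Python, whose DB-API 2.0 convention gives database drivers a common `connect`/`cursor`/`execute` shape. This is a minimal illustration, not how any particular product does it: the `config` dictionary, its keys, and the `open_connection` helper are hypothetical, and only the standard-library `sqlite3` driver is exercised here. Real drivers differ in their connection arguments and parameter styles, so a production application would need more than a one-line change.

```python
import importlib


def open_connection(config):
    """Open a DB-API 2.0 connection using the driver named in the config.

    `config` is a hypothetical settings dictionary, standing in for
    properties that an application might load from a configuration file.
    """
    driver = importlib.import_module(config["driver"])
    return driver.connect(**config["connect_args"])


# Hypothetical settings; pointing "driver" at another installed DB-API
# module (with matching connect_args) is the only change the code needs.
config = {
    "driver": "sqlite3",
    "connect_args": {"database": ":memory:"},
}

conn = open_connection(config)
cur = conn.cursor()

# Plain, widely supported SQL keeps the statements portable.
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name VARCHAR(50))")
cur.execute("INSERT INTO customer (id, name) VALUES (1, 'Alice')")
cur.execute("SELECT name FROM customer WHERE id = 1")
row = cur.fetchone()
conn.close()

print(row)
```

The application code never names a specific database product; the choice lives entirely in the settings, which is the property the paragraph above describes.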

In most RDBMS, a database schema organizes objects such as tables, views, indexes, stored procedures, and sequences into a logical group. Structured and related data is stored in tables as rows and columns. The primary key of a table uniquely identifies each row. There is a very strong theoretical foundation behind the way data is stored in a table.

A table consists of rows and columns. Columns define the fields, and each row holds the values for one record. Rows are also called records or tuples. Tuple calculus, which was introduced by Edgar F. Codd as part of the relational model, serves as the basis for SQL in this type of data model. Redundancy is avoided as much as possible. Wikipedia defines database normalization as follows:

"Database normalization is the process of organizing the attributes and tables of a relational database to minimize data redundancy."

Since the emphasis is on avoiding redundancy, related data is spread across multiple tables, which are joined together with SQL to present the data in various application contexts. Indexes defined on various columns of a table help with data retrieval and sorting, and unique indexes help maintain data integrity.
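As a small illustration of normalization and joins, the following sketch uses Python's standard-library `sqlite3` module. The `customer` and `purchase` tables and their sample rows are invented for this example. Each customer name is stored once, purchases refer to customers by key, and a join reassembles the combined view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized schema: customer details live in one table only, and each
# purchase points back to its customer through a foreign key.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    item        TEXT NOT NULL
);
-- An index on the join column helps retrieval by customer.
CREATE INDEX idx_purchase_customer ON purchase(customer_id);

INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO purchase VALUES (10, 1, 'book'), (11, 1, 'pen'), (12, 2, 'lamp');
""")

# The join brings the related data back together for the application.
rows = conn.execute("""
SELECT c.name, p.item
FROM customer AS c
JOIN purchase AS p ON p.customer_id = c.customer_id
ORDER BY c.name, p.item
""").fetchall()
conn.close()

for name, item in rows:
    print(name, item)
```

Renaming a customer touches exactly one row in `customer`, which is the benefit of avoiding redundancy; the cost is that every combined view requires a join.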

In recent years, the amount of data generated by applications has grown enormously, and traditional RDBMS have started showing their age. Most RDBMS were not able to ingest such varied types of data into their schemas. When data flows in at high speed, a traditional RDBMS often becomes the bottleneck: at such write rates, more nodes have to be added to the RDBMS cluster within a very short period of time, and SQL performance degrades once the RDBMS is distributed. In other words, as we entered the era of big data, RDBMS could not handle the three Vs of data: Volume, Variety, and Velocity.

Many RDBMS vendors came up with solutions for handling the three Vs of data, but these came at a huge cost. Software licensing, the sophisticated hardware required, and the related ecosystem needed to build a fault-tolerant solution stack started affecting the bottom line in a big way. New-generation Internet companies started looking for different solutions to this problem, and very specialized data stores, based on some popular research papers, began to emerge from these organizations and from open source communities. These data stores are generally termed NoSQL data stores, and each addresses very specific data storage and retrieval needs. Cassandra is one of the highly successful NoSQL data stores, and its abstractions have a good deal in common with those of a traditional RDBMS. This similarity comes in handy when an enterprise adopts Cassandra, because new users can relate what they know from RDBMS to Cassandra. From a logical perspective, Cassandra tables look similar to RDBMS tables to their users, even though the underlying structures of these tables are totally different. For this reason, Cassandra is a good fit to deploy alongside a traditional RDBMS to solve some of the problems that the RDBMS cannot handle.

The caveat is that, because RDBMS tables and Cassandra column families (also known as Cassandra tables) look so similar to end users, many users and data modelers try to use Cassandra in exactly the way an RDBMS schema is modeled and used, and they run into serious deployment issues. How do you prevent such pitfalls? At the outset, Cassandra may look like a traditional RDBMS data store, but it is not the same. The key is to understand the differences from both a theoretical and a practical perspective, and to follow the best practices prescribed by the creators of Cassandra.

Tip

In Cassandra, the terms "column family" and "table" are synonymous. The Cassandra Query Language (CQL) command syntax uses the term "table."

Why use Cassandra alongside an RDBMS? The answer lies in the limitations of RDBMS. Some of the obvious drivers are cost, the need to scale out, high-volume traffic, complex queries that slow down response times, and increasingly complex data types, and the list goes on. The most important reason for Cassandra to coexist with a legacy RDBMS is that you need to preserve the investments already made and make sure that the current applications keep working without any problems. So, protect your existing investments, make your future investments in a smart NoSQL store such as Cassandra, and follow a one-step-at-a-time approach.
