Cassandra Design Patterns

Chapter 1. Co-existence Patterns

	"It's coexistence or no existence"
	--Bertrand Russell

Relational Database Management Systems (RDBMS) have been pervasive since the '70s. It is very difficult to find an organization without an RDBMS in its solution stack. Huge efforts have gone into the standardization of RDBMS, so if you are familiar with one RDBMS, switching to another is not a big problem; you remain in the same paradigm. Pretty much all RDBMS vendors offer a core set of features with standard interfaces and then add their own value-added features on top. There is a standardized language for interacting with an RDBMS, called Structured Query Language (SQL), and queries written against one RDBMS generally work in another without significant changes. From a skill set perspective, this is a big advantage because you need not learn new dialects of the query language as the products evolve. It also makes migration from one RDBMS to another relatively painless. Many application designers build their applications in an RDBMS-agnostic way: change some properties in the application's configuration file, and the application starts working with a different but supported RDBMS. Many software products support multiple RDBMS in this way so that customers can use the RDBMS of their choice.

Mostly in RDBMS, a database schema organizes objects such as tables, views, indexes, stored procedures, sequences, and so on, into a logical group. Structured and related data is stored in tables as rows and columns. The primary key in a table uniquely identifies a row. There is a very strong theoretical background in the way data is stored in a table.

A table consists of rows and columns. Columns contain the fields, and rows contain the values of data. Rows are also called records or tuples. Tuple calculus, which was introduced by Edgar F. Codd as part of the relational model, serves as the basis for the Structured Query Language (SQL) used with this type of data model. Redundancy is avoided as much as possible. Wikipedia defines database normalization as follows:

"Database normalization is the process of organizing the attributes and tables of a relational database to minimize data redundancy."

Since the emphasis is on avoiding redundancy, related data is spread across multiple tables, and they are joined together with SQL to present data in various application contexts. Multiple indexes that may be defined on various columns in a table can help data retrieval, sorting needs, and maintaining data integrity.

In recent years, the amount of data generated by applications has grown enormously, and traditional RDBMS have started showing their age. Most RDBMS were not able to ingest the various types of data into their schemas. When data flows in quick succession, a traditional RDBMS often becomes the bottleneck. When data is written at such speed, it quickly becomes necessary to add more nodes to the RDBMS cluster, and SQL performance tends to degrade on a distributed RDBMS. In other words, as we entered the era of big data, RDBMS could not handle the three Vs of data: Volume, Variety, and Velocity.

Many RDBMS vendors came up with solutions for handling the three Vs of data, but these came at a huge cost. The software licensing, the sophisticated hardware required, and the related ecosystem for building a fault-tolerant solution stack started affecting the bottom line in a big way. New generation Internet companies started looking for different solutions to this problem, and very specialized data stores, based on some of the popular research papers, started coming out of these organizations and the open source communities. These data stores are generally termed NoSQL data stores, and they address very specific data storage and retrieval needs. Cassandra is one of the highly successful NoSQL data stores, and its abstractions resemble those of a traditional RDBMS in several ways. This similarity comes in handy when Cassandra is adopted by an enterprise because new users can relate what they see in Cassandra to what they know from RDBMS. From a logical perspective, Cassandra tables look similar to RDBMS tables to the users, even though the underlying structures of these tables are totally different. Because of this, Cassandra is a good fit to be deployed alongside a traditional RDBMS to solve some of the problems that the RDBMS is not able to handle.

The caveat here is that because RDBMS tables and Cassandra column families (also known as Cassandra tables) look similar to end users, many users and data modelers try to model and use Cassandra in exactly the same way as an RDBMS schema, and they run into serious deployment issues. How do you prevent such pitfalls? At the outset, Cassandra may look like a traditional RDBMS data store, but it is not the same. The key is to understand the differences from a theoretical as well as a practical perspective, and to follow the best practices prescribed by the creators of Cassandra.

Tip

In Cassandra, the terms "column family" and "table" are synonymous. The Cassandra Query Language (CQL) command syntax uses the term "table."

Why use Cassandra along with an RDBMS? The answer lies in the limitations of RDBMS. Some of the obvious ones are cost, the difficulty of scaling out, handling high-volume traffic, complex queries that slow down response times, and increasingly complex data types, and the list goes on. The most important reason for Cassandra to coexist with a legacy RDBMS is that you need to preserve the investments already made and make sure that the current applications keep working without any problems. So, protect your existing investments, make your future investments in a smart NoSQL store such as Cassandra, and follow a one-step-at-a-time approach.

A brief overview of Cassandra

Where do you start with Cassandra? The best place is to look at the new application development requirements and take it from there. Look at cases where there is a need to denormalize the RDBMS tables and keep together all the data items that would have been spread across tables if you were to design the same solution in an RDBMS. If an application writes a set of data items together into a data store, why would you want to separate them? There is no need to worry about redundancy. This is the NoSQL philosophy and the new way to look at data modeling in NoSQL data stores. Cassandra supports fast writes and reads. Initial versions of Cassandra had some performance problems, but a huge number of optimizations have gone into making the later versions perform much better for reads as well as writes. Consuming space is not a problem because secondary storage keeps getting cheaper. A word of caution: it is fine to write the data into Cassandra with whatever level of redundancy, but the data access use cases have to be thought through carefully before committing to a Cassandra data model. The data is stored on disk to be read at a later date. These reads have to be efficient, and they have to return the required data in the desired sort order.

In a nutshell, you should decide how you want to store the data and make sure that the store returns it in the desired sort order. There is no hard and fast rule for this. It is purely up to the application requirements. That is the other shift in the thought process.

Instead of thinking from a pure data model perspective, start thinking from the application's perspective. How is the data generated by the application? What are the read requirements? What are the write requirements? What response time is expected for each of the use cases? Depending on these aspects, design the data model. In the big data world, the application becomes the first class citizen and the data model leaves the driving seat in the application design. Design the data model to serve the needs of the applications.

In any organization, new reporting requirements come up all the time. The major challenge in generating reports is the underlying data store. In the RDBMS world, reporting is always a challenge. You may have to join multiple tables to generate even simple reports. Even though RDBMS objects such as views, stored procedures, and indexes may be used to get the desired data, the query plan behind the report is going to be very complex most of the time. The processing power consumed is another factor to consider when generating such reports on the fly. Because of these complexities, it is common to keep separate tables for reporting that contain data exported from the transactional tables. Martin Fowler emphasizes the need for separating reporting data from operational data in his article, Reporting Database. He states:

"Most Enterprise Applications store persistent data with a database. This database supports operational updates of the application's state, and also various reports used for decision support and analysis. The operational needs and the reporting needs are, however, often quite different - with different requirements from a schema and different data access patterns. When this happens it's often a wise idea to separate the reporting needs into a reporting database, which takes a copy of the essential operational data but represents it in a different schema".

This is a great opportunity to start with NoSQL stores such as Cassandra as a reporting data store.

Data aggregation and summarization are common requirements in any organization. They help to control data growth by storing only the summary statistics and moving the transactional data into archives. Often, the aggregated and summarized data is used for statistical analysis. On many websites, you can see a summary of your data instantaneously when you log in or when you perform transactions. Examples include the available credit limit on a credit card, the number of text messages available, and the remaining international call minutes in a mobile phone account. Making the summary accurate and easily accessible is a big challenge. Most of the time, data aggregation and reporting go hand in hand: the aggregated data is used heavily in reports, and aggregating in advance speeds up the queries to a great extent. In an RDBMS, it is always a challenge to aggregate data, and new requirements come up all the time. This is another place where you can start with NoSQL stores such as Cassandra.

Now, we are going to discuss some aspects of the denormalization, reporting, and aggregation of data using Cassandra as the preferred NoSQL data store.

Denormalization pattern

Denormalize the data and store them as column families in Cassandra. This is a very common practice in NoSQL data stores. There are many reasons why you might do this in Cassandra. The most important aspect is that Cassandra doesn't support joins between the column families. Redundancy is acceptable in Cassandra as storage is cheap, and this is more relevant for Cassandra because Cassandra runs on commodity hardware while many RDBMS systems need much better hardware specifications for the optimal performance when deployed in production environments. Moreover, the read and write operations are highly efficient even if the column families are huge in terms of the number of columns or rows. In the traditional RDBMS, you can create multiple indexes on a single table on various columns. But in Cassandra, secondary indexes are very costly and they affect the performance of reads and writes.

Motivations/solutions

In many situations, whenever a new requirement comes, if you think in the traditional RDBMS way, it will lead to many problems such as poor performance on read/write, long running processes, queries becoming overly complex, and so on. In this situation, one of the best approaches is to apply denormalization principles and design column families in Cassandra.

In the traditional RDBMS, the operational tables contain the data related to the current state of the entities and objects involved. So, maintaining lookups for preserving the integrity of the data is perfectly sensible. But when you have to maintain history, the concept of lookups will not work. For example, when you are generating a monthly bank account statement, the current statement should reflect the current address of the account holder. After the statement is generated, if the address of the account holder changes during the next reporting period, then the previous statements must not reflect the new address. They must have the old address, which was correct on the date that the statement was generated. In such situations, it does not make sense to keep a set of normalized tables for the historical data. The best thing to do at that time is to denormalize the data and maintain them in separate column families in Cassandra.
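The bank statement case can be sketched in CQL as follows. This is only an illustration: the account_statement column family, its columns, and the sample values are hypothetical names chosen for this sketch and do not appear elsewhere in this chapter. The point it shows is that the address is copied into each statement record when the statement is generated, rather than being looked up later.

```cql
-- Hypothetical column family: one partition per account,
-- one record per generated statement, newest statement first.
CREATE TABLE account_statement (
  account_id bigint,
  statement_date timestamp,
  holder_name text,
  mailing_address text,     -- snapshot of the address on the statement date
  closing_balance decimal,
  PRIMARY KEY (account_id, statement_date)
) WITH CLUSTERING ORDER BY (statement_date DESC);

-- The address is stored with the statement itself, so a later
-- change of address does not alter this historical record.
INSERT INTO account_statement
  (account_id, statement_date, holder_name, mailing_address, closing_balance)
VALUES
  (1001, '2015-05-31', 'Mark Thomas', '12 Old Street, Springfield', 2500.00);
```

Because each statement carries its own copy of the address, reading an old statement never requires a lookup against the current customer details.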

Complex queries are part of any system. In the RDBMS world, views or stored procedures are typically designed to shield the complexity of the queries from the end users who design data entry forms and generate reports. They are useful for running ad hoc queries, retrieving special sets of records, and so on. Even though this solves the complexity problem from an end user perspective, the real problem remains unsolved: when you run those queries or stored procedures, the complex joins of the tables still happen under the hood and on the fly. Because of this, the running time is long and the processing requirements are huge. In the NoSQL world, it is better to denormalize the data and maintain it in big column families in Cassandra.

Immutable transactions are good candidates for denormalization. An immutable transaction is one where, once the transaction record is written to the system, nothing about it is going to change in the future. Because such a record captures the state at the time of the transaction, the values it references in other tables can be carried with the record itself, even if those referenced records change later. The only use those transactions will have in the future is for read use cases. There are many real-life examples of this type, such as banking transaction records, weather station reading records, utility monitoring reading records, system monitoring records, and service monitoring records. Event records originating from event management systems are possible candidates for denormalization, but caution has to be exercised if the same record is updated when the event status changes. If the event management system generates a new event record for each state change of the same event, denormalization is a good fit. Capture these denormalized records in Cassandra.

Note

Requirements to boost performance are good situations in which denormalization may be applied. Many applications perform poorly when data is being written to the RDBMS. There is a strong possibility that this happens because a single transaction writes data into multiple tables and many indexes are defined on those tables. Careful analysis and proper characterization of the performance problems often identify data spread across multiple tables as the root cause. In such cases, denormalization of the data is the obvious option, and Cassandra is a good fit there.

Data modeling experts in the RDBMS world are typically not comfortable with denormalization because there is a general belief that data integrity is maintained by the RDBMS table design itself, along with other RDBMS features such as triggers and stored procedures. Data modeling in Cassandra is different. Here, along with the data model, all the application use cases that involve data manipulation are also taken into consideration, so maintaining data integrity becomes the responsibility of the applications as well. Denormalization is the norm, and the applications using the column families are expected to handle data integrity and make sure that the data is good.

Best practices

Denormalization must be done with the utmost care. Normalization avoids redundancy and promotes principles that maintain data integrity. When you denormalize, the only rule that is relaxed is the one about redundancy. Data integrity must be maintained for a successful application, even if the data is denormalized. With normalized tables in an RDBMS, primary key constraints, foreign key constraints, unique key constraints, and so on serve as watchdogs that maintain data integrity even if the applications don't care about it. Verification and validation happen at the RDBMS level. When moving to a NoSQL store such as Cassandra, many such goodies of the RDBMS are lost. So, it is the responsibility of the application designers to prevent insert anomalies, update anomalies, and delete anomalies. Even though Cassandra comes with lightweight transactions, most of the data integrity control measures have to be taken on the application's side. Cassandra security has to be used heavily to make sure that only the proper applications with the right credentials write data to the column families.
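As a rough illustration of where lightweight transactions can help, the following CQL sketch uses conditional writes to approximate a uniqueness check that an RDBMS would enforce with a unique key constraint. The customer_by_email column family and its columns are hypothetical and are used only for this example.

```cql
-- Hypothetical lookup column family keyed by e-mail address.
CREATE TABLE customer_by_email (
  email text PRIMARY KEY,
  customer_id bigint,
  customer_name text
);

-- Insert only if no record exists for this e-mail address.
-- The result includes an [applied] column telling the application
-- whether the write actually happened.
INSERT INTO customer_by_email (email, customer_id, customer_name)
VALUES ('mt@example.com', 1, 'Mark Thomas')
IF NOT EXISTS;

-- Update only if the record still belongs to the expected customer.
UPDATE customer_by_email
SET customer_name = 'Mark A. Thomas'
WHERE email = 'mt@example.com'
IF customer_id = 1;
```

Conditional writes cost more than plain writes, so they are better reserved for the few places where a check of this kind is really needed; the rest of the integrity rules still belong in the application.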

Example

Let's take the case of a very simple normalized relation from the RDBMS world, as shown in the following figure. There are two tables in the relationship. One stores the customer details and the other stores the order details of the customers. This is a one-to-many relation where every customer record in the Customer table may have zero or more order records in the Order table. The two tables are joined on CustomerId, which is the primary key of the Customer table.

Figure 1

In the Order table, CustomerId is a foreign key referring to the CustomerId of the Customer table. When you denormalize this into a Cassandra column family, it will look like the one given in the following figure. In the CustomerOrder column family, the primary key is a combination of CustomerId and OrderId. The CustomerId column becomes the partition key for this column family.

Figure 2

The denormalized Cassandra column family has all the fields of its normalized counterpart. The following script can be used to create the column family in Cassandra. The script performs the following sequence of activities:

Create the key space.
Create the column family.
Insert one record into the column family.

The reason one record is inserted into the column family is to demonstrate the difference between the physical layout of the rows stored in Cassandra and how the queries are returning the same records.

Tip

It is very important to make sure that the physical layout of the column family is as expected and see how the columns are getting stored. To view the physical layout of the records, the old Cassandra CLI [cassandra-cli] must be used. This Cassandra CLI is used throughout this book in the context where there is a need to view the physical layout of the data in the column families.

Whenever a new column family is defined in Cassandra, it is very important to have an understanding of the physical and the logical views, and this helps the characterization of the column family growth and other behaviors. The following script is to be executed in cqlsh:

CREATE KEYSPACE PacktCDP1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
USE PacktCDP1;
CREATE TABLE CustomerOrder (
  CustomerId bigint,
  OrderId bigint,
  CustomerName text static,
  Email text static,
  OrderDate timestamp, 
  OrderTotal float,
  PRIMARY KEY (CustomerId, OrderId)
  )
  WITH CLUSTERING ORDER BY (OrderId DESC);

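-- Note: CQL reads an integer timestamp literal as milliseconds since the epoch,
-- so 1433970556 is stored as a moment in January 1970. Use 1433970556000 if
-- 10 June 2015 is intended. The value is kept as-is here so that it matches
-- the Cassandra CLI output shown later in this example.
INSERT INTO CustomerOrder (CustomerId, OrderId, CustomerName, Email, OrderDate, OrderTotal) VALUES (1,1,'Mark Thomas', 'mt@example.com', 1433970556, 112.50);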

A great deal of attention needs to be given to choosing the primary key for a given column family. Conceptually, this is totally different from the RDBMS world. It is true that a primary key uniquely identifies a row, and it may be an individual column or a combination of multiple columns. The difference in Cassandra lies in the way a row is stored in the physical nodes of the cluster. The first component of the primary key is known as the partition key; in this example, it is the single column CustomerId. All the rows in a given column family with the same partition key are stored together on the same Cassandra node, and on the same set of replica nodes when the replication factor is greater than one. The commands in the script here are to be executed in the Cassandra CLI interface:

USE PacktCDP1;
list CustomerOrder;
Using default limit of 100
Using default cell limit of 100
RowKey: 1
=> (name=1:, value=, timestamp=1433970886092681)
=> (name=1:customername, value=4d61726b2054686f6d6173, timestamp=1433970886092681)
=> (name=1:email, value=6d74406578616d706c652e636f6d, timestamp=1433970886092681)
=> (name=1:orderdate, value=000000005578a77c, timestamp=1433970886092681)
=> (name=1:ordertotal, value=42e10000, timestamp=1433970886092681)

1 Row Returned.

In the output, take a look at the row key. In the preceding example, as per the primary key used, the CustomerId field is the row key. In other words, for every CustomerId, there will be one wide row. It is termed wide because all the records of a given CustomerId, which is the partition key, are stored in one row. One physical row of the Cassandra column family therefore contains many records. For the use cases for which the column family is designed, it is important to check whether a growing row is going to exceed the maximum number of columns (cells) that Cassandra allows in a single row, which is two billion. If it is, the design has to be looked at again and a more appropriate partition key has to be identified. At the same time, it is not economical to have column families with only very few rows and very few columns.
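One common way to keep a wide row from growing without bound is to add a bucketing column to the partition key. The following sketch assumes a hypothetical CustomerOrderByYear column family and an OrderYear column, neither of which is part of the example script in this chapter; it only illustrates what a more appropriate partition key could look like.

```cql
-- Hypothetical variant of CustomerOrder with a composite partition key.
-- The inner parentheses make (CustomerId, OrderYear) the partition key,
-- so each customer gets one wide row per year instead of one for all time.
CREATE TABLE CustomerOrderByYear (
  CustomerId bigint,
  OrderYear int,
  OrderId bigint,
  OrderDate timestamp,
  OrderTotal float,
  PRIMARY KEY ((CustomerId, OrderYear), OrderId)
) WITH CLUSTERING ORDER BY (OrderId DESC);

-- Reads must now supply every component of the partition key.
SELECT OrderId, OrderDate, OrderTotal
FROM CustomerOrderByYear
WHERE CustomerId = 1 AND OrderYear = 2015;
```

The trade-off is that a query spanning several years has to read several partitions, so the bucket size should follow the access patterns of the application.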

In a typical RDBMS table, the customer details will be in one or more tables. The transaction records such as order details will be in another table. But, here denormalization is applied, and data coming from those two logical entities are captured in one Cassandra column family.

The following CQL SELECT command gives the output in a human readable format:

SELECT * FROM CustomerOrder;

The output can be seen in this screenshot:

Figure 3
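In practice, reads against this column family would usually be restricted to one partition. The following queries are a small sketch of that, using the CustomerOrder column family created earlier; the LIMIT value and the OrderId bound are arbitrary numbers picked for illustration.

```cql
-- All orders of one customer. Because of CLUSTERING ORDER BY (OrderId DESC),
-- the records come back with the highest OrderId first.
SELECT OrderId, OrderDate, OrderTotal
FROM CustomerOrder
WHERE CustomerId = 1
LIMIT 10;

-- A slice of the same partition, using the clustering column.
SELECT OrderId, OrderDate, OrderTotal
FROM CustomerOrder
WHERE CustomerId = 1 AND OrderId > 100;
```

Both queries name the partition key, so Cassandra can serve them from a single wide row without scanning the whole column family.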

Reporting pattern

Design separate column families in Cassandra for the reporting needs. Keep the operational data in RDBMS and the reporting data in Cassandra column families. The most important reason why this separation of concerns is good practice is because of the tunable consistency feature in Cassandra. Depending on the use cases and for performance reasons, various consistency parameters may be used in the Cassandra column families and in the applications using Cassandra for read and write operations. When data ingestion happens in Cassandra, the consistency parameters have to be tuned to have fast writes.

The consistency parameters used for fast writes may not be suitable for fast reads. So, it is better to design separate column families for reporting purposes. From the same operational data, if various types of reports have to be created, it may be wise to create separate column families for these different reporting requirements. It is also a common practice to preprocess the operational data to generate fast reports. Historical reporting, data archival, statistical analysis, providing data feeds, inputs for machine learning algorithms such as recommendation systems and so on benefit a lot from accessing data from the Cassandra column families specifically designed for reporting.
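In cqlsh, the consistency level can be switched with the CONSISTENCY command, which gives a feel for this tuning. The sketch below assumes a hypothetical daily_sales_report column family; the levels ONE and QUORUM are shown only as examples, and the right choice depends on the replication factor and on the use case.

```cql
-- Hypothetical reporting column family, one partition per report date.
CREATE TABLE daily_sales_report (
  report_date timestamp,
  region text,
  total_sales decimal,
  PRIMARY KEY (report_date, region)
);

-- Favor write speed during ingestion: one replica has to acknowledge.
CONSISTENCY ONE;
INSERT INTO daily_sales_report (report_date, region, total_sales)
VALUES ('2015-06-10', 'EMEA', 18250.75);

-- Ask for agreement from a majority of replicas when reading the report.
CONSISTENCY QUORUM;
SELECT region, total_sales
FROM daily_sales_report
WHERE report_date = '2015-06-10';
```

Applications set the same levels through their driver, typically per statement, so ingestion code and reporting code can use different settings against different column families.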

Motivations/solutions

Coming from the RDBMS world, people rarely think about the reporting needs at the very beginning. The main reason is that SQL queries are flexible, and you can get pretty much any kind of report out of RDBMS tables because you can join them. So, application designers and data modelers focused on the data and the business logic first and thought about the reports later. Even though this strategy worked, it introduced lots of application performance problems, either toward the end of the application development or when the data volume grew beyond a certain limit. The best thing to do in these kinds of situations is to design separate Cassandra column families for the reporting needs.

In social media and real media applications commonly used by millions of users at a given time, reporting needs are huge, and the performance of those reports is paramount. For example, in a video streaming website, users post videos. Users follow other users. The followers like the videos posted by the users whom they are following. Now, take two important views in the website: the first gives the list of videos liked by a given user, and the second gives the list of users liking a given video. In the RDBMS world, it is fine to use one table to store the data items needed to generate these two reports. In Cassandra, it is better to define two column families, one for each report. You may be wondering why this can't be achieved by reading out of a single column family. The reason is that sort order matters in Cassandra: the records are stored in sorted order within a partition, and each report needs the data partitioned and sorted differently.
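The two views can be sketched as two column families that hold the same facts under different keys. The names videos_liked_by_user and users_liking_video, their columns, and the sample identifiers are hypothetical and serve only to illustrate the idea.

```cql
-- View 1: videos liked by a given user, most recent like first.
CREATE TABLE videos_liked_by_user (
  user_id bigint,
  liked_at timestamp,
  video_id bigint,
  PRIMARY KEY (user_id, liked_at, video_id)
) WITH CLUSTERING ORDER BY (liked_at DESC, video_id ASC);

-- View 2: users liking a given video, most recent like first.
CREATE TABLE users_liking_video (
  video_id bigint,
  liked_at timestamp,
  user_id bigint,
  PRIMARY KEY (video_id, liked_at, user_id)
) WITH CLUSTERING ORDER BY (liked_at DESC, user_id ASC);

-- Each report reads exactly one partition of its own column family.
SELECT video_id, liked_at FROM videos_liked_by_user WHERE user_id = 42;
SELECT user_id, liked_at FROM users_liking_video WHERE video_id = 7;
```

Every like is written to both column families, which is the redundancy that the denormalization pattern accepts in exchange for simple, single-partition reads.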

In Cassandra, it may be necessary to create different column families to produce different reports. As mentioned earlier, it is fine to write the same piece of data into multiple Cassandra column families. There is no need to panic, as the latest versions of Cassandra come with a batching capability for data manipulation operations, which helps to ensure data integrity. It may not be as flexible and powerful as what many RDBMS offer, but there are ways to do this in Cassandra. For example, take the case of a hypothetical Twitter-like application.

Users tweet, and the tweets have to be shown differently in the default view, in the listing by hashtag, in the user timeline, and so on. Assuming that Cassandra is being used for storing the tweets, you may design different Cassandra column families for materializing these different views. When a new tweet comes in, that record will be inserted into all these column families. To maintain data integrity, all these INSERT statements may be grouped into an atomic unit by enclosing them between the BEGIN BATCH and APPLY BATCH statements of CQL, as batches are atomic by default. Atomic here means that either all the statements in the batch are eventually applied or none are; a batch spanning several partitions is not isolated, so a reader may briefly see some of the inserts before the others.
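A minimal sketch of such a batch is given below. The tweets_by_user and tweets_by_hashtag column families, their columns, and the sample values are hypothetical, and a real application would have one column family per view that it needs.

```cql
-- Hypothetical column families, one per view of the same tweet.
CREATE TABLE tweets_by_user (
  user_id bigint,
  tweet_id timeuuid,
  body text,
  PRIMARY KEY (user_id, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC);

CREATE TABLE tweets_by_hashtag (
  hashtag text,
  tweet_id timeuuid,
  user_id bigint,
  body text,
  PRIMARY KEY (hashtag, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC);

-- One incoming tweet is written to both views as a single atomic batch.
-- The same tweet_id is used in both statements so that the records match.
BEGIN BATCH
  INSERT INTO tweets_by_user (user_id, tweet_id, body)
  VALUES (42, 50554d6e-29bb-11e5-b345-feff819cdc9f, 'Reading about #cassandra');
  INSERT INTO tweets_by_hashtag (hashtag, tweet_id, user_id, body)
  VALUES ('cassandra', 50554d6e-29bb-11e5-b345-feff819cdc9f, 42, 'Reading about #cassandra');
APPLY BATCH;
```

The batch is used here for keeping the views consistent with each other, not for speed; grouping unrelated writes into a batch generally does not make them faster.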

When it comes to reporting, RDBMS fails miserably in many use cases. This is seen when the report data is produced by many table joins and the number of records in these tables is huge. It is a common situation when complex business logic has to be applied to the operational data stored in the RDBMS before producing the reports. In such situations, it is better to create separate column families in Cassandra for the reporting needs. This may be done in two ways. The first method is the online way, in which the operational data is transformed into analytical or reporting data as it arrives and is stored in Cassandra column families. The second method is the batching way, in which the operational data is transformed into analytical or reporting data at regular intervals by a batch process, with business logic processors storing the results in Cassandra column families.

Predictive analytics or predictive modeling is very common these days in commercial and scientific applications. A huge amount of operational data is processed, sliced, and diced by data scientists using various machine learning algorithms to produce outputs for solving classification, regression, and clustering problems. These are highly calculation-intensive operations that deal with huge amounts of operational data. It is practically impossible to do these calculations on the fly in response to instantaneous requests from the users of the system. In this situation, the best course of action is to continually process the data and store the outputs in Cassandra column families.

Tip

There is a huge difference between reporting data and analytical data. The former is data retrieved from the data store as per the user's selection criteria. The latter is data that has to be processed to answer a user's question, such as "Why is there a budget overrun this year?" or "When did the forecast and the actuals start deviating?" Whenever such questions are asked, the analytical data is processed and a result is given.

Graphical representation of vital statistics is an important use case in many applications. These days, many applications let users customize and generate their own graphs and charts to a great extent. To make this happen, sophisticated graphing and charting software packages are available in the market. Many of these packages expect the data in a certain format in order to produce the graphical images. Most of the time, the operational data is not available in the specific format that these specialized packages need. In these situations, the best choice for an application designer is to transform the operational data to suit the graphing and charting requirements. This is a good opportunity to use separate Cassandra column families to store the data specifically for graphing and charting.

Operational and historical reporting are two different types of needs in many applications. Operational data is used to report the present, and historical data is used to report the past. Even in the operational reporting use cases, there are very good reasons to separate the reporting data to different Cassandra column families. In the historical reporting use cases, it is even more important because the data grows over a period of time. If the velocity of the operational data is very high, then the historical reporting becomes even more cumbersome. Bank account statements, credit card statements, payslips, telephone bills, and so on, are very good examples of historical reporting use cases.

In the past, organizations used to keep historical data in the system only as long as it was needed for compliance and governance requirements. Things have changed. Many organizations have started keeping the data indefinitely to provide value-added services, as storage keeps getting cheaper. Cassandra is a linearly scalable NoSQL data store. As your storage requirements grow, you can add more nodes to the cluster as and when you need them, without any downtime, and the cluster will start making use of the newly added nodes. Read operations are really fast, so reporting is a heavily used use case supported by Cassandra column families with the clever use of tunable consistency.

In old generation applications, operational data is archived for posterity and auditing purposes. Typically, after its decided lifetime, operational data is taken offline and archived so that the data growth does not affect the day-to-day operations of the system. The main reason why this archival is needed is the constraints of the data storage solutions and the RDBMS used. Clustering and scaling out an RDBMS-based data store is very difficult and extremely expensive. New generation NoSQL data stores such as Cassandra are designed to scale out and run on commodity hardware, so the need to take the data offline largely goes away. Design Cassandra column families to hold the data marked for archival, and they can be kept online forever. Watch the data growth and keep adding Cassandra nodes to the cluster as and when required.

The emergence of the cloud as the platform of choice and the proliferation of Software as a Service (SaaS) applications introduced one more complexity into application design: multitenancy. Multitenancy promotes the use of one instance of an application to cater to the needs of multiple customers. Most SaaS applications give their customers a good amount of customization in terms of features and reports. The service providers who host these SaaS applications face a new challenge of maintaining customer-specific data and reports. This is a good use case in which separate Cassandra column families can be used to maintain the customer-specific data needed for the tailor-made reports.

Financial exchanges, trading systems, mobile phone services, weather forecasting systems, airline reservation systems, and the like produce high-volume data and process it with subsecond response times for their end users. Naturally, the reporting needs of those applications are also huge, both in the number of records to be processed and in the complexity of the data processing required. In all these systems, separating operational data from reporting data is a very important requirement. Cassandra is a good fit in all these reporting use cases.

Data transformation is an important step in producing many reports. In enterprise application integration use cases, one application often has to provide data to another application in a certain format. XML and JSON are two important data exchange formats. In applications with a service-oriented architecture, whether they consume or produce services, data is required in specific formats. Whatever technology is used to perform the transformation, the volume of the data often makes it impractical to transform the data on demand in real time. Preprocessing is required in many situations to produce the data in specific formats. Even though an RDBMS supports data types such as BLOB and CLOB to store huge chunks of data, the limitations of the RDBMS often take effect. NoSQL data stores such as Cassandra are designed to handle sophisticated data types built from user-defined data types, and it is easy to use them to store the preprocessed data for future reporting purposes.

Providing data feeds to external systems is a very common requirement these days. This is a very effective mechanism for disseminating data asynchronously to the subscribers of the data through popular mechanisms such as RSS feeds. The data designated for the data feeds must be derived from the operational data. Cassandra column families may be designed to serve such requirements.

Best practices

Separation of operational and reporting data stores is a good idea in many cases, but care must be taken to check whether it violates any of the data integrity or business logic invariants in the system. Immutable data items are really good candidates for this separation because they are not going to change in the future. It is better to keep frequently changing data items in the operational data store itself. In Cassandra, if column families are used only for reporting purposes, care must be taken over how the data is loaded into them. Typically, these column families will be optimized for fast reads. If too many writes into those column families take place, the consistency parameters will have to be tuned very well so that the writes do not defeat the original purpose of creating those column families, which is to read data from them. It is very difficult to tune Cassandra column families to suit the needs of both very fast writes and very fast reads. One of these will need to be compromised. Since the original purpose of these column families is to provide fast reads, the speed of writes must be controlled. A more detailed treatment of tuning for fast reads and fast writes in Cassandra column families is given in the coming chapters of this book.

Example

Let's take the case of a normalized set of tables from an application using RDBMS, as shown in Figure 4. There are three tables in the relation. One stores the customer details, another stores the order details, and the third stores the order line items. Assume that these are operational tables and that the data size is huge. There is a one-to-many relation between the Customer table and the Order table. For every customer record in the Customer table, there may be zero or more order records in the Order table. There is a one-to-many relation between the Order table and the OrderItems table. For every order record in the Order table, there may be zero or more order item records in the OrderItems table.

Figure 4

Assume that the requirement is to create a Cassandra column family to generate a monthly customer order summary report. The report should contain one record for each customer containing the order total for that month. The Cassandra column family will look like the one given in Figure 5.

In the Cassandra column family MonthlyCustomerOrder, the combination of the CustomerId, OrderYear, and OrderMonth columns forms the primary key. The CustomerId column is the partition key. In other words, all the records for a given customer will be stored in one wide row of the column family.

Figure 5

Assuming that the key space is created using the scripts given in Figure 3, the following scripts will create only the required column family and then insert one record. Filling in the data in this column family may be done on a real-time basis or as a batch process. Since the data required for filling the Cassandra column family is not readily available from the RDBMS tables, preprocessing needs to be done to prepare the data that goes into this Cassandra column family:

USE PacktCDP1;
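-- One partition per customer (CustomerId); rows within it are clustered by
-- OrderYear and OrderMonth, newest first. CustomerName is static, so it is
-- stored once per partition rather than once per month.
CREATE TABLE MonthlyCustomerOrder (
  CustomerId bigint,
  OrderYear int,
  OrderMonth int,
  CustomerName text static,
  OrderTotal float,
  PRIMARY KEY (CustomerId, OrderYear, OrderMonth)
)
WITH CLUSTERING ORDER BY (OrderYear DESC, OrderMonth DESC);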

INSERT INTO MonthlyCustomerOrder (CustomerId, OrderYear, OrderMonth, CustomerName, OrderTotal) VALUES (1,2015,6,'George Thomas', 255.5);
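The preprocessing step mentioned above can be sketched as follows. This is a minimal illustration, not the book's own loader: it assumes the RDBMS tables have already been joined into flat rows of (customer id, customer name, order date, line-item amount), and the sample rows, the second customer, and the function name monthly_customer_totals are hypothetical. It only computes the summary records; writing them into Cassandra through a driver is left out.

```python
from collections import defaultdict
from datetime import date

# Hypothetical flattened rows, as if Customer, Order, and OrderItems had been
# joined in the RDBMS: (customer_id, customer_name, order_date, amount).
order_lines = [
    (1, "George Thomas", date(2015, 6, 3), 200.0),
    (1, "George Thomas", date(2015, 6, 17), 55.5),
    (1, "George Thomas", date(2015, 7, 2), 40.0),
    (2, "Mary Jones", date(2015, 6, 9), 120.25),
]


def monthly_customer_totals(lines):
    """Sum line-item amounts per (customer, year, month)."""
    totals = defaultdict(float)
    names = {}
    for customer_id, name, order_date, amount in lines:
        key = (customer_id, order_date.year, order_date.month)
        totals[key] += amount
        names[customer_id] = name
    # One record per customer per month, in the column order used by the
    # INSERT statement: CustomerId, OrderYear, OrderMonth, CustomerName, OrderTotal.
    return [
        (cid, year, month, names[cid], total)
        for (cid, year, month), total in sorted(totals.items())
    ]


summary = monthly_customer_totals(order_lines)
for record in summary:
    print(record)
```

Each resulting tuple corresponds to one row of MonthlyCustomerOrder. Whether this runs as a nightly batch or is triggered per order is a design choice that depends on how fresh the report needs to be.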

The following listing shows how the row in the Cassandra column family is physically stored. The commands given here are to be executed in the Cassandra CLI interface:

USE PacktCDP1;
list MonthlyCustomerOrder;
Using default limit of 100
Using default cell limit of 100
RowKey: 1
=> (name=2015:6:, value=, timestamp=1434231618305061)
=> (name=2015:6:customername, value=47656f7267652054686f6d6173, timestamp=1434231618305061)
=> (name=2015:6:ordertotal, value=437f8000, timestamp=1434231618305061)

1 Row Returned.
Elapsed time: 42 msec(s).
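The CLI prints cell values as raw hexadecimal, so the listing is easier to follow once the bytes are decoded. The sketch below is an illustration only; the helper names decode_float and decode_timestamp are hypothetical. It assumes that a CQL float is stored as a 4-byte big-endian IEEE 754 value and that the write timestamp is in microseconds since the Unix epoch, which is Cassandra's default.

```python
import struct
from datetime import datetime, timedelta, timezone


def decode_float(hex_value):
    """Decode a hex-encoded 4-byte big-endian IEEE 754 float cell."""
    return struct.unpack(">f", bytes.fromhex(hex_value))[0]


def decode_timestamp(micros):
    """Convert a write timestamp in microseconds since the epoch to UTC."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=micros)


# Values copied from the ordertotal cell in the listing above.
order_total = decode_float("437f8000")
written_at = decode_timestamp(1434231618305061)
print(order_total)
print(written_at.isoformat())
```

The decoded float is the OrderTotal supplied in the INSERT statement, and the timestamp is the time at which the row was written.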

The CQL SELECT command given in Figure 6 gives the output in a human-readable format:

SELECT * FROM MonthlyCustomerOrder;

Figure 6

Aggregation pattern

Design separate Cassandra column families to store the aggregated and summarized operational data. Aggregated data is used for various reporting and analytical purposes. Cassandra does not support joins between column families. CQL, as used in this chapter, also does not support the commonly seen SQL aggregation constructs such as GROUP BY and HAVING (later Cassandra releases add only a restricted GROUP BY over primary key columns). Because of these constraints, it is better to preprocess the operational data to do the aggregation and summarization, and to store the processed data in Cassandra column families. The lack of real-time aggregation in CQL can be turned into an advantage: Cassandra serves fast reads of already aggregated data while exploiting its highly scalable architecture.

Motivations/solutions

SQL on RDBMS provides a great deal of flexibility to store and retrieve data, apply computations, and perform aggregations and summarizations effortlessly. All of this works fine as long as the data volume is manageable. The moment the data volume crosses that threshold and there is a need to scale out to a distributed model, everything comes to a screeching halt. When the data is distributed across multiple RDBMS hosts, the queries and computations on top of it crawl even more. Because of these limitations, separating aggregated and operational data into separate tables became common practice. In this era of the Internet of Things (IoT), even aggregated and summarized data starts overflowing in no time. In such situations, it is a good idea to move this already processed, aggregated, and summarized data into Cassandra column families. Cassandra can handle large volumes of such data and provide very fast reads, even when the nodes are distributed across multiple racks and data centers.

Over a couple of decades, there has been a clear change in how data is aggregated. Originally, RDBMS table data was processed and aggregated through batch application processes. Then, the traditional batch processes gave way to divide-and-conquer methodologies such as Map/Reduce on Hadoop, but even then the aggregated data remained in separate RDBMS instances or in distributed filesystems such as the Hadoop Distributed File System (HDFS). HDFS-based storage was good in some use cases, but data access and reporting became difficult while the big data market was still maturing in terms of the available tools and applications. Now, NoSQL data stores such as Cassandra offer good interoperability with other applications, and they can be used as a highly scalable data storage solution.

Drill-down capability has been a very common feature in many applications for a long time. The user interface presents high-level aggregated and summarized data in the form of tables or graphs. When the user clicks on a link, a button, or a section of the graph, the application presents the associated data that was used to create the aggregation or summarization. Typically, there are multiple levels of drill-down, and to provide them, the data must be aggregated at different levels. All of these operations are computationally intensive and expensive. Cassandra is a good fit for storing this preprocessed data coming from the RDBMS tables. There are many data processing applications that make use of the multicore architecture of modern computers and do their tasks asynchronously. Even though an RDBMS performs well when scaled up by making use of the multiple processing cores and huge memory seen in modern hardware, as mentioned earlier, it does not perform well when scaled out, especially where there are multiple table joins. Proper use of these data processing tools in conjunction with Cassandra will provide great value in storing the aggregated and summarized data.

Many organizations sell the data generated from their applications. Depending on the sensitivity of the data and the potential dangers of violating data protection and privacy laws, data aggregation becomes a mandatory requirement. Often, this aggregated data for sale needs to be completely separated from the organization's live data. It comes with totally different access controls, even at the level of the hosting location. Cassandra is a good fit for this use case.

Marketing analytics uses lots of aggregation. For example, in the case of retail transactions happening in a store, a third-party marketing analytics organization will not be given the individual transaction records. Instead, the transaction data is aggregated, and all the personally identifiable data is masked or removed before the data is handed over for any analytical purposes. Consolidation, aggregation, and summarization are common needs here. Many organizations gather data from their own various marketing channels to generate a common view of their marketing efforts. Many organizations also find avenues for creating new applications and value-added services based on this aggregated data.

When new initiatives such as these come up, separation of concerns plays an important role, and often business incubation teams or research and development units take these initiatives to the next level. These are the times when teams really think out of the box and start using new technologies. They move away from the legacy technologies to exploit economies of scale. Exploration of new technologies happens when the legacy technologies have pain points, and when there is a need to reduce the cost incurred due to specialized hardware requirements and software license costs. It also happens when there is a totally new requirement that cannot be served by the ecosystem in use. Cassandra is a good fit in these use cases because many Internet-scale applications rely on Cassandra for heavy-duty data storage running on commodity hardware, thus providing value for money.

Data visualization products use a lot of aggregation. Many such products are plagued by using too much data. Clutter drives users away, and the information is lost in the abundance of data. Seeing these problems, many other products visualize aggregated data and provide drill-down or similar techniques in the visualization. Cassandra can be used to store multilevel aggregated data in its column families.

Data warehousing solutions are fast moving away from RDBMS to NoSQL data stores such as Cassandra. Data warehousing projects deal with huge amounts of data and do lots of aggregation and summarization. With huge amounts of data, scaling out beyond a single server is a mandatory requirement, and Cassandra fits very well there. Data warehousing solutions also need to support various data processing tools. There are many drivers available to connect to Cassandra, and many data processing and analytics tools such as Apache Spark work very well with Cassandra.

Online shopping sites generate lots of sales records. Many of them still use RDBMS as their preferred data store. It is practically impossible to generate a report that includes all these sales records, so even in basic reporting, aggregation plays a big role. This aggregated data is used for sales forecasting, trending, and further processing. NoSQL data stores such as Cassandra have become the preferred choice of many to store this aggregated data.

The proliferation of data products mandated the need to process data in a totally new way, with lots of transformations from one format to another. Aggregation and summarization have become part of all these processes. Here, even the traditional SQL-based RDBMS fails because the processing needs are beyond SQL's limited capabilities. The RDBMS fails on two counts: the first is the inability to process the data, and the second is the inability to store the processed data, which comes in totally different formats. Cassandra also fails on the first count, but it scores better on the second because it can store sophisticated data types and can scale out extensively. Detailed coverage of the Cassandra data types is given in the upcoming chapters of this book.

Best practices

When doing aggregation and storing the aggregated data in Cassandra, care must be taken in the drill-down use cases. The drill-down use cases use both the operational and the aggregated data, as Cassandra coexists with the existing RDBMS. When the operational data comes from traditional RDBMS tables and the aggregated data comes from the Cassandra data store, there is a good chance of tight coupling between application components. If the design is not thoughtfully crafted, application maintenance will be a nightmare.

Note

The word aggregation is used in a totally different context in NoSQL parlance. There, it refers to consolidating many related data items into a single unit that is stored in and retrieved from the NoSQL data store as a whole. Martin Fowler used this term in his article titled Aggregate Oriented Database, in which he describes aggregation in this way:

"Aggregates make natural units for distribution strategies such as sharding, since you have a large clump of data that you expect to be accessed together. An aggregate also makes a lot of sense to an application programmer. If you're capturing a screenful of information and storing it in a relational database, you have to decompose that information into rows before storing it away. An aggregate makes for a much simpler mapping - which is why many early adopters of NoSQL databases report that it's an easier programming model."

When this type of aggregation is used in Cassandra, take care not to store a large load of data items as a single blob.

The application logic must be carefully thought through to make sure that the operational and the aggregated data stay in sync. If they are out of sync, this will become very obvious in the drill-down use cases because the aggregate record will show one value and the details will show a different value.
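One way to catch such drift early is a periodic reconciliation job that recomputes the totals from the detail records and compares them with the stored aggregates. The sketch below is only an illustration under stated assumptions: the function name find_out_of_sync, the in-memory inputs, and the tolerance value are all hypothetical, and in a real system the two inputs would be read from the RDBMS and from the Cassandra column family respectively.

```python
import math


def find_out_of_sync(aggregates, details, tolerance=0.01):
    """Return keys whose stored aggregate differs from the recomputed total.

    aggregates: dict mapping a key to the stored total
    details: iterable of (key, amount) detail records
    """
    recomputed = {}
    for key, amount in details:
        recomputed[key] = recomputed.get(key, 0.0) + amount

    mismatches = {}
    for key in set(aggregates) | set(recomputed):
        stored = aggregates.get(key)
        actual = recomputed.get(key)
        # A key missing on either side also counts as out of sync.
        if stored is None or actual is None:
            mismatches[key] = (stored, actual)
        elif not math.isclose(stored, actual, abs_tol=tolerance):
            mismatches[key] = (stored, actual)
    return mismatches


# Hypothetical sample data keyed by (city, year, month).
stored_aggregates = {("Leeds", 2015, 6): 8500.0, ("London", 2015, 6): 8500.0}
detail_records = [
    (("Leeds", 2015, 6), 5000.0),
    (("Leeds", 2015, 6), 3500.0),
    (("London", 2015, 6), 8000.0),
]

mismatches = find_out_of_sync(stored_aggregates, detail_records)
print(mismatches)
```

The tolerance exists because totals stored as floating-point values rarely match a recomputed sum bit for bit. What the application should do with a mismatch, for example recompute the aggregate or raise an alert, is a separate design decision.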

Tip

It is good practice to always store data in Cassandra with a proper structure, even if the number of data items is large. Structured data is comparatively easier to manage than unstructured data.

Example

Let's take the case of a normalized set of tables from an application using RDBMS, as shown in Figure 7:

Figure 7

There are three tables in the relation. The first stores the customer details, the second stores the order details, and the third stores the order line items. Assume that these are operational tables and that the data size is huge. There is a one-to-many relation between the Customer and Order tables. For every customer record in the Customer table, there may be zero or more order records in the Order table. There is a one-to-many relation between the Order and OrderItems tables. For every order record in the Order table, there may be zero or more order item records in the OrderItems table.

Assume that the requirement is to aggregate the orders into a monthly city order summary report. The Cassandra column family will look like the one shown in Figure 8:

Figure 8

In the CityWiseOrderSummary column family, the combination of City, OrderYear, and OrderMonth forms the primary key. The City column is the partition key. In other words, all the order aggregate data related to a given city will be stored in a single wide row.

Assuming that the key space is created using the scripts given in the example of the Denormalization pattern, the scripts given here will create only the required column family and then insert a couple of records. Filling in the data in this column family may be done on a real-time basis or as a batch process. Since the data required for filling the Cassandra column family is not readily available from the RDBMS tables, preprocessing needs to be done to prepare the data that goes into this Cassandra column family:

USE PacktCDP1;
-- One partition per city; rows within it are clustered by OrderYear and
-- OrderMonth, newest first.
CREATE TABLE CityWiseOrderSummary (
  City text,
  OrderYear int,
  OrderMonth int,
  OrderTotal float,
  PRIMARY KEY (City, OrderYear, OrderMonth)
)
WITH CLUSTERING ORDER BY (OrderYear DESC, OrderMonth DESC);

INSERT INTO CityWiseOrderSummary (City, OrderYear, OrderMonth, OrderTotal) VALUES ('Leeds', 2015, 6, 8500);
INSERT INTO CityWiseOrderSummary (City, OrderYear, OrderMonth, OrderTotal) VALUES ('London', 2015, 6, 8500);

The following listing shows how the rows in the Cassandra column family are physically stored. The commands given here are to be executed in the Cassandra CLI interface:

USE PacktCDP1;
list CityWiseOrderSummary;
Using default limit of 100
Using default cell limit of 100
RowKey: London
=> (name=2015:6:, value=, timestamp=1434313711537934)
=> (name=2015:6:ordertotal, value=4604d000, timestamp=1434313711537934)

RowKey: Leeds
=> (name=2015:6:, value=, timestamp=1434313683491588)
=> (name=2015:6:ordertotal, value=4604d000, timestamp=1434313683491588)

2 Rows Returned.
Elapsed time: 5.17 msec(s).

The SELECT command given in Figure 9 gives the output in a human-readable format:

SELECT * FROM CityWiseOrderSummary;

Figure 9

Just as the city-wise order summary was created, if a state-wise order summary is also needed from the RDBMS tables given in Figure 7, a separate Cassandra column family will need to be created, and the appropriate application processing will need to be done to fill the data into that Cassandra column family.
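The same preprocessing code can feed both levels of summary. The following sketch assumes, hypothetically, that the RDBMS rows have been flattened to (city, state, order date, amount); the sample rows, the state values, and the function name summarize are placeholders for illustration and do not come from Figure 7.

```python
from collections import defaultdict
from datetime import date

# Hypothetical flattened order rows: (city, state, order_date, amount).
orders = [
    ("Leeds", "West Yorkshire", date(2015, 6, 2), 5000.0),
    ("Leeds", "West Yorkshire", date(2015, 6, 20), 3500.0),
    ("Bradford", "West Yorkshire", date(2015, 6, 11), 1200.0),
    ("London", "Greater London", date(2015, 6, 5), 8500.0),
]


def summarize(rows, level):
    """Total the order amounts per (level value, year, month).

    level is either "city" or "state" and selects the grouping column.
    """
    index = {"city": 0, "state": 1}[level]
    totals = defaultdict(float)
    for row in rows:
        order_date, amount = row[2], row[3]
        totals[(row[index], order_date.year, order_date.month)] += amount
    return dict(totals)


city_summary = summarize(orders, "city")    # rows for CityWiseOrderSummary
state_summary = summarize(orders, "state")  # rows for a state-wise column family
print(city_summary)
print(state_summary)
```

Each dictionary corresponds to one column family, keyed the same way as its primary key. Computing both levels in one pass over the same source rows helps keep the summaries consistent with each other, which matters for drill-down.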

