Aggregation pattern
Design separate Cassandra column families to store aggregated and summarized operational data. Aggregated data is used for various reporting and analytical purposes. Cassandra does not inherently support joins between column families, nor does it support the common SQL aggregation constructs such as GROUP BY, HAVING, and so on. Because of these constraints, it is better to preprocess the operational data, doing the aggregation and summarization up front, and store the processed data in Cassandra column families. The lack of real-time aggregation in CQL can thus be turned into an advantage: Cassandra serves fast reads of already aggregated data while exploiting its highly scalable architecture.
Motivations/solutions
SQL on RDBMS provides a great deal of flexibility to store and retrieve data, apply computations, and perform aggregations and summarizations effortlessly. All of this works fine as long as the data volume is manageable. The moment the data volume crosses that threshold and there is a need to scale out to a distributed model, everything comes to a screeching halt. When the data is distributed across multiple RDBMS hosts, the queries and computations on top of it crawl even more. Because of these limitations, separating aggregated data and operational data into their own tables became common practice. In this era of the Internet of Things (IoT), even aggregated and summarized data overflows in no time. In such situations, it is a good idea to move this already processed, aggregated, and summarized data into Cassandra column families. Cassandra can handle loads of such data and provide ultra-fast reads, even when the nodes are highly distributed across multiple racks and data centers.
Over the last couple of decades, there has been a clear change in how data is aggregated. Originally, RDBMS table data was processed and aggregated by batch application processes. Then, traditional batch processes gave way to divide-and-conquer methodologies such as MapReduce and Hadoop, but even then the aggregated data remained in separate RDBMS instances or in distributed filesystems such as the Hadoop Distributed File System (HDFS). HDFS-based storage was good for some use cases, but data access and reporting remained difficult even as the big data market matured in terms of available tools and applications. Now, NoSQL data stores such as Cassandra offer good interoperability with other applications and can be used as a highly scalable data storage solution.
The drill-down capability has been a very common feature in many applications for a long time. The user interface presents highly aggregated and summarized data in the form of tables or graphs. When the user clicks on a link, a button, or a section of a graph, the application presents the underlying data that was used to produce that aggregation or summarization. Typically, there are multiple levels of these drill-downs, and to provide them the data must be aggregated at different levels. All of these operations are computationally intensive as well as expensive. Cassandra is a good fit to store this preprocessed data coming from the RDBMS tables. There are many data processing applications that make use of the multicore architecture of modern computers and perform such tasks asynchronously. Even though an RDBMS performs well when scaled up to exploit the multiple cores and large memory of modern hardware, as mentioned earlier, it does not perform well when scaled out, especially where there are multiple table joins. Proper use of these data processing tools in conjunction with Cassandra provides great value in storing the aggregated and summarized data.
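As a minimal sketch of how such multilevel summaries could be laid out in Cassandra, the following two column families keep one aggregate per month and one per day for the same metric; the names and columns here are illustrative assumptions, not part of the worked example later in this chapter:

-- Month-level summary, presented first in the user interface
CREATE TABLE MonthlyOrderSummary (
    City text,
    OrderYear int,
    OrderMonth int,
    OrderTotal float,
    PRIMARY KEY (City, OrderYear, OrderMonth)
) WITH CLUSTERING ORDER BY (OrderYear DESC, OrderMonth DESC);

-- Day-level summary, read when the user drills down into a month
CREATE TABLE DailyOrderSummary (
    City text,
    OrderYear int,
    OrderMonth int,
    OrderDay int,
    OrderTotal float,
    PRIMARY KEY (City, OrderYear, OrderMonth, OrderDay)
) WITH CLUSTERING ORDER BY (OrderYear DESC, OrderMonth DESC, OrderDay DESC);

The application would first read MonthlyOrderSummary and, on a drill-down click, query DailyOrderSummary with the selected city, year, and month as the filter, so each level of the user interface is served by its own preaggregated column family.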
Many organizations sell the data generated by their applications. Depending on the sensitivity of the data and the potential dangers of violating data protection and privacy laws, aggregation becomes a mandatory requirement. Often, this aggregated data-for-sale needs to be completely separated from the organization's live data, with entirely different access controls, sometimes even down to the hosting location. Cassandra is a good fit for this use case.
Marketing analytics use a lot of aggregation. For example, in the case of retail transactions happening in a store, a third-party marketing analytics organization will not be given the individual transaction records. Instead, the transaction data is aggregated, and all personally identifiable data is masked or removed before being handed over for any analytical purposes. Consolidation, aggregation, and summarization are common needs here. Many organizations also gather data from their own various marketing channels to generate a common view of their marketing efforts, and many find new avenues for applications and value-added services based on this aggregated data. When initiatives such as these come up, separation of concerns plays an important role, and often business incubation teams or research and development units take them to the next level. These are the times when teams really think out of the box and start using new technologies, moving away from legacy technologies to exploit economies of scale. Exploration of new technologies happens when the legacy technologies have pain points, when there is a need to reduce the cost of specialized hardware and software licenses, or when there is a totally new requirement that cannot be served by the ecosystem in use. Cassandra is a good fit in these use cases because many Internet-scale applications already use it for heavy-duty data storage requirements on commodity hardware, thus providing value for money.
Data visualization products use a lot of aggregation. Many such products are plagued by using too much data: clutter drives users away, and the information is lost in the abundance. Seeing these problems, many products visualize aggregated data and provide drill-down or other similar techniques in the visualization. Cassandra can be used to store multilevel aggregated data in its column families.
Data warehousing solutions are fast moving away from RDBMS to NoSQL stores such as Cassandra. Data warehousing projects deal with huge amounts of data and do a lot of aggregation and summarization. When it comes to huge amounts of data, scaling out beyond a single server is a mandatory requirement, and Cassandra fits very well there. Data warehousing solutions also need to support various data processing tools. There are many drivers available in the market to connect to Cassandra, and many data processing and analytics tools such as Apache Spark work very well with it.
Online shopping sites generate lots of sales records, and many of them are still using RDBMS as their preferred data store. It is practically impossible to generate a report that includes all of these sales records, so even in basic reporting, aggregation plays a big role. This aggregated data is used for sales forecasting, trend analysis, and further processing. NoSQL data stores such as Cassandra have become the preferred choice of many to store this aggregated data.
The proliferation of data products has mandated processing data in totally new ways, with lots of transformations from one format to another; aggregation and summarization have become part of all these processes. Here, even the traditional SQL-based RDBMS fails, because the processing needs are beyond SQL's limited capabilities. The RDBMS fails on two counts: the inability to process the data, and the inability to store the processed data that comes in totally different formats. Cassandra also fails on the first count, but it scores better on the second because it can store very sophisticated data types and can scale out almost without limit. Cassandra data types are covered in detail in the upcoming chapters of this book.
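As a small, hypothetical illustration of the richer data types mentioned here (they are covered properly in the later chapters), a map collection can hold per-category totals alongside the overall aggregate in a single row; the names below are assumptions made for this sketch:

CREATE TABLE MonthlyCategorySummary (
    OrderYear int,
    OrderMonth int,
    OrderTotal float,
    CategoryTotals map<text, float>,  -- per-category breakup of the monthly total
    PRIMARY KEY (OrderYear, OrderMonth)
);

INSERT INTO MonthlyCategorySummary (OrderYear, OrderMonth, OrderTotal, CategoryTotals)
    VALUES (2015, 6, 17000, {'books': 9000, 'music': 8000});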
Best practices
When doing aggregation and storing the aggregated data in Cassandra, care must be taken with the drill-down use cases. Drill-down uses both the operational and the aggregated data, because Cassandra coexists with the existing RDBMS. When the operational data comes from traditional RDBMS tables and the aggregated data comes from Cassandra, there is a good chance of tight coupling between application components. If the design is not done properly and thoughtfully, application maintenance will be a nightmare.
Note
The word aggregation is used in a totally different context in NoSQL parlance, where it means consolidating many related data items into one single unit that is stored in, and retrieved from, the NoSQL data store as a whole. Martin Fowler used this term in his article titled Aggregate Oriented Database, where he describes it this way:
"Aggregates make natural units for distribution strategies such as sharding, since you have a large clump of data that you expect to be accessed together. An aggregate also makes a lot of sense to an application programmer. If you're capturing a screenful of information and storing it in a relational database, you have to decompose that information into rows before storing it away. An aggregate makes for a much simpler mapping - which is why many early adopters of NoSQL databases report that it's an easier programming model." When this type of aggregation is being used in Cassandra, care must be taken and don't store a big load of data items in as a blob.
The application logic must be carefully thought through in order to make sure that there is a proper sync-up between the operational and the aggregated data. If out of sync, this will become very obvious in the drill-down use cases because the aggregate record will show one value and the details will show a different value.
Tip
It is always good practice to store data in Cassandra with a proper structure, even if the number of data items is large. It is comparatively easier to manage structured data than unstructured data.
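To make this tip concrete, here is a minimal sketch, with invented names, contrasting the two approaches: the first column family packs the whole summary into one opaque text blob, while the second keeps each aggregated data item in its own typed column, which is much easier to query, validate, and evolve:

-- Discouraged: the entire monthly summary serialized into a single text value
CREATE TABLE MonthlySummaryBlob (
    City text,
    SummaryBlob text,
    PRIMARY KEY (City)
);

-- Preferred: every aggregated data item stored as a properly typed column
CREATE TABLE MonthlySummaryStructured (
    City text,
    OrderYear int,
    OrderMonth int,
    OrderCount int,
    OrderTotal float,
    PRIMARY KEY (City, OrderYear, OrderMonth)
);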
Example
Let's take the case of a normalized set of tables from an application using an RDBMS, as shown in Figure 7:
There are three tables in the relation: the first stores the customer details, the second stores the order details, and the third stores the order line items. Assume that these are operational tables and the data size is huge. There is a one-to-many relationship between the Customer and Order tables: for every customer record in the Customer table, there may be zero or more order records in the Order table. Similarly, there is a one-to-many relationship between the Order and OrderItems tables: for every order record in the Order table, there may be zero or more order item records in the OrderItems table.
Assume that the requirement is to aggregate the orders into a monthly, city-wise order summary report. The Cassandra column family will look like the following screenshot:
In the CityWiseOrderSummary column family, the combination of City, OrderYear, and OrderMonth forms the primary key. The City column is the partition key; in other words, all the order aggregate data related to a given city will be stored in a single wide row.
Assuming that the keyspace has been created using the scripts given in the example of the de-normalization pattern, the scripts given here create only the required column family and then insert a couple of records. Filling this column family may be done on a real-time basis or as a batch process. Since the data required for filling the Cassandra column family is not readily available from the RDBMS tables, preprocessing needs to be done to prepare the data that goes into it:
USE PacktCDP1;

CREATE TABLE CityWiseOrderSummary (
    City text,
    OrderYear int,
    OrderMonth int,
    OrderTotal float,
    PRIMARY KEY (City, OrderYear, OrderMonth)
) WITH CLUSTERING ORDER BY (OrderYear DESC, OrderMonth DESC);

INSERT INTO CityWiseOrderSummary (City, OrderYear, OrderMonth, OrderTotal)
    VALUES ('Leeds',2015,6,8500);
INSERT INTO CityWiseOrderSummary (City, OrderYear, OrderMonth, OrderTotal)
    VALUES ('London',2015,6,8500);
The following script shows how the rows in the Cassandra column family are physically stored. The commands given here are to be executed in the Cassandra CLI interface:
USE PacktCDP1;
list CityWiseOrderSummary;

Using default limit of 100
Using default cell limit of 100

RowKey: London
=> (name=2015:6:, value=, timestamp=1434313711537934)
=> (name=2015:6:ordertotal, value=4604d000, timestamp=1434313711537934)

RowKey: Leeds
=> (name=2015:6:, value=, timestamp=1434313683491588)
=> (name=2015:6:ordertotal, value=4604d000, timestamp=1434313683491588)

2 Rows Returned.
Elapsed time: 5.17 msec(s).
The SELECT command given in Figure 9 gives the output in a human readable format:
SELECT * FROM CityWiseOrderSummary;
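With the two rows inserted earlier, the output should look roughly like the following; the relative order of the two partitions depends on how their keys hash:

 city   | orderyear | ordermonth | ordertotal
--------+-----------+------------+------------
 London |      2015 |          6 |       8500
  Leeds |      2015 |          6 |       8500

(2 rows)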
Just as the city-wise order summary was created, if we also need a state-wise order summary from the RDBMS tables given in Figure 7, a separate Cassandra column family will have to be created, and the appropriate application processing will need to be done to fill the data into that column family.
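Assuming that the Customer table in Figure 7 carries a state column, such a column family could be sketched along the same lines as CityWiseOrderSummary; the following definition is an illustration of that idea rather than a script from the original example:

CREATE TABLE StateWiseOrderSummary (
    State text,
    OrderYear int,
    OrderMonth int,
    OrderTotal float,
    PRIMARY KEY (State, OrderYear, OrderMonth)
) WITH CLUSTERING ORDER BY (OrderYear DESC, OrderMonth DESC);

-- The preprocessing job would roll the same order data up by state and
-- insert one row per state, year, and month.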