You're reading from Real-Time Big Data Analytics Design, process, and analyze large sets of complex data in real time

Product type Paperback

Published in Feb 2016

ISBN-13 9781784391409

Length 326 pages

Edition 1st Edition

Languages

Java

Tools

Apache Spark

Concepts

Big Data

Author (1):

Shilpi Saxena

Table of Contents (12) Chapters

Preface

1. Introducing the Big Data Technology Landscape and Analytics Platform

2. Getting Acquainted with Storm

3. Processing Data with Storm

4. Introduction to Trident and Optimizing Storm Performance

5. Getting Acquainted with Kinesis

6. Getting Acquainted with Spark

7. Programming with RDDs

8. SQL Query Engine for Spark – Spark SQL

9. Analysis of Streaming Data Using Spark Streaming

10. Introducing Lambda Architecture

Index

Distributed databases (NoSQL)

We have discussed the paradigm shift, in the case of Hadoop, from moving data to the computation to moving the computation to the data. We understand at a conceptual level how to harness the power of distributed computation. The next step is to apply the same idea at the database level, in the form of distributed databases.

In very simple terms, a database is a storage structure that lets us store data in a structured format. Internally, it can use various representations, such as flat files, tables, blobs, and so on. When we talk about a database, we generally mean a single server-class node, or a tightly clustered set of such nodes, with huge storage and specialized hardware to support its operations. This can be envisioned as a single unit of storage controlled by a centralized control unit.

A distributed database, by contrast, is a database with no single control unit or storage unit. It is a cluster of homogeneous or heterogeneous nodes, and both the data and the control over execution and orchestration are distributed across all the nodes in the cluster. As an analogy, instead of all the data going into a single huge box, the data is now spread across multiple boxes. The distribution itself, the bookkeeping and auditing of where the data lives, and the retrieval process are managed by multiple control units, so there is no single point of control or storage. One important point is that these distributed nodes can be physical or virtual machines.
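
The "multiple boxes" analogy can be made concrete with a minimal sketch. It assumes the simplest possible placement rule, hashing the key and taking the remainder by the number of nodes; the class name PartitionedStore and its methods are hypothetical, and each node is only an in-memory map standing in for a real machine. Production systems typically use more elaborate placement schemes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each "node" is an in-memory map standing in for one box.
class PartitionedStore {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    PartitionedStore(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
    }

    // Decide which node owns a key; floorMod keeps the index non-negative
    // even when hashCode() is negative.
    int nodeFor(String key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    // Store the value on the node that owns the key.
    void put(String key, String value) {
        nodes.get(nodeFor(key)).put(key, value);
    }

    // Look the value up on the same node that put() chose.
    String get(String key) {
        return nodes.get(nodeFor(key)).get(key);
    }

    // Number of entries held by one node.
    int sizeOfNode(int index) {
        return nodes.get(index).size();
    }

    public static void main(String[] args) {
        PartitionedStore store = new PartitionedStore(3);
        store.put("user:1", "Alice");
        store.put("user:2", "Bob");
        store.put("user:3", "Carol");
        for (String key : new String[] {"user:1", "user:2", "user:3"}) {
            System.out.println(key + " -> node " + store.nodeFor(key) + " -> " + store.get(key));
        }
    }
}
```

Because the placement rule depends only on the key and the node count, any node or client can compute it independently, which is one way to avoid a central coordinator.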

Note

Do not confuse this with the concept of parallel systems, where the processors are tightly coupled and together constitute a single database system. A distributed database system is a relatively loosely coupled entity whose nodes share no physical components.

Now we understand the differentiating factors for distributed databases. However, it is necessary to understand that, due to their distributed nature, these systems carry some added complexity to ensure the correctness and accuracy of the data. There are two processes that play a vital role:

Replication: This is tracked by a special component of the distributed database. This piece of software does the tedious job of bookkeeping for all the updates, additions, and deletions made to the data. Once the changes are logged, it propagates them so that all copies of the data look the same and represent the state of truth.
Duplication: Distributed databases are popular because they have no single point of failure: more than one copy of the data is always available to deal with a failure if one occurs. The process that copies one instance of the data to multiple locations in the cluster generally runs at a defined interval.

Both these processes ensure that at a given point in time, more than one copy of data exists in the cluster, and all the copies of the data have the same representation of the state of truth for the data.
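
The bookkeeping described above can be sketched as a change log that is replayed on every copy. This is a minimal illustration, not how any particular database implements it: it assumes a single primary copy for simplicity, and the class ReplicatedStore and all of its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: one primary copy, several replicas, and a log of pending changes.
class ReplicatedStore {
    // One logged change; a null value marks a deletion.
    private static final class Change {
        final String key;
        final String value;

        Change(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, String> primary = new HashMap<>();
    private final List<Map<String, String>> replicas = new ArrayList<>();
    private final List<Change> pending = new ArrayList<>();

    ReplicatedStore(int replicaCount) {
        for (int i = 0; i < replicaCount; i++) {
            replicas.add(new HashMap<>());
        }
    }

    // Apply an addition or update to the primary and log it.
    void put(String key, String value) {
        Objects.requireNonNull(value, "value");
        primary.put(key, value);
        pending.add(new Change(key, value));
    }

    // Apply a deletion to the primary and log it.
    void delete(String key) {
        primary.remove(key);
        pending.add(new Change(key, null));
    }

    // Replay the logged changes, in order, on every replica, then clear the log.
    void replicate() {
        for (Map<String, String> replica : replicas) {
            for (Change change : pending) {
                if (change.value == null) {
                    replica.remove(change.key);
                } else {
                    replica.put(change.key, change.value);
                }
            }
        }
        pending.clear();
    }

    // True when every replica holds exactly the same data as the primary.
    boolean inSync() {
        for (Map<String, String> replica : replicas) {
            if (!replica.equals(primary)) {
                return false;
            }
        }
        return true;
    }

    String readFromReplica(int index, String key) {
        return replicas.get(index).get(key);
    }

    public static void main(String[] args) {
        ReplicatedStore store = new ReplicatedStore(2);
        store.put("order:1", "pending");
        store.put("order:1", "shipped");
        store.delete("order:1");
        store.put("order:2", "pending");
        System.out.println("In sync before replicate(): " + store.inSync());
        store.replicate();
        System.out.println("In sync after replicate(): " + store.inSync());
    }
}
```

Between calls to replicate(), the replicas can lag behind the primary; running it at a defined interval is what brings all copies back to the same state of truth.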

A NoSQL database environment is a non-relational and predominantly distributed database system. Its clear advantage is that it facilitates the rapid analysis of extremely high-volume, disparate data types. With the advent of Big Data, NoSQL databases have become a cheap and scalable alternative to the traditional RDBMS. Their unique selling points (USPs) are availability and fault tolerance, which are big differentiating factors.

NoSQL offers a flexible and extensible schema model, along with horizontal scalability, a distributed setup, and the freedom to work through non-SQL interfaces.

We can distinguish NoSQL databases as follows:

Key-value store: This type of database is among the least complex NoSQL options. Its USP is a design that allows data to be stored in a schemaless way. As the name suggests, every item in the store consists of an index key and an associated value. Popular examples of this type of database are DynamoDB, Azure Table Storage (ATS), Riak, Berkeley DB, and so on.
Column store or wide column store: This is designed to store data tables by column rather than by row, as a standard relational database does. In that sense it is the opposite of a row-based database; the design is highly scalable and offers very high performance. Examples are Cassandra, HBase, and Hypertable.
Document database: This extends the basic idea of a key-value store: the values are documents, which are more complex and elaborate. Each document has a unique ID associated with it, and this ID is used for document retrieval. Document databases are very widely used for storing and managing document-oriented information. Examples are MongoDB and CouchDB.
Graph database: As the name suggests, it's based on the graph theory of discrete mathematics. It's well suited to data where relationships can be maintained as a graph and elements are interconnected based on relations. Examples are Neo4j, polyglot, and so on.
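
To make the four categories easier to compare, here is a rough sketch of the shape of data each one holds, using plain Java collections only. The class DataModelShapes, its method names, and the sample records are all hypothetical, and real products layer indexing, persistence, and distribution on top of these shapes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the four NoSQL data model shapes as in-memory collections.
class DataModelShapes {
    // Key-value store: an opaque value looked up by its key.
    static Map<String, String> keyValue() {
        Map<String, String> store = new LinkedHashMap<>();
        store.put("session:42", "alice|2016-02-01");
        return store;
    }

    // Column store: each column keeps its values together, one entry per row.
    static Map<String, List<Object>> columnar() {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        columns.put("name", Arrays.<Object>asList("Alice", "Bob"));
        columns.put("age", Arrays.<Object>asList(30, 25));
        return columns;
    }

    // Document database: a unique ID maps to a document whose fields may differ.
    static Map<String, Map<String, Object>> documents() {
        Map<String, Object> first = new LinkedHashMap<>();
        first.put("name", "Alice");
        Map<String, Object> second = new LinkedHashMap<>();
        second.put("name", "Bob");
        second.put("tags", Arrays.asList("admin", "ops"));
        Map<String, Map<String, Object>> docs = new LinkedHashMap<>();
        docs.put("doc-1", first);
        docs.put("doc-2", second);
        return docs;
    }

    // Graph database: each element maps to the set of elements it is related to.
    static Map<String, Set<String>> graph() {
        Map<String, Set<String>> edges = new LinkedHashMap<>();
        edges.put("alice", new HashSet<>(Arrays.asList("bob", "carol")));
        edges.put("bob", new HashSet<>(Arrays.asList("carol")));
        return edges;
    }

    public static void main(String[] args) {
        System.out.println("Key-value: " + keyValue());
        System.out.println("Columnar:  " + columnar());
        System.out.println("Documents: " + documents());
        System.out.println("Graph:     " + graph());
    }
}
```

Note how reading every age in the columnar shape is a single lookup, whereas in the document shape it would mean visiting every document.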

The following table lays out the key attributes and certain dimensions that can be used for selecting the appropriate NoSQL databases:

Column 1: This captures the storage structure for the data model.
Column 2: This captures the performance of the distributed database on a scale of low, medium, and high.
Column 3: This captures the ease of scalability of the distributed database on a scale of low, medium, and high. It notes how easily the system can be scaled in terms of capacity and processing by adding more nodes to the cluster.
Column 4: This captures the flexibility of use, that is, the ability to cater to diverse structured or unstructured data and use cases.
Column 5: This captures how complex it is to work with the system in terms of development and modeling, operation and maintainability, and so on.

Data model	Performance	Scalability	Flexibility	Complexity
Key-value store	High	High	High	None
Column Store	High	High	Moderate	Low
Document Store	High	Variable (high)	High	Low
Graph Database	Variable	Variable	High	High
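
As a small illustration of using the table for selection, the sketch below encodes its rows as an enum and shortlists the data models whose ratings match a requirement exactly. The enum NoSqlModel and the shortlist method are hypothetical names; the ratings are copied from the table as plain strings, and exact string matching is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the rows of the table above, with ratings kept as plain strings.
enum NoSqlModel {
    KEY_VALUE("Key-value store", "High", "High", "High", "None"),
    COLUMN("Column store", "High", "High", "Moderate", "Low"),
    DOCUMENT("Document store", "High", "Variable (high)", "High", "Low"),
    GRAPH("Graph database", "Variable", "Variable", "High", "High");

    final String label;
    final String performance;
    final String scalability;
    final String flexibility;
    final String complexity;

    NoSqlModel(String label, String performance, String scalability,
               String flexibility, String complexity) {
        this.label = label;
        this.performance = performance;
        this.scalability = scalability;
        this.flexibility = flexibility;
        this.complexity = complexity;
    }

    // Return the models whose performance and flexibility ratings match exactly,
    // in the order the rows appear in the table.
    static List<NoSqlModel> shortlist(String performance, String flexibility) {
        List<NoSqlModel> matches = new ArrayList<>();
        for (NoSqlModel model : values()) {
            if (model.performance.equals(performance) && model.flexibility.equals(flexibility)) {
                matches.add(model);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        for (NoSqlModel model : shortlist("High", "High")) {
            System.out.println(model.label + " (complexity: " + model.complexity + ")");
        }
    }
}
```

A real selection would weigh the remaining columns and the use case as well; the point here is only that the table can be treated as data.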

Advantages of NoSQL databases

Let's have a look at the key reasons for adopting a NoSQL database over a traditional RDBMS. Here are the key drivers to which the shift has been attributed:

Advent and growth of Big Data: This is one of the prime forces driving the growth of NoSQL and the shift towards its use.
High availability systems: In today's highly competitive world, downtime can be deadly. The reality of business is that hardware failures will occur, but NoSQL database systems are built over a distributed architecture, so there is no single point of failure. They also use replication to keep redundant copies of the data, so the data stays available even if one or more nodes are down. With this mechanism, availability across data centers can be maintained even in the case of localized failures. All of this comes with horizontal scaling and high performance.
Location independence: This refers to the ability to perform read and write operations on the data store regardless of the physical location where that input-output operation actually occurs. Similarly, any write can be propagated out from the location where it was made. This is difficult to achieve in the RDBMS world. It is very handy when designing an application that serves customers in many different geographies and needs to keep data local for fast access.
Schemaless data models: One of the major motivators for moving to a NoSQL database system from an old-world relational database management system (RDBMS) is the ability to handle unstructured data, which is found in most NoSQL stores. The relational data model is based on strict relations defined between tables, which are themselves strictly defined by a fixed column structure. All of this is then organized in a schema. Structure is the backbone of an RDBMS and also its biggest limitation, because it falls short when handling and storing unstructured data that doesn't fit into a strict table structure. A NoSQL data model, by contrast, doesn't impose a fixed structure and is flexible enough to fit data in any form, which is why it's called schemaless. It's like one size fits all: it can accept structured, semistructured, or unstructured data. All this flexibility comes with a promise of low-cost scalability and high performance.
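
The contrast between a fixed schema and a schemaless model can be sketched in a few lines. Both classes below are hypothetical in-memory stand-ins, not real database APIs: StrictTable rejects any row whose fields differ from its declared columns, while SchemalessCollection accepts records of any shape.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch contrasting a fixed column structure with a schemaless collection.
class SchemaContrast {
    // Relational-style table: every row must have exactly the declared columns.
    static final class StrictTable {
        private final Set<String> columns;
        private final List<Map<String, Object>> rows = new ArrayList<>();

        StrictTable(String... columns) {
            this.columns = new HashSet<>(Arrays.asList(columns));
        }

        void insert(Map<String, Object> row) {
            if (!row.keySet().equals(columns)) {
                throw new IllegalArgumentException(
                        "Row fields " + row.keySet() + " do not match schema " + columns);
            }
            rows.add(new HashMap<>(row));
        }

        int size() {
            return rows.size();
        }
    }

    // Schemaless collection: any set of fields is accepted as-is.
    static final class SchemalessCollection {
        private final List<Map<String, Object>> records = new ArrayList<>();

        void insert(Map<String, Object> record) {
            records.add(new HashMap<>(record));
        }

        int size() {
            return records.size();
        }
    }

    // Build a map from alternating field names and values.
    static Map<String, Object> rowOf(Object... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("Expected field/value pairs");
        }
        Map<String, Object> row = new HashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            row.put((String) keyValues[i], keyValues[i + 1]);
        }
        return row;
    }

    public static void main(String[] args) {
        StrictTable table = new StrictTable("id", "name");
        table.insert(rowOf("id", 1, "name", "Alice"));
        try {
            table.insert(rowOf("id", 2, "nickname", "Bob"));
        } catch (IllegalArgumentException e) {
            System.out.println("Strict table rejected the row: " + e.getMessage());
        }

        SchemalessCollection collection = new SchemalessCollection();
        collection.insert(rowOf("id", 1, "name", "Alice"));
        collection.insert(rowOf("id", 2, "nickname", "Bob", "tags", Arrays.asList("new")));
        System.out.println("Schemaless collection holds " + collection.size() + " records");
    }
}
```

The flexibility has a cost that the sketch also hints at: with no schema to enforce it, the application itself has to cope with records that lack a field it expects.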

Choosing a NoSQL database

When it comes to making a choice, there are certain factors that can be taken into account, but the decision is still more use-case-driven and can vary from case to case. Migration of a data store or choosing a data store is an important, conscious decision, and should be made diligently and intelligently depending on the following factors:

Input data diversity
Scalability
Performance
Availability
Cost
Stability
Community